Teacher Forcing
I learned this term from the Deep Learning textbook; the Transformer paper does not mention it explicitly.
Teacher forcing is a procedure that emerges from the maximum likelihood criterion: during training, the model receives the ground-truth output y(t) as input at time t + 1, rather than its own prediction from time t.
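To make the definition above concrete, here is a minimal sketch of how one gold sequence could be split into a decoder input and a set of labels. The helper name `make_teacher_forcing_pair` and the token-list representation are assumptions made for illustration; they do not come from the textbook or the paper.

```python
# Hypothetical helper (an illustration, not a library API): split one gold
# sequence into the decoder input and the labels used under teacher forcing.
def make_teacher_forcing_pair(gold_tokens):
    if len(gold_tokens) < 2:
        raise ValueError("need at least a start token and an end token")
    decoder_input = gold_tokens[:-1]  # everything except the final token (<eos>)
    labels = gold_tokens[1:]          # everything except the first token (<bos>)
    return decoder_input, labels


gold = ["<bos>", "I", "like", "eating", "mushrooms", "<eos>"]
decoder_input, labels = make_teacher_forcing_pair(gold)

# At step t the decoder is fed the gold token decoder_input[t]
# (not its own earlier prediction) and is trained to predict labels[t].
for t, (fed, target) in enumerate(zip(decoder_input, labels)):
    print(f"t={t}: fed {fed!r} -> predict {target!r}")
```

The two lists have the same length and are offset by one position, so the label at each step is simply the next gold token.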
`
gold target: <bos> I like eating mushrooms <eos>
decoder_input: <bos> I like eating mushrooms
labels: I like eating mushrooms <eos>