Teacher Forcing

I learned this term from the Deep Learning textbook; the Transformer paper does not mention it explicitly.

Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y(t) as input at time t + 1.

```
gold target:      <bos> I like eating mushrooms <eos>
decoder_input:    <bos> I like eating mushrooms
labels:                 I like eating mushrooms <eos>
```
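
The shift between `decoder_input` and `labels` above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function name `make_teacher_forcing_pair` is hypothetical (it does not come from any library), and the tokens are kept as strings rather than vocabulary ids.

```python
# Hypothetical helper for illustration; not part of any library.
def make_teacher_forcing_pair(gold_tokens):
    """Split a gold sequence into (decoder_input, labels), offset by one step."""
    if len(gold_tokens) < 2:
        raise ValueError("need at least a start and an end token")
    decoder_input = gold_tokens[:-1]  # everything except the last token (<eos>)
    labels = gold_tokens[1:]          # everything except the first token (<bos>)
    return decoder_input, labels


gold = ["<bos>", "I", "like", "eating", "mushrooms", "<eos>"]
decoder_input, labels = make_teacher_forcing_pair(gold)

# At step t the model is fed the gold prefix decoder_input[:t + 1]
# (never its own earlier predictions) and is trained to predict labels[t].
for t, target in enumerate(labels):
    print(decoder_input[: t + 1], "->", target)
```

The point of the sketch is that the input at step t + 1 is always the ground-truth token from step t, so a wrong prediction at one step does not change what the model is fed at later steps during training.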