Teacher Forcing
Learned this term from the Deep Learning textbook (Goodfellow, Bengio, and Courville), even though it is not explicitly mentioned in the Transformer paper.
“Teacher forcing” refers to a technique for training recurrent neural networks (RNNs) in which the actual, observed output from the previous time step is fed as input to the network when predicting the current time step’s output, rather than the network’s own prediction from the previous step.
Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y(t) as input at time t + 1.
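A minimal sketch of what this looks like for an RNN training step, assuming a PyTorch `nn.RNNCell`; the model, dimensions, and token ids here are all illustrative, not from the textbook:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 100, 32, 64  # toy sizes
embed = nn.Embedding(vocab_size, embed_dim)
rnn_cell = nn.RNNCell(embed_dim, hidden_dim)
readout = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def training_step(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (seq_len,) gold token ids, including <bos> and <eos>."""
    h = torch.zeros(1, hidden_dim)
    loss = torch.tensor(0.0)
    for t in range(len(tokens) - 1):
        # Teacher forcing: the input at step t+1 is the *ground truth*
        # token y(t), never the model's own prediction from step t.
        x = embed(tokens[t].unsqueeze(0))
        h = rnn_cell(x, h)
        logits = readout(h)
        loss = loss + loss_fn(logits, tokens[t + 1].unsqueeze(0))
    return loss / (len(tokens) - 1)

loss = training_step(torch.tensor([0, 5, 6, 7, 8, 1]))  # toy token ids
loss.backward()
```

In a transformer decoder the same idea shows up as a one-step shift between the decoder input and the labels, as in the example below.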
gold_target:   <bos> I like eating mushrooms <eos>
decoder_input: <bos> I like eating mushrooms
labels:              I like eating mushrooms <eos>
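The shift above is easy to build directly from the gold sequence; a sketch, where `BOS`, `EOS`, and the token ids are made up for illustration:

```python
BOS, EOS = 0, 1
gold = [BOS, 5, 6, 7, 8, EOS]  # <bos> I like eating mushrooms <eos>

decoder_input = gold[:-1]      # <bos> I like eating mushrooms
labels        = gold[1:]       # I like eating mushrooms <eos>

# Every position attends only to ground-truth prefixes, so all time
# steps can be trained in parallel with a single forward pass.
```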
Also see the mention of this in the World Models paper (Ha & Schmidhuber, 2018).
It is common across sequential prediction problems: any autoregressive model trained with the maximum likelihood criterion, whether an RNN or a transformer decoder, is typically trained this way.