BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT = Bidirectional Encoder Representations from Transformers

Are masked autoencoders what is presented here? BERT's masked language modeling objective is closely related to denoising autoencoders, but BERT only predicts the masked tokens rather than reconstructing the entire input.

BERT is a bidirectional Transformer encoder. It is not a generative model: it uses only the encoder stack of the Transformer, with no decoder.
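A minimal sketch of using BERT as an encoder, assuming the Hugging Face `transformers` library (the model name `bert-base-uncased` is just an illustrative choice): it maps tokens to contextual vectors rather than generating text.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes whole sentences at once.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual hidden vector per input token:
# shape (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```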

BERT is pre-trained to predict masked tokens: a fraction of the input tokens (15%) is masked out, and the model must recover them from the surrounding context (the masked language modeling objective).
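A quick sketch of masked-token prediction, again assuming the Hugging Face `transformers` library is available:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT replaces a token with [MASK] and predicts it from both sides.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```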

They use bidirectional self-attention: every token can attend to tokens on both its left and right, unlike the left-to-right (causal) attention in GPT-style decoders.
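A minimal PyTorch sketch (not the paper's code) contrasting the causal mask a left-to-right decoder would use with the all-ones mask of BERT's bidirectional self-attention:

```python
import torch

seq_len = 5

# Decoder-style causal mask: token i only sees tokens <= i (lower triangle).
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# BERT-style bidirectional mask: every token sees every other token.
bidirectional_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(bidirectional_mask)
```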

Resources

Variants