BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT = Bidirectional Encoder Representations from Transformers

Are masked autoencoders what is presented here? BERT's masked language modeling objective is closely related to denoising autoencoders, but BERT only predicts the masked tokens rather than reconstructing the entire input.

BERT is a bidirectional Transformer encoder. It is not a generative model: it uses only the encoder stack of the Transformer, with no decoder.
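A minimal sketch of using BERT as an encoder, assuming the Hugging Face `transformers` library (the model name `bert-base-uncased` is just an illustrative choice): it maps tokens to contextual vectors rather than generating text.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes whole sentences at once.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual hidden vector per input token:
# shape (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```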

BERT is pre-trained to predict masked tokens: a fraction of the input tokens (15%) is masked out, and the model must recover them from the surrounding context (the masked language modeling objective).
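A quick sketch of masked-token prediction, again assuming the Hugging Face `transformers` library is available:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT replaces a token with [MASK] and predicts it from both sides.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```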

They use bidirectional self-attention: every token can attend to tokens on both its left and right, unlike the left-to-right (causal) attention in GPT-style decoders.
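A minimal PyTorch sketch (not the paper's code) contrasting the causal mask a left-to-right decoder would use with the all-ones mask of BERT's bidirectional self-attention:

```python
import torch

seq_len = 5

# Decoder-style causal mask: token i only sees tokens <= i (lower triangle).
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# BERT-style bidirectional mask: every token sees every other token.
bidirectional_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(bidirectional_mask)
```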

Resources

Variants