Large Language Models (LLMs)
Large Language Models are Transformer-based language models scaled to hundreds of billions of parameters, trained on internet-scale text.
The progression BERT (2018) → GPT-3 (2020) → ChatGPT (2022) shows how bigger models, more data, and alignment post-training transformed NLP.

BERT (Google, Oct 2018), Encoder-Only
Architecture: encoder-only Transformer, 110M (Base) / 340M (Large) parameters. Special tokens [CLS] (classification aggregate) and [SEP] (separator).
Pre-training objectives (self-supervised):
- Masked Language Modeling: predict masked-out tokens from bidirectional context. Static masking (fixed mask per example).
- Next Sentence Prediction: given two sentences, predict whether the second follows the first.
Fine-tuning: replace the MLM/NSP head with a task-specific head; continue training on labeled data (e.g., SQuAD).
Data: BooksCorpus + Wikipedia.
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
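A minimal Python sketch of the 80/10/10 masking policy above (toy tokens, no real tokenizer). The static-vs-dynamic distinction RoBERTa changes below is just whether you call this once at preprocessing time or fresh every epoch:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]  # toy vocabulary

def mlm_mask(tokens, p=0.15, seed=None):
    """Select ~15% of positions as prediction targets; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok            # model must recover the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)
            # else: leave the token as-is
    return inputs, targets

# Static masking (BERT): run once at preprocessing, reuse every epoch.
# Dynamic masking (RoBERTa): rerun with a fresh seed each epoch.
print(mlm_mask("the cat sat on the mat".split(), seed=0))
```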
RoBERTa (Facebook/UW, Jul 2019)
“Robustly Optimized BERT”, same architecture, better training recipe:
- Remove NSP objective (didn’t help)
- Dynamic masking (re-mask each epoch)
- 10× more data (16 GB → 160 GB: CC-News, OpenWebText, Stories)
- 15× more tokens (500k steps × batch 8000 vs BERT’s 1M × 256)
Outperforms BERT on GLUE. Paper: RoBERTa.
GPT Family, Decoder-Only
Pre-training objective: next-token prediction, i.e. maximum likelihood on $\sum_t \log p(x_t \mid x_{<t})$. Autoregressive (left-to-right), unlike BERT’s bidirectional MLM.
Intuition
All you’re doing is predicting the next token given everything before. The surprising thing is that scaling this one loss produces reasoning, translation, code, explanations: emergence from compression under a pure likelihood objective. To predict the next word of a physics proof well, the model effectively has to learn physics.
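A PyTorch sketch of that single loss: shift the targets by one position and take cross-entropy (= average negative log-likelihood of the next token):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Autoregressive LM loss: position t's logits predict token t+1.
    logits: (batch, seq_len, vocab), tokens: (batch, seq_len)."""
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # drop last position
    target = tokens[:, 1:].reshape(-1)                     # drop first token
    return F.cross_entropy(pred, target)  # mean negative log-likelihood

# Toy check with random logits over a 10-token vocab.
logits = torch.randn(2, 8, 10)
tokens = torch.randint(0, 10, (2, 8))
print(next_token_loss(logits, tokens))  # ~log(10) ≈ 2.3 for random logits
```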
| Model | Year | Params | Data | Note |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | BooksCorpus | Pretrain + task fine-tuning |
| GPT-2 | 2019 | 1.5B | WebText (40 GB) | Zero-shot, “English: Hello. French: ” |
| GPT-3 | 2020 | 175B | Filtered CC (570 GB) + WebText2, Books, Wiki | In-context learning |
| ChatGPT | 2022 | undisclosed (GPT-3.5 class) | GPT-3.5 + RLHF | Consumer breakthrough |
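What “autoregressive” means operationally, as a sketch of greedy decoding; `model` here is a hypothetical callable standing in for any GPT-style network:

```python
import torch

@torch.no_grad()
def greedy_generate(model, ids, max_new_tokens=20, eos_id=None):
    """Feed the growing sequence back in each step and append the
    argmax of the last position's logits. `model` is any callable
    mapping (1, seq) token ids -> (1, seq, vocab) logits."""
    for _ in range(max_new_tokens):
        logits = model(ids)                     # (1, seq, vocab)
        next_id = logits[0, -1].argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)  # grow the context
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids

# Toy stand-in that always predicts token 3, just so the loop runs:
toy = lambda ids: torch.nn.functional.one_hot(
    torch.full(ids.shape, 3), num_classes=10).float()
print(greedy_generate(toy, torch.tensor([[1, 2]]), max_new_tokens=3))
```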
Evaluation Metrics
Perplexity: exponentiated average negative log-likelihood, $\mathrm{PPL} = \exp\!\big(-\tfrac{1}{N}\sum_{t=1}^{N} \log p(x_t \mid x_{<t})\big)$. Lower is better. Equivalently, the geometric-mean inverse probability of the tokens.
Read it as “the effective branching factor.” PPL = 20 means the model is about as confused as if it had to pick uniformly among 20 equally-likely next tokens at every step. PPL = 1 is perfect prediction; PPL = vocab size is total ignorance.
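The same definition in a few lines of Python, checking both anchor points (uniform-over-20 and perfect prediction):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood); equivalently the
    geometric-mean inverse probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Uniform over 20 choices at every step -> "branching factor" of 20.
print(perplexity([math.log(1 / 20)] * 5))  # 20.0
# Perfect prediction (probability 1 everywhere) -> PPL of 1.
print(perplexity([0.0] * 5))               # 1.0
```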
Benchmarks: GLUE (BERT era), SuperGLUE, LAMBADA (last-word prediction), MMLU, HELM.
Scaling Laws (Chinchilla, 2022)
Three levers: parameter count $N$, dataset size $D$ (tokens), compute $C$ (FLOPs). Rule of thumb: $C \approx 6ND$. (Each token through each parameter: 2 FLOPs forward + 4 FLOPs backward.)
For fixed $C$, which $(N, D)$ minimizes loss?
- Kaplan et al. 2020: $N \propto C^{0.73}$, $D \propto C^{0.27}$ → scale parameters faster than data
- Chinchilla (Hoffmann et al. 2022): $N \propto C^{0.5}$, $D \propto C^{0.5}$ → scale equally. Corrects Kaplan; most prior models were undertrained
You have a fixed FLOPs budget. A huge model that’s barely trained wastes parameters it never updated; a small model on infinite data plateaus because it lacks capacity. Chinchilla says: the sweet spot is equal scaling, roughly 20 tokens per parameter.
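A worked example under the two rules above ($C \approx 6ND$ and $D \approx 20N$): substituting gives $C \approx 120N^2$, so $N = \sqrt{C/120}$. Plugging in Chinchilla’s own budget recovers its published size:

```python
import math

def chinchilla_optimal(C):
    """Given a FLOPs budget C, solve C = 6*N*D together with the
    ~20 tokens/parameter rule D = 20*N, i.e. N = sqrt(C / 120)."""
    N = math.sqrt(C / 120)
    D = 20 * N
    return N, D

# Chinchilla itself: ~5.76e23 FLOPs -> ~70B params, ~1.4T tokens.
N, D = chinchilla_optimal(5.76e23)
print(f"N ≈ {N / 1e9:.0f}B params, D ≈ {D / 1e12:.1f}T tokens")
```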
Takeaway: train a smaller model on more data. Paper: Training Compute-Optimal Large Language Models.
Chain of Thought (Kojima et al. 2022)
Appending “Let’s think step by step” after the question makes the model produce intermediate reasoning steps and get more answers right: zero-shot CoT. Few-shot CoT, which puts worked reasoning demonstrations in the prompt (Wei et al. 2022), actually preceded it. Paper: Large Language Models are Zero-Shot Reasoners.
A forward pass is fixed compute per token. Reasoning problems need more compute than one token affords, so the model cheats: it writes its scratch work into the context and then reads it back. Each intermediate step is extra serial compute.
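A sketch of the two-stage zero-shot CoT prompting from the paper; `llm` is any prompt-to-completion callable, and the canned stand-in exists only so the example runs:

```python
def zero_shot_cot(llm, question):
    """Stage 1: append the trigger phrase and let the model emit its
    reasoning. Stage 2: feed the reasoning back and extract the answer.
    `llm` is any prompt -> completion callable."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = llm(prompt)
    return llm(f"{prompt}{reasoning}\nTherefore, the answer is")

# Canned stand-in "model" so the example runs end to end:
def canned(prompt):
    return " 5." if "Therefore" in prompt else " 3 + 2 = 5 cars."

print(zero_shot_cot(canned, "Ann has 3 cars and buys 2 more. How many?"))
```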
Instruction Tuning
Problem: pretrained models continue text, don’t answer questions.
- Prompt: “Write a poem about ML.”
- Raw GPT generation: “Write a short story about data science. Write an essay about neural networks.” (continues the list!)
Fix: fine-tune on (instruction, response) pairs:
FLAN (Google, 2021): instruction-tune a pretrained 137B LaMDA-PT model on dozens of NLP tasks rephrased as instructions → generalizes to unseen tasks (Flan-T5 followed in 2022). Paper: Finetuned Language Models are Zero-Shot Learners (FLAN).
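For concreteness, one way a single (instruction, response) pair might be rendered into a training sequence; the template below is illustrative, not FLAN’s exact one:

```python
example = {
    "instruction": "Write a poem about ML.",
    "response": "Gradients descend like falling rain...",
}

def render(ex):
    """Concatenate instruction and response into one training sequence;
    fine-tuning then applies the usual LM loss (often only on the
    response tokens)."""
    return (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['response']}")

print(render(example))
```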
RLHF (InstructGPT → ChatGPT)
Problem: instruction-tuning makes the model respond to instructions, but not necessarily in the way humans want (safety, helpfulness, honesty).
Three-step pipeline:
- Supervised fine-tuning (SFT): fine-tune on human-written (prompt, response) pairs
- Reward model: for each prompt, sample completions, have humans rank them, derive pairwise comparisons. Train $r_\theta$ on $\mathcal{L}(\theta) = -\mathbb{E}_{(x,\,y_w,\,y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$, where:
  - $y_w$ is the winning response
  - $y_l$ is the losing response
- PPO (Proximal Policy Optimization) to maximize reward while staying close to the SFT model: $\max_\pi\; \mathbb{E}_{y \sim \pi}\big[r_\theta(x, y)\big] - \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{SFT}}\big)$. The KL term is the “alignment tax”; it prevents the RL model from drifting too far from coherent language (both objectives are sketched in code below)
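A minimal PyTorch sketch of the two objectives above; real PPO adds clipping, advantages, and a value baseline, so this only shows the shape of each term:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_w, r_l):
    """Pairwise preference loss: -log sigmoid(r(x, y_w) - r(x, y_l)),
    averaged over comparisons."""
    return -F.logsigmoid(r_w - r_l).mean()

def kl_penalized_reward(r, logp_policy, logp_sft, beta=0.1):
    """Reward the policy maximizes: reward-model score minus
    beta * (log-prob ratio), a sampled estimate of KL(policy || SFT)."""
    return r - beta * (logp_policy - logp_sft)

# Toy check: the winner scoring above the loser gives a small loss.
print(reward_model_loss(torch.tensor([2.0]), torch.tensor([0.5])))  # ≈ 0.20
```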
Intuition
The reward model learns to imitate human preference from pairwise rankings (Bradley-Terry: $P(y_w \succ y_l) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$). PPO then treats the model as a policy and climbs that reward, while the KL penalty tethers it to the SFT language model so it doesn’t collapse into reward-hacking gibberish. Without the tether, the policy finds the shortest token sequence that scores high and forgets how to write.
Paper: InstructGPT.
Slides from CS480 lec19.
Related Concepts
Models:
- Gemma
- GPT / BERT / RoBERTa / Chinchilla (see above)
How many epochs? Often just one: LLMs typically train in the “high-data” regime where each token is seen once. See https://www.reddit.com/r/LocalLLaMA/comments/1ae0uig/how_many_epochs_do_you_train_an_llm_for_in_the/