Parameter Golf Challenge
Challenge by OpenAI. I'm going to hyper-obsess over this.
The main thing is achieving good compression.
Some side ideas
There's this talk from Ilya Sutskever, "An Observation on Generalization", which argues that LLMs are really just glorified compression machines, and that the reason they work so well is that the objective itself is cross-entropy.
There are 2 main axes:
- Speeding up training (since we are limited to 10 mins of training on an 8xH100 cluster)
- Improving model architecture to get better compression while remaining under 16MB
Running the baseline train_gpt.py on my RTX5090:
RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

step:967/20000 val_loss:2.3008 val_bpb:1.3627 train_time:600152ms step_avg:620.63ms
stopping_early: wallclock_cap train_time:600152ms step:967/20000
peak memory allocated: 10255 MiB reserved: 10834 MiB
Serialized model: 67224983 bytes
Code size: 47686 bytes
Total submission size: 67272669 bytes
Serialized model int8+zlib: 12225991 bytes (payload:17178912 raw_torch:17224025 payload_ratio:3.91x)
Total submission size int8+zlib: 12273677 bytes
final_int8_zlib_roundtrip val_loss:2.3043 val_bpb:1.3647 eval_time:21564ms
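The int8+zlib packing reported in these logs can be sketched roughly as below. This is a hypothetical reimplementation for illustration only (per-tensor absmax scaling into int8, a JSON header for shapes/scales, zlib over the raw int8 payload); the actual train_gpt.py serializer isn't shown here and may differ.

```python
# Hypothetical sketch of int8 + zlib weight serialization; names and
# format are illustrative, not the real train_gpt.py code.
import json
import zlib

import numpy as np


def pack_int8_zlib(state):
    """Quantize each float tensor to int8 with a per-tensor absmax scale,
    then zlib-compress the concatenated int8 payload."""
    payload, meta = bytearray(), []
    for name, w in state.items():
        w = np.asarray(w, dtype=np.float32)
        # Absmax scale maps the largest magnitude to 127; guard all-zero tensors.
        scale = float(np.abs(w).max()) / 127.0 or 1.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        meta.append({"name": name, "shape": list(w.shape), "scale": scale})
        payload += q.tobytes()
    header = json.dumps(meta).encode()
    # Layout: 4-byte header length, JSON header, compressed int8 payload.
    return len(header).to_bytes(4, "little") + header + zlib.compress(bytes(payload), 9)


def unpack_int8_zlib(blob):
    """Inverse of pack_int8_zlib: dequantize back to float32 arrays."""
    hlen = int.from_bytes(blob[:4], "little")
    meta = json.loads(blob[4:4 + hlen])
    raw = zlib.decompress(blob[4 + hlen:])
    state, offset = {}, 0
    for m in meta:
        n = int(np.prod(m["shape"]))
        q = np.frombuffer(raw, dtype=np.int8, count=n, offset=offset)
        state[m["name"]] = q.astype(np.float32).reshape(m["shape"]) * m["scale"]
        offset += n
    return state
```

The round-trip is lossy (quantization error is at most half a quantization step per weight, i.e. `scale / 2`), which is why the post-roundtrip val_loss above drifts slightly from the original.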
final_int8_zlib_roundtrip_exact val_loss:2.30429686 val_bpb:1.36473439

Change #1: Sliding Window
This is a common eval idea popularized by the Hugging Face perplexity guide: https://huggingface.co/docs/transformers/perplexity#calculating-ppl-with-fixed-length-models.
The problem: Given a long sequence of tokens, we want to evaluate the performance of our LLM. However, our model only accepts a maximum context window size, so we need to split the sequence up.
Naively, we can chunk the sequence into non-overlapping windows and evaluate each window independently. For example, with a max context size of 1024 tokens, we would:
- Feed tokens [0...1023], and score positions [0...1023]
- Feed tokens [1024...2047], and score positions [1024...2047]
- Feed tokens [2048...3071], and score positions [2048...3071]

You can see that the average context length each scored token gets is about sequence length / 2: tokens at the beginning of a chunk get no context, while tokens at the end get the full window.
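The sequence-length/2 claim is just an average over chunk offsets; a quick arithmetic check:

```python
# With naive chunking and window W, the token at offset i within a chunk
# is predicted from i tokens of context, so the mean context length is
# (0 + 1 + ... + (W - 1)) / W = (W - 1) / 2, roughly W / 2.
W = 1024
mean_context = sum(range(W)) / W
print(mean_context)  # 511.5
```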
However, we can do better. Position 1024 should ideally be scored with a full window of context, i.e. by feeding tokens [1...1024]; but in this setup it sits at the very start of its chunk and gets context from only a single token.
We can address this by using a sliding window.
- Feed tokens [0...1023], score positions [0...1023]
- Feed tokens [1...1024], but only score position [1024]
- Feed tokens [2...1025], but only score position [1025]
- etc.

In this setup, we just throw away the loss for the early positions of each window.
For this challenge, people have been using a stride of 64.
The naive implementation used window = 1024, stride = 1024, and took ~20s.
Why stride = 64?
Ideally, we use a stride of 64 so that each window only scores its last 64 positions:
- Feed tokens [0...1023], but only score positions [960...1023] (the last 64)
- Feed tokens [64...1087], but only score positions [1024...1087] (the last 64)
- Feed tokens [128...1151], but only score positions [1088...1151] (the last 64)
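The strided schedule above is just index arithmetic. A minimal sketch (a hypothetical helper, not the actual eval code) that yields, for each forward pass, which tokens to feed and from which position to keep the loss:

```python
def sliding_window_spans(seq_len, window=1024, stride=64):
    """Return (feed_start, feed_end, score_from) triples: feed
    tokens[feed_start:feed_end] to the model and keep the loss only for
    positions [score_from, feed_end). stride == window recovers naive
    non-overlapping chunking."""
    spans, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans
```

Every position gets scored exactly once, and each pass after the first scores `stride` new positions with at least `window - stride` tokens of context. The cost trade-off: stride = 64 needs ~16x the forward passes of naive stride = 1024, while stride = 1 would need ~1024x.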
Change #2: FP8 Training
The original model