Bits Per Byte (BPB)

BPB = loss × (tokens / bytes) / ln(2)

  • loss is the model's mean per-token cross-entropy loss (in nats)
  • tokens / bytes adjusts for how many tokens the tokenizer produces per byte of raw text
  • / ln(2) converts from nats to bits
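
The formula can be sketched as a small helper; the function name and example counts below are illustrative, not from any particular library:

```python
import math

def bits_per_byte(loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    # loss_nats: mean cross-entropy per token, in nats
    # n_tokens / n_bytes: how many tokens the tokenizer produced per byte of text
    return loss_nats * (n_tokens / n_bytes) / math.log(2)

# e.g. a loss of 2.0 nats/token with 250 tokens covering 1000 bytes:
# 2.0 * 0.25 / ln(2) ~= 0.72 bits per byte
bpb = bits_per_byte(2.0, 250, 1000)
```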

The key advantage over raw loss or perplexity is that BPB is tokenizer-independent: models with different vocabularies or tokenization schemes can be compared fairly, because BPB measures how many bits the model needs per byte of raw text rather than per token.

Lower BPB = better compression = better model. A value of 1.0 means the model needs 1 bit per byte of text; a uniform random model over byte values would score 8.0 (since a byte holds 8 bits).
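
The 8.0 baseline falls out of the formula directly. A minimal sanity check, assuming byte-level tokenization (tokens == bytes) and a uniform distribution over the 256 byte values:

```python
import math

# A uniform model over 256 byte values has cross-entropy ln(256) nats per token.
loss = math.log(256)

# With byte-level tokenization, tokens/bytes = 1, so:
# BPB = ln(256) / ln(2) = log2(256) = 8
bpb = loss * (1000 / 1000) / math.log(2)
```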