Bits Per Byte (BPB)

BPB = loss × (tokens / bytes) / ln(2)

  • loss is the model's mean per-token cross-entropy loss (in nats)
  • tokens / bytes adjusts for how many tokens the tokenizer produces per byte of raw text
  • / ln(2) converts from nats to bits
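
The formula can be sketched as a small helper; the function name and example counts below are illustrative, not from any particular library:

```python
import math

def bits_per_byte(loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    # loss_nats: mean cross-entropy per token, in nats
    # n_tokens / n_bytes: how many tokens the tokenizer produced per byte of text
    return loss_nats * (n_tokens / n_bytes) / math.log(2)

# e.g. a loss of 2.0 nats/token with 250 tokens covering 1000 bytes:
# 2.0 * 0.25 / ln(2) ~= 0.72 bits per byte
bpb = bits_per_byte(2.0, 250, 1000)
```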

The key advantage over raw loss or perplexity is that BPB is tokenizer-independent: models with different vocabularies or tokenization schemes can be compared fairly, because BPB measures how many bits the model needs per byte of raw text rather than per token.

Lower BPB = better compression = better model. A value of 1.0 means the model needs 1 bit per byte of text; a uniform random model over byte values would score 8.0 (since a byte holds 8 bits).
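
The 8.0 baseline falls out of the formula directly. A minimal sanity check, assuming byte-level tokenization (tokens == bytes) and a uniform distribution over the 256 byte values:

```python
import math

# A uniform model over 256 byte values has cross-entropy ln(256) nats per token.
loss = math.log(256)

# With byte-level tokenization, tokens/bytes = 1, so:
# BPB = ln(256) / ln(2) = log2(256) = 8
bpb = loss * (1000 / 1000) / math.log(2)
```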