
Mar 22, 2025, 1 min read

Tokenizer, Byte Pair Encoding

tiktoken

This is what OpenAI uses. It uses Byte Pair Encoding under the hood. Essentially, you start with a base tokenization scheme with a very small vocabulary (256 tokens, one per byte). Then, you iteratively merge the most frequent adjacent pair of tokens into a new token, growing the vocabulary one merge at a time. A toy version of this loop is sketched below.
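A minimal sketch of that merge loop (toy Python, not OpenAI's actual implementation; `train_bpe` and the sample text are made up for illustration):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Toy BPE trainer: start from raw bytes (256-token base vocab)
    and repeatedly merge the most frequent adjacent token pair."""
    ids = list(text.encode("utf-8"))  # base tokens: one id per byte
    merges = []
    next_id = 256  # new token ids start just past the byte range
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(pair)
        # replace every occurrence of the pair with the new token id
        new_ids, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        next_id += 1
    return merges

merges = train_bpe("low lower lowest low low", num_merges=5)
print(merges)  # first merges are e.g. 'l'+'o', then 'lo'+'w'
```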

GPT-2 has a vocabulary of ~50k tokens.

The GPT-4 base model uses ~100k tokens.

  • Source: https://www.youtube.com/watch?v=7xTGNNLPyMI
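You can check these vocabulary sizes directly with the tiktoken package (a short sketch; the exact `n_vocab` values are printed rather than hard-coded, since the precise counts are worth verifying locally):

```python
import tiktoken

# GPT-2's encoding (~50k tokens) vs. GPT-4's cl100k_base (~100k tokens)
gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")
print(gpt2.n_vocab, gpt4.n_vocab)  # roughly 50k vs 100k

ids = gpt4.encode("Byte Pair Encoding")
print(ids)               # token ids under cl100k_base
print(gpt4.decode(ids))  # round-trips back to the original string
```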

Demo

  • https://tiktokenizer.vercel.app/

What uses tiktoken?

  • GPT-4
  • Llama 3
