🛠️ Steven Gong


Jul 07, 2025, 1 min read

Tokenizer

Demos

https://tiktokenizer.vercel.app/
https://platform.openai.com/tokenizer

Are tokenizers hard-coded or learned? Both exist:

Rule-Based (Hard-Coded) Tokenizers:
- Examples: Whitespace tokenizers, simple punctuation-based splitters
Learned (Trained) Tokenizers:
- Byte Pair Encoding (used for GPT-4o) → tiktoken
- SentencePiece, a tokenization library by Google
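For a feel of the rule-based side, here is a minimal sketch of a hard-coded tokenizer. The regex and the name `rule_based_tokenize` are my own illustrative choices, not taken from any library: words stay whole and every punctuation mark becomes its own token.

```python
import re

# Hypothetical rule: a run of word characters is one token,
# and any single non-space, non-word character is one token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def rule_based_tokenize(text: str) -> list[str]:
    """Split text with a fixed rule; nothing is learned from data."""
    return TOKEN_PATTERN.findall(text)


print(rule_based_tokenize("Hello, world! Tokenizers aren't magic."))
# ['Hello', ',', 'world', '!', 'Tokenizers', 'aren', "'", 't', 'magic', '.']
```

The rule never changes no matter what text it sees, which is exactly what separates it from the learned tokenizers above.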

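"Learned" does not necessarily mean a neural network.

Training just runs over a large corpus of text (for example as some sort of MapReduce job): count how often each adjacent pair of tokens occurs, merge the most frequent pair into a new token, and repeat. A single counting pass is linear in the corpus length; naively recounting after every merge is what can push the total cost toward $O(n^{2})$.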
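The count-and-merge loop can be sketched in a few lines. This is a toy, byte-level version under my own assumptions (the function names `count_pairs`, `merge`, and `train_bpe` are hypothetical, and new token ids simply start at 256); it is not how tiktoken or SentencePiece are actually implemented.

```python
from collections import Counter


def count_pairs(ids: list[int]) -> Counter:
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int):
    """Toy BPE training: start from raw UTF-8 bytes, merge greedily."""
    ids = list(text.encode("utf-8"))  # initial tokens are byte values 0..255
    merges = {}                       # (left, right) -> new token id
    for step in range(num_merges):
        counts = count_pairs(ids)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:                  # nothing repeats, so merging gains nothing
            break
        new_id = 256 + step           # assumption: new ids start after the bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(len(ids), merges)
```

Each round does one linear pass to count and one to merge, so the sequence gets shorter while the vocabulary grows by one token per round. No gradients or neural network are involved, only frequency statistics.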

Source: https://www.youtube.com/watch?v=7xTGNNLPyMI

https://watml.github.io/slides/CS480680_lecture11.pdf

Related

Embedding
