Tokenizer
Demos
Are tokenizers hard-coded or learned? Both exist:
- Rule-Based (Hard-Coded) Tokenizers
  - Examples: whitespace tokenizers, simple punctuation-based splitters
- Learned (Trained) Tokenizers
  - Byte Pair Encoding (used for GPT-4o) → tiktoken (see the sketch after this list)
  - SentencePiece, a library from Google
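A minimal sketch of the contrast, assuming the regex pattern and the `gpt-4o` model name passed to `tiktoken.encoding_for_model` (the latter requires a recent tiktoken release):

```python
import re
import tiktoken  # pip install tiktoken

text = "Tokenizers aren't always learned!"

# Rule-based: a hard-coded word/punctuation splitter
print(re.findall(r"\w+|[^\w\s]", text))
# ['Tokenizers', 'aren', "'", 't', 'always', 'learned', '!']

# Learned: GPT-4o's BPE vocabulary, loaded via tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")  # assumes tiktoken knows this model name
ids = enc.encode(text)
print(ids)                              # integer token ids
print([enc.decode([i]) for i in ids])   # the learned subword pieces
```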
By "learned", it is not necessarily a neural network. The tokenizer is simply trained on a large corpus of text: some MapReduce-style job does a frequency count of adjacent token pairs, then the most frequent pair is merged to introduce a new token, and the process repeats until the vocabulary is large enough (sketched below).
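A minimal sketch of that merge loop, assuming character-level starting tokens and a small in-memory corpus (real trainers work over bytes and distribute the counting):

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)   # start from character-level tokens
    merges = []           # learned merge rules, in order
    for _ in range(num_merges):
        # frequency count of adjacent pairs (the MapReduce-style step, done locally here)
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best, _ = pair_counts.most_common(1)[0]
        merges.append(best)
        # replace every occurrence of the best pair with a single new token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("low lower lowest", num_merges=4)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ...]
print(tokens)  # corpus re-segmented with the learned vocabulary
```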
https://watml.github.io/slides/CS480680_lecture11.pdf