Tokenizer

Are tokenizers hard-coded or learned? Both exist:

  1. Rule-Based (Hard-Coded) Tokenizers
    • Examples: whitespace tokenizers, simple punctuation-based splitters (see the sketch after this list)
  2. Learned (Trained) Tokenizers
    • Examples: BPE (Byte-Pair Encoding), WordPiece, SentencePiece's Unigram model
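
As a concrete illustration of the rule-based kind, here is a minimal sketch in Python; the function names `whitespace_tokenize` and `punct_tokenize` and the specific regex rule are illustrative choices, not references to any particular library:

```python
import re

def whitespace_tokenize(text: str) -> list[str]:
    # Split on runs of whitespace; the rule is fixed, nothing is learned.
    return text.split()

def punct_tokenize(text: str) -> list[str]:
    # Split on whitespace, but also peel punctuation off as separate tokens:
    # match either a run of word characters or a single non-word, non-space character.
    return re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokenize("Hello, world! It's 2024."))
# ['Hello,', 'world!', "It's", '2024.']
print(punct_tokenize("Hello, world! It's 2024."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```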

“Learned” here does not necessarily mean a neural network.

Training typically means running over a large corpus of text (for example, as some sort of MapReduce job), computing a frequency count of adjacent token pairs, and merging the most frequent pair to introduce a new token, repeating until the vocabulary reaches the desired size. This is the core of Byte-Pair Encoding (BPE).
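
Below is a minimal sketch of that merge loop in Python, assuming a toy word-level corpus and character-level starting tokens; `train_bpe` is a hypothetical name, and real implementations typically operate on bytes and weight each word by its corpus frequency rather than looping over duplicates:

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Minimal BPE training sketch: repeatedly count adjacent-pair
    frequencies over the corpus and merge the most frequent pair."""
    # Start from character-level tokens for each word.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Frequency count of adjacent token pairs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge every occurrence of the best pair into a single new token.
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(out)
        words = merged_words
    return merges

print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The returned merge list is the learned artifact: to tokenize new text, you replay the same merges in order, so frequent substrings like "low" end up as single tokens.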

https://watml.github.io/slides/CS480680_lecture11.pdf