Tokenizer

Are tokenizers hard-coded or learned? Both exist:

  1. Rule-Based (Hard-Coded) Tokenizers
    • Examples: whitespace tokenizers, simple punctuation-based splitters (see the sketch after this list)
  2. Learned (Trained) Tokenizers
    • Examples: BPE, WordPiece, SentencePiece/Unigram
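
A minimal sketch of a rule-based tokenizer, assuming a single regex that splits words from punctuation (the function name is illustrative, not from any library):

```python
import re

def rule_tokenize(text):
    """Hard-coded tokenizer: words vs. single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(rule_tokenize("Hello, world! It's 2024."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```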

"Learned" does not necessarily mean a neural network.

Training typically means running over a large corpus of text (often as some sort of MapReduce job), counting the frequencies of adjacent token pairs, and merging the most frequent pair to introduce a new token, repeating until the vocabulary reaches a target size. This is the idea behind byte-pair encoding (BPE).

https://watml.github.io/slides/CS480680_lecture11.pdf
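
A minimal in-memory sketch of that merge loop (a toy BPE trainer; real implementations work on word-frequency tables over far larger corpora, and the names here are illustrative):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Start from character-level tokens, word by word.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Frequency count of adjacent token pairs across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a new merged token.
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_bpe("low lower lowest low low", num_merges=5))
# [('l', 'o'), ('lo', 'w'), ('low', 'e'), ...]
```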

Special tokens

  • [CLS] - Prepended to the beginning of the sequence in BERT-style models.
  • [SEP] - Marks the end of a sentence/segment, or separates two segments.
  • [MASK] - Replaces tokens to be predicted in masked language modeling.
  • [PAD] - Pads sequences in a batch to the same length.
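
These can be inspected directly, for example with the Hugging Face tokenizer for bert-base-uncased (assumes the transformers package is installed; downloads the vocabulary on first use):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Two segments, padded to a fixed length, so [CLS], [SEP], and [PAD] all appear.
enc = tok("how are you", "i am fine", padding="max_length", max_length=12)
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]', '[PAD]', '[PAD]', '[PAD]']

print(tok.mask_token)  # '[MASK]'
```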