Dictionary Encoding

Idea: Replace repeated substrings/tokens with a short reference to a “dictionary entry”.

  • Best when: the data has recurring patterns that can repeat far apart, not necessarily in long consecutive runs.

  • Examples: text, source code, JSON logs, repetitive structured data, network packets.

  • Toy example (token dictionary):
    Data: {"status":"ok"} {"status":"ok"} {"status":"ok"}
    Dictionary: D0 = {"status":"ok"}
    Encoded: D0 D0 D0

  • Real-world: LZ77/LZ78/LZW, DEFLATE (gzip) uses LZ + Huffman.

Pros: Can compress lots of kinds of repetition (not just adjacent).
Cons: Needs dictionary management; if the data doesn’t repeat patterns, overhead can hurt.

It’s very similar to Run-Length Encoding, except that the repeated chunks can be anywhere.