Dictionary Encoding

Idea: Replace repeated substrings/tokens with a short reference to a “dictionary entry”.

Best when: the data has recurring patterns that can repeat far apart, not necessarily in long consecutive runs.
Examples: text, source code, JSON logs, repetitive structured data, network packets.
Toy example (token dictionary):
Data: {"status":"ok"} {"status":"ok"} {"status":"ok"}
Dictionary: D0 = {"status":"ok"}
Encoded: D0 D0 D0
Real-world: LZ77/LZ78/LZW, DEFLATE (gzip) uses LZ + Huffman.

Pros: Can compress lots of kinds of repetition (not just adjacent).
Cons: Needs dictionary management; if the data doesn’t repeat patterns, overhead can hurt.

It’s very similar to Run-Length Encoding, except that the repeated chunks can be anywhere.

🛠️ Steven Gong

Dictionary Encoding

Graph View

Backlinks