HyperLogLog Counter (HLL)
HyperLogLog is a probabilistic data structure that estimates the cardinality of a multiset (number of unique elements).
I was introduced in CS451.
Resources
Problem Setup
We want to check the amount of unique elements we’ve seen so far.
However, using a regular set is too large to hold in memory (remember that the hashset stores both the hash and the original values in case of hash collisions).
- Counting unique items usually requires an amount of memory proportional to the number of items you want to count
A smarter way is to simply store a hashed value.
Observation: hash(item) → vector of e.g. 32 bits
- ½ of items will have a hash code starting with 0
- ¼ of items will have a hash code starting with 00
- …
How does this help us?
If we’ve seen ~64 elements, we would expect (As in ) that one of them would start with 000000 (6 zeros, 1 in 64 items).
All we need to do is record the longest string of leading 0s we’ve seen in any hash codes!
If we’ve seen , our estimate is we’ve seen approximately unique items.