HyperLogLog Counter (HLL)

HyperLogLog is a probabilistic data structure that estimates the cardinality of a multiset (number of unique elements).

I was introduced in CS451.

Resources

Problem Setup

We want to check the amount of unique elements we’ve seen so far.

However, using a regular set is too large to hold in memory (remember that the hashset stores both the hash and the original values in case of hash collisions).

  • Counting unique items usually requires an amount of memory proportional to the number of items you want to count

A smarter way is to simply store a hashed value.

Observation: hash(item) vector of e.g. 32 bits

  • ½ of items will have a hash code starting with 0
  • ¼ of items will have a hash code starting with 00

How does this help us?

If we’ve seen ~64 elements, we would expect (As in ) that one of them would start with 000000 (6 zeros, 1 in 64 items).

All we need to do is record the longest string of leading 0s we’ve seen in any hash codes!

If we’ve seen , our estimate is we’ve seen approximately unique items.