Min-Hashing

Min-Hashing applies Locality-Sensitive Hashing for Jaccard Distances.

Problem Setup - Near Neighbours

Problem:

  • – Set of Objects
  • – Distance from object to
  • – maximum distance threshold Goal: Find all unordered pairs s.t. .
  • The The naïve approach: compute all pairs in

But with min-hashing, we can achieve performance!

Some examples of real-world problems for near neighbours

  • Template Based Protein Folding
  • Document similarity

Min-hashing (two views) Imagine you have a random permutation .

  • Do this exercise, you will understand how the signature matrix is calculated, but it’s pretty straightforward

Abstract

  • Step 1 – convert each document into n-grams
  • Step 1.1 – convert each unique n-gram into an integer
  • Step 2 – Generate a set of universal hash functions
  • Step 3 – For each document, compute the short signature vector
  • Step 4 – Pick values of R, B to tune to the false-positive and/or false negative rates you want
  • Step 5 – Hash each of the B bands for each document to find candidate pairs
  • Step 6 (technically optional, but absurd to skip) – Confirm the signatures are similar
  • Step 7 (more optional) – Confirm the documents are similar