Computer Science

What is MinHash (Minimum Hashing)?

MinHash is a probabilistic hashing technique used to estimate the similarity between two sets efficiently. Instead of computing the exact Jaccard similarity, MinHash provides a fast, space-efficient approximation.

Key Idea:

  • Instead of storing entire sets, MinHash compresses sets into fixed-size hash signatures.
  • These signatures allow us to compute the Jaccard similarity between two sets with high accuracy.
  • Used in search engines, near-duplicate detection, clustering, and recommender systems.

Why Use MinHash?

1. Space-Efficient Similarity Estimation

  • Instead of storing entire sets, we store a small fixed-size signature.

2. Faster than Exact Jaccard Similarity

  • Exact Jaccard computation requires O(n + m) time (for sets of size n and m).
  • MinHash provides an approximation in O(k) time, where k is the signature length.

3. Works Well for Large-Scale Data

  • Used in Google Search, document deduplication, and plagiarism detection.

How Does MinHash Work?

1. Jaccard Similarity (Baseline)

The Jaccard Similarity between two sets A and B is defined as: J(A,B) = \frac{|A \cup B|}{|A \cap B|}

  • Example:
    • A = {1, 2, 3, 4}
    • B = {3, 4, 5, 6}
    • J(A, B) = |{3, 4}| / |{1, 2, 3, 4, 5, 6}| = 2/6 = 0.33

2. MinHash Approximation

Instead of computing the exact intersection and union:

  1. Apply multiple hash functions to each element of the set.
  2. Record the smallest hash value for each function.
  3. Compare MinHash signatures to estimate similarity.

3. MinHash Signature

If we use k hash functions, each set is represented as a k-dimensional MinHash signature:

Signature(A) = [h1(A), h2(A), ..., hk(A)]
Signature(B) = [h1(B), h2(B), ..., hk(B)]
  • The fraction of matching hash values gives the estimated Jaccard similarity.

Example: MinHash Calculation

Step 1: Generate Hash Functions

We use three hash functions:

h1(x) = (2x + 1) % 5
h2(x) = (3x + 2) % 5
h3(x) = (5x + 4) % 5

Step 2: Compute MinHash Signature

Elementh1(x)h2(x)h3(x)
1304
2013
3242
4421

For Set A = {1, 2, 3}:

MinHash(A) = [min(3,0,2), min(0,1,4), min(4,3,2)] = [0, 0, 2]

For Set B = {2, 3, 4}:

MinHash(B) = [min(0,2,4), min(1,4,2), min(3,2,1)] = [0, 1, 1]

Step 3: Compute Estimated Similarity

  • Matching positions: [0, X, X] → 1/3 = 0.33
  • Actual Jaccard Similarity = 0.33 → MinHash approximation is close!

Python Implementation of MinHash

import random
import hashlib

class MinHash:
    def __init__(self, num_hashes=100):
        self.num_hashes = num_hashes
        self.hash_funcs = self._generate_hash_functions()

    def _generate_hash_functions(self):
        """Generate random hash functions of the form (ax + b) % p"""
        prime = 2147483647  # Large prime number
        return [(random.randint(1, prime), random.randint(0, prime)) for _ in range(self.num_hashes)]

    def _hash(self, x, a, b):
        """Compute hash (ax + b) % p"""
        return (a * x + b) % 2147483647

    def compute_signature(self, input_set):
        """Compute MinHash signature for a set"""
        signature = []
        for a, b in self.hash_funcs:
            min_hash = min(self._hash(x, a, b) for x in input_set)
            signature.append(min_hash)
        return signature

    def jaccard_similarity(self, sig1, sig2):
        """Estimate Jaccard similarity from MinHash signatures"""
        matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
        return matches / self.num_hashes

# Example Usage
minhash = MinHash(num_hashes=50)

set_A = {1, 2, 3, 4, 5}
set_B = {3, 4, 5, 6, 7}

sig_A = minhash.compute_signature(set_A)
sig_B = minhash.compute_signature(set_B)

print("Estimated Jaccard Similarity:", minhash.jaccard_similarity(sig_A, sig_B))

Java Implementation of MinHash

import java.util.*;

class MinHash {
    private int numHashes;
    private List<int[]> hashFunctions;

    public MinHash(int numHashes) {
        this.numHashes = numHashes;
        this.hashFunctions = generateHashFunctions();
    }

    private List<int[]> generateHashFunctions() {
        List<int[]> hashFuncs = new ArrayList<>();
        Random rand = new Random();
        int prime = 2147483647;

        for (int i = 0; i < numHashes; i++) {
            int a = rand.nextInt(prime - 1) + 1;
            int b = rand.nextInt(prime);
            hashFuncs.add(new int[]{a, b});
        }
        return hashFuncs;
    }

    private int hash(int x, int a, int b) {
        return (a * x + b) % 2147483647;
    }

    public int[] computeSignature(Set<Integer> inputSet) {
        int[] signature = new int[numHashes];

        for (int i = 0; i < numHashes; i++) {
            int minHash = Integer.MAX_VALUE;
            for (int x : inputSet) {
                int hashValue = hash(x, hashFunctions.get(i)[0], hashFunctions.get(i)[1]);
                minHash = Math.min(minHash, hashValue);
            }
            signature[i] = minHash;
        }
        return signature;
    }

    public double jaccardSimilarity(int[] sig1, int[] sig2) {
        int matches = 0;
        for (int i = 0; i < numHashes; i++) {
            if (sig1[i] == sig2[i]) matches++;
        }
        return (double) matches / numHashes;
    }

    public static void main(String[] args) {
        MinHash minhash = new MinHash(50);

        Set<Integer> setA = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> setB = new HashSet<>(Arrays.asList(3, 4, 5, 6, 7));

        int[] sigA = minhash.computeSignature(setA);
        int[] sigB = minhash.computeSignature(setB);

        System.out.println("Estimated Jaccard Similarity: " + minhash.jaccardSimilarity(sigA, sigB));
    }
}

Use Cases of MinHash

1. Search Engines

  • Google uses MinHash for near-duplicate detection.
  • Identifies similar web pages to avoid duplicate search results.

2. Recommender Systems

  • Used in Netflix, Amazon, YouTube to find similar users or items.

3. Document & Plagiarism Detection

  • Turnitin & CopyScape use MinHash to detect similar text across large corpora.

Summary

  • MinHash efficiently estimates Jaccard similarity with small memory usage.
  • Works well for large-scale applications like search engines, recommender systems, and document comparison.
  • Provides O(k) similarity estimation, making it much faster than exact computation.

🚀 MinHash is one of the most powerful algorithms for approximate similarity search in big data! 🚀

Aquinas

Hello! I'm Aquinas, a lifelong learner who finds everything in the world fascinating. I can’t ignore my curiosity, and this blog is where I document my journey of learning, exploring, and understanding various topics. I don’t limit myself to a single field—I enjoy diving into science, philosophy, technology, the arts, and more. For me, learning isn’t just about gathering information; it’s about applying knowledge, analyzing it from different perspectives, and discovering new insights along the way. Through this blog, I hope to record my learning experiences, share ideas, and connect with others who have a similar passion for knowledge. Let’s embark on this journey of exploration together! 😊

Recent Posts

What Is EPS(Earnings Per Share)?

When analyzing a stock, one of the first financial indicators you’ll encounter is EPS, or Earnings Per Share. It’s one… Read More

8 months ago

What is Market Capitalization? Everything Investors Need to Know

When you look at a stock’s profile on a financial website, one of the first things you’ll see is its… Read More

8 months ago

The MIT License

In the world of open-source software, simplicity and flexibility are often just as important as legal protection. That’s why the… Read More

9 months ago

Mozilla Public License (MPL)

If you want your software to be open source, but still compatible with commercial use—and not as restrictive as the… Read More

9 months ago

The Apache License 2.0

When it comes to open-source software, developers and businesses alike need licenses that balance freedom, legal clarity, and long-term security.… Read More

9 months ago

BSD (Berkeley Software Distribution) License

If you’re working on open-source projects or choosing third-party libraries for your software, understanding software licenses is essential. Among the… Read More

9 months ago