
MinHash is a probabilistic hashing technique used to estimate the similarity between two sets efficiently. Instead of computing the exact Jaccard similarity, MinHash provides a fast, space-efficient approximation.
n and m).k is the signature length.The Jaccard Similarity between two sets A and B is defined as: J(A,B) = \frac{|A \cup B|}{|A \cap B|}
A = {1, 2, 3, 4} B = {3, 4, 5, 6} J(A, B) = |{3, 4}| / |{1, 2, 3, 4, 5, 6}| = 2/6 = 0.33Instead of computing the exact intersection and union:
If we use k hash functions, each set is represented as a k-dimensional MinHash signature:
Signature(A) = [h1(A), h2(A), ..., hk(A)]
Signature(B) = [h1(B), h2(B), ..., hk(B)]
We use three hash functions:
h1(x) = (2x + 1) % 5
h2(x) = (3x + 2) % 5
h3(x) = (5x + 4) % 5
| Element | h1(x) | h2(x) | h3(x) |
|---|---|---|---|
| 1 | 3 | 0 | 4 |
| 2 | 0 | 1 | 3 |
| 3 | 2 | 4 | 2 |
| 4 | 4 | 2 | 1 |
For Set A = {1, 2, 3}:
MinHash(A) = [min(3,0,2), min(0,1,4), min(4,3,2)] = [0, 0, 2]
For Set B = {2, 3, 4}:
MinHash(B) = [min(0,2,4), min(1,4,2), min(3,2,1)] = [0, 1, 1]
[0, X, X] → 1/3 = 0.33import random
import hashlib
class MinHash:
def __init__(self, num_hashes=100):
self.num_hashes = num_hashes
self.hash_funcs = self._generate_hash_functions()
def _generate_hash_functions(self):
"""Generate random hash functions of the form (ax + b) % p"""
prime = 2147483647 # Large prime number
return [(random.randint(1, prime), random.randint(0, prime)) for _ in range(self.num_hashes)]
def _hash(self, x, a, b):
"""Compute hash (ax + b) % p"""
return (a * x + b) % 2147483647
def compute_signature(self, input_set):
"""Compute MinHash signature for a set"""
signature = []
for a, b in self.hash_funcs:
min_hash = min(self._hash(x, a, b) for x in input_set)
signature.append(min_hash)
return signature
def jaccard_similarity(self, sig1, sig2):
"""Estimate Jaccard similarity from MinHash signatures"""
matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
return matches / self.num_hashes
# Example Usage
minhash = MinHash(num_hashes=50)
set_A = {1, 2, 3, 4, 5}
set_B = {3, 4, 5, 6, 7}
sig_A = minhash.compute_signature(set_A)
sig_B = minhash.compute_signature(set_B)
print("Estimated Jaccard Similarity:", minhash.jaccard_similarity(sig_A, sig_B))
import java.util.*;
class MinHash {
private int numHashes;
private List<int[]> hashFunctions;
public MinHash(int numHashes) {
this.numHashes = numHashes;
this.hashFunctions = generateHashFunctions();
}
private List<int[]> generateHashFunctions() {
List<int[]> hashFuncs = new ArrayList<>();
Random rand = new Random();
int prime = 2147483647;
for (int i = 0; i < numHashes; i++) {
int a = rand.nextInt(prime - 1) + 1;
int b = rand.nextInt(prime);
hashFuncs.add(new int[]{a, b});
}
return hashFuncs;
}
private int hash(int x, int a, int b) {
return (a * x + b) % 2147483647;
}
public int[] computeSignature(Set<Integer> inputSet) {
int[] signature = new int[numHashes];
for (int i = 0; i < numHashes; i++) {
int minHash = Integer.MAX_VALUE;
for (int x : inputSet) {
int hashValue = hash(x, hashFunctions.get(i)[0], hashFunctions.get(i)[1]);
minHash = Math.min(minHash, hashValue);
}
signature[i] = minHash;
}
return signature;
}
public double jaccardSimilarity(int[] sig1, int[] sig2) {
int matches = 0;
for (int i = 0; i < numHashes; i++) {
if (sig1[i] == sig2[i]) matches++;
}
return (double) matches / numHashes;
}
public static void main(String[] args) {
MinHash minhash = new MinHash(50);
Set<Integer> setA = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
Set<Integer> setB = new HashSet<>(Arrays.asList(3, 4, 5, 6, 7));
int[] sigA = minhash.computeSignature(setA);
int[] sigB = minhash.computeSignature(setB);
System.out.println("Estimated Jaccard Similarity: " + minhash.jaccardSimilarity(sigA, sigB));
}
}
🚀 MinHash is one of the most powerful algorithms for approximate similarity search in big data! 🚀
When analyzing a stock, one of the first financial indicators you’ll encounter is EPS, or Earnings Per Share. It’s one… Read More
When you look at a stock’s profile on a financial website, one of the first things you’ll see is its… Read More
In the world of open-source software, simplicity and flexibility are often just as important as legal protection. That’s why the… Read More
If you want your software to be open source, but still compatible with commercial use—and not as restrictive as the… Read More
When it comes to open-source software, developers and businesses alike need licenses that balance freedom, legal clarity, and long-term security.… Read More
If you’re working on open-source projects or choosing third-party libraries for your software, understanding software licenses is essential. Among the… Read More