
Bloom Join is a join algorithm used in distributed databases to reduce data transfer when joining tables stored on different nodes. One node builds a Bloom Filter, a compact probabilistic summary of its join keys, and sends it to the other node. That node uses the filter to discard rows that cannot match before shipping anything, so only candidate rows cross the network for the actual join.
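To make the pre-filtering idea concrete, here is a minimal sketch of a Bloom Filter in plain Python. The class name `SimpleBloomFilter`, the bit-array size, and the SHA-256-based hashing are illustrative assumptions, not how any particular database implements it.

```python
import hashlib


class SimpleBloomFilter:
    """Toy Bloom filter: a bit array plus k hash functions (illustrative only)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive k bit positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key) -> None:
        # Set every bit position derived from the key.
        for pos in self._positions(key):
            self.bits[pos] = True

    def __contains__(self, key) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = SimpleBloomFilter()
for customer_id in (101, 102, 103, 104):
    bloom.add(customer_id)

print(101 in bloom)  # True: a key that was added is never reported absent
```

A lookup for a key that was never added usually returns `False`, but it can occasionally return `True`. That asymmetry is why the filter is safe for pre-filtering: it may let a few extra rows through, but it never drops a row that would have matched.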
Assume we have two distributed tables, Table A (small) and Table B (large):

| Table A: CustomerID | Table A: Name | Table B: CustomerID |
|---|---|---|
| 101 | Alice | 101 |
| 102 | Bob | 102 |
| 103 | Carol | 104 |
| 104 | Dave | 105 |

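The Bloom Filter is built from Table A's keys, so it contains (101, 102, 103, 104).
Only the rows from Table B whose keys pass the filter (101, 102, 104) are sent; the row for 105 is dropped before any data crosses the network. The join then produces:
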
| CustomerID | Name | Purchases |
|---|---|---|
| 101 | Alice | $100 |
| 102 | Bob | $200 |
| 104 | Dave | $150 |
✅ Optimized Join:
```python
from pybloom_live import BloomFilter

# Step 1: Create Table A (small table)
table_a = {101: "Alice", 102: "Bob", 103: "Carol", 104: "Dave"}

# Step 2: Build a Bloom Filter for Table A's keys
bloom = BloomFilter(capacity=100, error_rate=0.01)
for key in table_a:
    bloom.add(key)

# Step 3: Table B (large table) - Simulated
table_b = {101: 100, 102: 200, 104: 150, 105: 300}

# Step 4: Filter Table B using Bloom Filter
filtered_b = {k: v for k, v in table_b.items() if k in bloom}

# Step 5: Perform Join (the membership check drops any Bloom Filter false positives)
result = {k: (table_a[k], v) for k, v in filtered_b.items() if k in table_a}
print("Bloom Join Result:", result)
```

Output:

```
Bloom Join Result: {101: ('Alice', 100), 102: ('Bob', 200), 104: ('Dave', 150)}
```
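The `capacity` and `error_rate` arguments in the example above determine how large the filter has to be. As a rough guide, the textbook sizing formulas can be sketched as below. The helper name `bloom_parameters` is hypothetical, and `pybloom_live` may size its filter differently internally, so treat the numbers as estimates.

```python
import math


def bloom_parameters(capacity: int, error_rate: float) -> tuple[int, int]:
    """Return (number of bits, number of hash functions) from the textbook formulas."""
    # m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole hash function
    num_hashes = max(1, round(num_bits / capacity * math.log(2)))
    return num_bits, num_hashes


num_bits, num_hashes = bloom_parameters(capacity=100, error_rate=0.01)
print(num_bits, num_hashes)  # 959 7
```

At a 1% error rate this works out to roughly 9.6 bits per key, which is why shipping the filter is usually far cheaper than shipping the keys themselves, let alone whole rows.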
```sql
-- Bloom Filter settings, with key names as given in this example.
-- They may not match your Spark version: Apache Spark 3.3+ controls its runtime
-- Bloom Filter join optimization with spark.sql.optimizer.runtime.bloomFilter.enabled.
SET spark.sql.bloomFilter.enabled=true;
SET spark.sql.bloomFilter.numItems=1000000;
SET spark.sql.bloomFilter.fpp=0.01;

CREATE TABLE customer_small (
    customer_id INT,
    name STRING
) USING PARQUET;

INSERT INTO customer_small VALUES (101, 'Alice'), (102, 'Bob'), (103, 'Carol'), (104, 'Dave');

CREATE TABLE transactions_large (
    customer_id INT,
    amount INT
) USING PARQUET;

INSERT INTO transactions_large VALUES (101, 100), (102, 200), (104, 150), (105, 300);

-- Apply Bloom Filter on customer_id in customer_small
-- (the property name is engine-specific; check your engine's documentation)
ALTER TABLE customer_small SET TBLPROPERTIES ('bloom_filter_columns'='customer_id');

SELECT c.customer_id, c.name, t.amount
FROM customer_small c
JOIN transactions_large t
ON t.customer_id = c.customer_id;
```
✅ The Bloom Filter lets the engine skip rows in transactions_large whose customer_id has no match in customer_small, so less data is scanned and shuffled.
| Feature | Bloom Join | Hash Join | Broadcast Join |
|---|---|---|---|
| Network Cost | ✅ Low (Only relevant data transferred) | ❌ High | ❌ High |
| Memory Usage | ✅ Low (Only a Bloom Filter stored) | ❌ High | ❌ High |
| Performance | ✅ Faster for large tables | ❌ Slow if tables are large | ✅ Fast for small tables |
| False Positives? | ❌ Yes (rare, configurable; removed by the final join) | ✅ No | ✅ No |
| Best For | Large distributed joins | General purpose joins | Small tables |
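The "False Positives?" row deserves a closer look: a false positive only means an extra row gets transferred, not that the join result is wrong. The sketch below forces false positives with a deliberately tiny filter and checks that the final join still matches an exact join. The filter size, the hashing scheme, and the contents of `table_b` are hypothetical values chosen for illustration.

```python
import hashlib

NUM_BITS = 16    # deliberately tiny so that false positives are likely
NUM_HASHES = 2


def positions(key):
    # Derive bit positions by salting a SHA-256 hash with the hash index.
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


table_a = {101: "Alice", 102: "Bob", 103: "Carol", 104: "Dave"}
table_b = {key: key * 10 for key in range(100, 200)}  # hypothetical amounts

# Build the filter from Table A's keys.
bits = [False] * NUM_BITS
for key in table_a:
    for pos in positions(key):
        bits[pos] = True


def might_contain(key) -> bool:
    return all(bits[pos] for pos in positions(key))


# Rows of Table B that pass the filter (true matches plus any false positives).
candidates = {k: v for k, v in table_b.items() if might_contain(k)}

# The final join checks the real keys, so false positives are dropped here.
bloom_join = {k: (table_a[k], v) for k, v in candidates.items() if k in table_a}
exact_join = {k: (table_a[k], v) for k, v in table_b.items() if k in table_a}

false_positives = len(candidates) - len(bloom_join)
print("candidates:", len(candidates), "false positives:", false_positives)
print(bloom_join == exact_join)  # True
```

The cost of a false positive is wasted bandwidth, which is why the error rate is a tuning knob between filter size and extra rows shipped.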
🚀 If you’re working with distributed databases, Bloom Join is an essential optimization to know! 🚀