## Bloom Filters

A Bloom filter is a data structure for managing sets of values. Bloom filters provide O(1) lookups and insertion and, perhaps most importantly, provide an extremely compact representation of the set of values being stored. The trade-off for this compact representation is that the lookup operation can have false positives. In other words, lookup(x) may return true even when x isn’t in the set.

You might be wondering why we’d be willing or able to tolerate false positives in set lookups. There are actually lots of scenarios where this makes sense. For instance, the original application of Bloom filters–spell checking on limited-memory machines–remains a fine motivating example.

In spell checking, a Bloom filter is used to store a dictionary of correctly-spelled words. If lookup(word) returns false, the spell checker flags word as a misspelling. False positives in this application, e.g., lookup(‘notaword’) == True, results in some misspellings going unnoticed. That might seem to be a bad thing, but it’s all about balancing trade-offs. Bloom filters allow the spell checking application to load a comprehensive dictionary into a small amount of memory and makes spell checking fast enough that users can run the checker often. The small memory footprint can be achieved with a false positive rate that results in approximately 1 in 100 misspellings going undetected. Other trade-offs, like using a smaller dictionary or running the spell checker less frequently, might result in even higher error rates.

Bloom filters were invented by Burton Bloom in 1970 and described in his seminal paper Space/time Trade-offs in Hash Coding With Allowable Errors. Even though they’ve been around for 37 years now, are straighforward to implement, and have many, many practical uses, you typically don’t find Bloom filters described in data structures textbooks or taught in University undergraduate data structures courses. That’s something that should probably change IMO.

So, how do Bloom filters work? The concept is relatively simple, assuming that you’re already familiar with hashing and some simple probability. Recall that in traditional hashing you have a function h(val) that maps val onto an index in a table. Ideally, you want a function h() such that

- h(val1) == h(val2) when val1 == val2
- h(val1) != h(val2) when val1 != val2

In practice it’s vary hard to ensure condition 2 given fixed table sizes, so you have to relax the condition to something like “h(val1) is unlikely to equal h(val2) when val1 != val2″. This means that you occasionally get collisions where distinct values get the same hash value. There’s a vast literature out there about designing hash functions and sizing tables to minimize collisions, and describing how to handle collisions when they do occur. The salient point for this discussion is that these ‘traditional’ hashing techniques require that the complete value being hashed, or a unique proxy of that value, be stored in the hash table so that collisions can actually be detected.

This is where Bloom filters differ and the reason why they have false positives on lookup. When adding a value to a Bloom filter you compute k hashes which gives you k indices into an m-bit bit vector. The k bit vector entries for those indices are set to 1. To look up a value, you also compute k hashes to get k indices. If all k bit vector entries are 1 then you return true indicating that the value was found, otherwise, you return false. Here’s how you would do this in Python:

```
import BitVector
class BloomFilter:
def __init__(self, m, k):
"""Instantiate an m-bit Bloom filter using k hash indices
per value."""
self.n = 0
self.m = m
self.k = k
self.bv = BitVector.BitVector(size = self.m)
def Insert(self, s):
for i in self._HashIndices(s):
self.bv[i] = 1
self.n += 1
def Lookup(self, s):
for i in self._HashIndices(s):
if self.bv[i] != 1:
return False
return True
```

[Note: this snippet uses Avinash Kak's BitVector module.]

Both m and k are user configurable values in this code, and we haven’t really said how to choose either. If we’re going to store n values in a Bloom filter, the probability of a false positive on any given lookup is given by pow(1 – exp(-kn/m)), k). In other words, the false positive rate increases as k and n get bigger, and as m gets smaller. If you have m and n, then a reasonable starting point for k is 0.7(m/n). The derivation for these formula are an exercise in simple probability. However, I’m going to leave it as an exercise given that this post is already running long.

Now, the next big question we have to answer in order to complete our implementation is how exactly to do the hashing. For a given population of n values, we want our k hash functions to distribute bits uniformly over the m indices in the bit vector. One way to do this is to choose k mutually independent hash functions. Alternately, if you have a hash function that produces a large, uniformly distributed set of bits for each value, you can chop the hash output into k buckets and use each bucket as a hash value.

When m is reasonably big (more than a few ten thousand bits) then you can get away with just having two independent hash functions using a technique discovered by Kirsch and Mitzenmacher. You can compute the k hash values as h_i(val) = (h1(val) + i * h2(val)) % m. Here’s our implementation.

```
def _HashIndices(self, s):
indices = []
for i in xrange(1, self.k + 1):
indices.append((hash(s) + i * hashpjw(s)) % self.m)
return indices
```

Notice that for h1() we’ve used Python’s built in hash function. For h2() I’ve used a simple but very effective string hashing function due to Peter Weinberger. The implementation of that function follows.

```
def hashpjw(s):
val = 0
for c in s:
val = (val << 4) + ord(c)
tmp = val & 0xf0000000
if tmp != 0:
val = val ^ (tmp >> 24)
val = val ^ tmp
return val
```

Now all that’s left to do is to measure our performance. I’ve written a little chunk of test code that loads approximately 90% of the words from GNU aspell’s English language dictionary into a Bloom filter. The remaining 10% of the words are used as a holdback to check the false positive rate of the filter. Here’s the code.

```
def TestFilter():
import random
# holdback will record words not added to the Bloom filter
holdback = set()
# Instantiate an ~1Mbit bloom filter with k=8
bf = BloomFilter(1090177, 8 )
# Open file with one English word per line
f = open('data/words.dat')
# Add each line to either holdback or Bloom filter
for line in f:
val = line.rstrip()
if random.random() <= 0.10:
# Add ~10% of values to holdback
holdback.add(val)
else:
# Add ~90% of values to Bloom filter
bf.Insert(val)
f.close()
# Print information about current state of Bloom filter
bf.PrintStats()
# Count false positives -- # holdback items in the Bloom filter
num_false_positives = 0
for val in holdback:
if bf.InFilter(val):
num_false_positives += 1
# Compute false positive rate and print
rate = 100.0 * float(num_false_positives) / float(len(holdback))
print "Actual false positive rate = %.2f%% (%d of %d)" % (rate,
num_false_positives, len(holdback))
```

There’s one more function that I haven’t shown that prints some statistics about the current state of the Bloom filter. That code follows.

```
def PrintStats(self):
k = float(self.k)
m = float(self.m)
n = float(self.n)
p_fp = math.pow(1.0 - math.exp(-(k * n) / m), k) * 100.0
compression_ratio = float(self.bits_in_inserted_values) / m
print "Number of filter bits (m) : %d" % self.m
print "Number of filter elements (n) : %d" % self.n
print "Number of filter hashes (k) : %d" % self.k
print "Predicted false positive rate = %.2f" % p_fp
print "Compression ratio = %.2f" % compression_ratio
```

And finally, here’s the output of a single run of TestBloomFilter(). Notice that the Bloom filter is about 1Mbit, contains ~126k words and achieved a compression ratio of 7.97 (over just storing the bits of the original ~126k words). The predicted false positive rate (1.81%) was very close to the measured rate (1.87%). If we wanted to reduce this false positive rate further, holding the number of words in the dictionary constant, we could increase either m or k or both.

```
Number of filter bits (m) : 1090177
Number of filter elements (n) : 126733
Number of filter hashes (k) : 8
Predicted false positive rate = 1.81%
Compression ratio = 7.97
Actual false positive rate = 1.87% (265 of 14178)
```

Full code is available here. Don’t forget to grab BitVector module if you don’t already have it.

Next time, I’ll talk about some contemporary uses for Bloom filters.

I noticed that you calculate the hashindices on each for-loop in your function _HashIndices(). If you except larger quantities of words (like >500.000) to be inserted into a big Bloomfilter (600kB Bitvec) you’ll get an optimum number of 7 Hashfunctions… calculating 2 Hashvalues might then be a little faster than calculating 14 Hashvalues for each word

greetings