What are Bloom filters?

What are Bloom filters?

A tale of code, dinner, and a favour with unexpected consequences.

Image for post

Next we put Dog into our filter. Using the same hash functions, Dog hashes to 5, 2, and 1, so those buckets get filled up too. After we?ve done both of these operations, buckets 1, 2, 3, 4, 5, and 10 are full.

Image for post

Does Ev have a cat? We can ask the filter. Because the same input always hashes to the same output, Cat still hashes to 3, 4, and 10, so we need to check all those buckets. Because they?re all full, we?re pretty sure that Ev has a cat.

So let?s find out if Ev has an elephant. Elephant hashes to 10, 6, and 7. When we check those buckets, 10 is full, but 6 and 7 are empty. We know that if Elephant had been put into the filter, buckets 10, 6, and 7 would be full. So we can categorically say that Ev doesn?t have an elephant. Yet.

Image for post

But how do we get false positives? Imagine now that we want to see if Ev has a monkey. As Brad recalled, Monkey hashes to 10, 2, and 1. We check buckets 10, 2, and 1, and lo and behold, they?re all full. The Bloom filter tells us that Ev has a Monkey, even though he doesn?t. Yet.

Image for post

As a corollary, a Bloom filter can?t answer the question ?what pets does Ev Williams have??, because it?s one-way. There?s no way to tell what combination of fauna, exotic or otherwise, led to the current set of filled buckets.

We can control the probability of getting a false positive by controlling the size of the Bloom filter. More space means fewer false positives. Say it with me, people: trade-offs.

Digestif

We leave the restaurant and head down the road for a nightcap.

?Ok, let me see if I have this right,? says Sarah. ?You want to be able to quickly exclude posts that a user has read, so that you don?t suggest those posts again. And a Bloom filter is a good fit for this problem, because it will never let through a post that the user has read, and even though it might exclude a post that they haven?t read, that?s ok because they?ll never know what they don?t see? And it?s very fast??

?Precisely,? I reply.

?That would make a good story,? says Marcin.

Thanks are due to Sarah Agudo, whose interest in technology frequently sparks discussions like this, and who has an incredibly high tolerance for long-winded explanations. Thanks to Nick Santos for offering feedback on the technical portions. Thanks also, I guess, to Marcin Wichary for manipulating me into writing this, and for the small details that make reading the written word on Medium a joy. Though I feel like he owes me one now.

As in any piece of writing about code, there are certain simplifications made in deference to the story, and to clearer communication. All errors are my own. If you spot anything particularly egregious, leave a note in the margins!

20