What is the advantage to using bloom filters?


Solution 1

From Wikipedia:

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

Pretty clear to me.

A bloom filter doesn't store the elements themselves; that is the crucial point. You don't use a bloom filter to test whether an element is present, you use it to test whether it is certainly not present, since it guarantees no false negatives. This lets you skip extra work for elements that don't exist in the set (such as disk IO to look them up).

And all in significantly less space than something like a hash table (which is likely going to be partially on disk for large data sets). Though you may use a bloom filter in conjunction with a structure like a hash table, once you're certain the element has a chance of being present.

So an example usage pattern might be:

You've got a lot of data, on disk. You decide on the error bound you want (e.g. 1%); together with the number of elements n, that prescribes the value of m, and the optimal k follows (from the formulas given in the article). You populate your filter from this disk-bound data once.
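A quick sketch of that sizing step in Python, using the standard optimal-size formulas from the Wikipedia article (the function name is my own, just for illustration):

    import math

    def bloom_parameters(n, p):
        # Optimal bit-array size m and number of hash functions k for
        # n elements and a target false-positive rate p (standard formulas).
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

    m, k = bloom_parameters(1_000_000, 0.01)
    print(m, k)           # ~9.6 million bits (about 1.2 MB), k = 7
    print(m / 1_000_000)  # ~9.6 bits per element, as in the quote above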

Now you have the filter in RAM. When you need to process some element, you query your filter to see if it stands a chance of existing in your data set. If it doesn't, no extra work is done. No disk reads, etc. (Which you would have to do if it were a hash or tree, etc).

Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really will be there, so the work was not for naught.
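In code, the pattern described above looks roughly like this (bloom_filter and disk_store are placeholders for whatever filter and expensive store you actually use; might_contain is a name I made up):

    def lookup(element, bloom_filter, disk_store):
        # bloom_filter: anything with a might_contain() membership test.
        # disk_store: any expensive key-value lookup (disk, network, ...).
        if not bloom_filter.might_contain(element):
            return None    # definitely absent: skip the expensive lookup entirely
        # "Possibly present": about 1% of these are false positives,
        # so the expensive lookup below is only rarely wasted.
        return disk_store.get(element)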

Solution 2

Alex has explained it pretty well. For those who still haven't quite grasped it, hopefully this example will help you understand:

Let's say I work for Google, on the Chrome team, and I want to add a feature to the browser which notifies the user if the URL they have entered is a malicious URL. I have a dataset of about 1 million malicious URLs, and the size of this file is around 25 MB. Since the size is quite big (big in comparison to the size of the browser itself), I store this data on a remote server.

Case 1 : I use a hash function with a hash table. I decide on an efficient hashing function and run all 1 million URLs through it to get hash keys. I then make a hash table (an array), where the hash key gives me the index at which to place that URL. Once I have hashed and filled the hash table, I check its size. I have stored all 1 million URLs in the hash table along with their keys, so the size is at least 25 MB. This hash table, due to its size, will be stored on a remote server.

When a user comes along and enters a URL in the address bar, I need to check whether it is malicious. I run the URL through the hash function (the browser itself can do this) and get a hash key for that URL. I now have to make a request to my remote server with that hash key, to check whether the URL stored in my hash table under that particular key is the same as what the user has entered. If yes, it is malicious; if no, it is not. Thus every time the user enters a URL, a request to the remote server has to be made to check whether it is a malicious URL. This would take a lot of time and thus make my browser slow.

Case 2 : I use a bloom filter. The entire list of 1 million URLs is run through the bloom filter using multiple hash functions, and the respective positions are marked as 1 in a huge array of 0s. Say we want a false positive rate of 1%; using a bloom filter calculator (http://hur.st/bloomfilter?n=1000000&p=0.01), we get the required size of the bloom filter as only 1.13 MB. This small size is expected: even though the size of the array is huge, we are only storing 1s and 0s, not the URLs as in the case of the hash table. This array can be treated as a bit array; that is, since we have only two values, 1 and 0, we can set individual bits instead of bytes. This reduces the space taken by a factor of 8.

This 1.13 MB bloom filter, due to its small size, can be stored in the web browser itself! Thus when a user comes along and enters a URL, we simply apply the required hash functions (in the browser itself) and check all the positions in the bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is DEFINITELY NOT in the list of malicious URLs and the user can proceed freely. Thus we did not make a call to the server and hence saved time. A value of 1 tells us that the URL MIGHT be in the list of malicious URLs. In these cases we make a call to the remote server, and over there we can use some other hash function with some hash table as in the first case to retrieve and check whether the URL is actually present. Since most of the time a URL is not likely to be a malicious one, the small bloom filter in the browser figures that out and hence saves time by avoiding calls to the remote server. Only when the bloom filter tells us that the URL MIGHT be malicious do we make a call to the server. That 'MIGHT' is 99% right.
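As an aside, the bit-packing trick mentioned above is only a few lines of Python (the class and method names are mine, purely for illustration):

    class BitArray:
        # m individual bits packed into a bytearray, eight flags per byte.
        def __init__(self, m):
            self.data = bytearray((m + 7) // 8)

        def set(self, i):
            self.data[i // 8] |= 1 << (i % 8)

        def test(self, i):
            return bool(self.data[i // 8] & (1 << (i % 8)))

    bits = BitArray(9_585_059)            # ~9.6 million bits in ~1.14 MB of memory
    bits.set(42)
    print(bits.test(42), bits.test(43))   # True False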

So by using a small bloom filter in the browser, we have saved a lot of time as we do not need to make server calls for every URL entered.

We can see that a hash table with a single hash function serves a different purpose altogether than a bloom filter. Hopefully this clears your doubts :)

edit:

I have implemented a bloom filter for the task of malicious URL testing in Python. The code can be found here: https://github.com/tarunsharma1/Bloom-Filter. The code is very simple to understand, and a detailed description is provided in the readme file.

Solution 3

I will start by explaining what a bloom filter is, what it can and can't do, and why we need it, then give an intuitive description of how it works, and finally give some examples of when bloom filters can be useful.

So a standard bloom filter is a probabilistic data structure that can*:


  • add an element to the set
  • check whether an element is in the set, answering either definitely not in the set or possibly in the set

This possibly in the set is exactly why it is called probabilistic. Put in fancier words, it means that false positives are possible (there can be cases where it falsely thinks that an element is in the set) but false negatives are impossible.

But it can't *:

  • remove an item from the set
  • give you a list of all elements that are currently in your set

*This set of can/can't is for a basic bloom filter. Because it is a useful data structure that was created a long time ago, people have found ways to augment it with other useful features.


But wait a minute: we already know a data structure that can answer all of this without the vague 'possibly', and also without any of the limitations (no removal, no listing of all elements). It is called a set. And here comes the main advantage of a bloom filter: it is space efficient and space constant.

This means that it does not matter how many elements we store there, the space will be the same. Yes, a bloom filter with 10^6 elements will take the same amount of space as a bloom filter with 10^20 elements (a useless bloom filter by then) and the same space as a bloom filter with 0 elements. So how much space will it take? That is up to you to decide (but there is a trade-off: the more elements you store, the more uncertain your possibly in the set answer becomes).

Another cool thing is that the space does not depend on the size of the elements. When you save data to a set, you have to actually store the data; so if you store a 27-character string in the set, you have to use at least 27 bytes of space. But for a 1% error and an optimal value of k **, you will need only ~9.6 bits (< 2 bytes) per element, whether it is a short int or a huge wall of text.

Another property is that all operations take constant time, which is absolutely not the same as the amortized constant time of hash sets (remember that if the set has collisions, a lookup can deteriorate to O(n) time).

**k is the number of hash functions used in the bloom filter


I will not describe how bloom filters work in detail (the Wikipedia article does a very good job of explaining everything). Here I will just briefly cover the basics.

  • you initialize an empty bit array of length m
  • you select k different hash functions (the more independent, the better)
  • if you want to add an element, you calculate all k hashes of this value and set the corresponding bits to 1
  • if you want to check whether an element exists, you also calculate all k hashes; if at least one of the corresponding bits is not set, the element is surely not in the set. Otherwise it may be in the set.

Even this description is enough to understand why we can't be sure an element is present (the bits it maps to may all have been set by various other values).
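Here is a minimal Python sketch of those steps. The class, the salted-SHA-256 trick for getting k hash functions, and the parameter choices are all my own, purely for illustration:

    import math
    import hashlib

    class BloomFilter:
        # A minimal bloom filter: a bit array of length m and k salted hashes.
        def __init__(self, n, p):
            # Optimal sizes for n expected elements and false-positive rate p.
            self.m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
            self.k = max(1, round((self.m / n) * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # k "different" hash functions, simulated by salting one hash.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest, "big") % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            # False => definitely not in the set; True => possibly in the set.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter(n=1000, p=0.01)
    bf.add("http://malicious.example/payload")
    print(bf.might_contain("http://malicious.example/payload"))  # True
    print(bf.might_contain("http://safe.example/"))              # almost certainly False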



So when can bloom filters be useful? The short answer is anywhere that false positives are acceptable and you want to check whether something is in a set; and even when false positives are not acceptable, a bloom filter can serve as a first line of defense that rules out expensive calls to an exact verifier.

Here is a list of more concrete descriptions:

  • the standard example of a browser checking URLs against a list of malicious websites, described almost anywhere people talk about bloom filters
  • weak-password checks: instead of shipping a huge set of all known weak passwords, you can check whether a password is surely not weak with a much smaller bloom filter
  • if you have a list of articles and a list of users, you can use a bloom filter to show users only the articles they have not read. The interesting thing is that a single filter suffices: you check whether the combination user_id + article_id is in it (see the sketch after this list)
  • Bitcoin uses bloom filters for wallet synchronization
  • Akamai's web servers use bloom filters to prevent "one-hit wonders" from being stored in their disk caches. One-hit wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a bloom filter to detect the second request for a web object, and caching the object only on that second request, prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates (taken from the examples in the Bloom filter article on Wikipedia)
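For instance, the single-filter trick for read articles might look roughly like this, reusing the BloomFilter sketch from above (the key format is an arbitrary choice):

    read_filter = BloomFilter(n=10_000_000, p=0.01)   # one filter for all users

    def mark_read(user_id, article_id):
        read_filter.add(f"{user_id}:{article_id}")

    def maybe_read(user_id, article_id):
        # False means the user has definitely not read it, so it is safe to show.
        return read_filter.might_contain(f"{user_id}:{article_id}")

    mark_read(42, "bloom-filters-explained")
    print(maybe_read(42, "bloom-filters-explained"))   # True
    print(maybe_read(42, "some-unread-article"))       # almost certainly False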

Solution 4

Bloom filters are quite useful in bioinformatics. They can be more space efficient than a regular hash table, especially when the strings you are working with can be hundreds of millions of letters long over a very small alphabet, i.e. {A, G, T, C}. They are usually used to assess whether a certain k-mer is present or absent in a genome. There's an example of one used for something relevant here.

EDIT:

The multiple hash functions are used to minimize false positives. The hope is that, across all k hash functions, each value will have a unique signature in the bit array compared to every other possible value. False positives do still exist, but they can be minimized to a manageable level. Using this technique, you hash elements independently of their size. When you search for them, you apply each hash function and check that the corresponding bit values are all 1.

Compare this to the human genome, where an increase in the size of the element (the k-mer length) increases the size of the hash table significantly (the table size grows as 4^k). This is assuming you encode the elements using 2 bits per letter.
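A rough sketch of the k-mer use case, reusing the BloomFilter sketch from Solution 3 above (the sequence and parameters are made up):

    genome = "ACGTGACCTGAAATTGCGT" * 1000          # stand-in for a real sequence
    k = 21                                         # k-mer length
    kmers = BloomFilter(n=len(genome), p=0.001)

    for i in range(len(genome) - k + 1):
        kmers.add(genome[i:i + k])

    print(kmers.might_contain(genome[100:100 + k]))  # True: this k-mer was inserted
    print(kmers.might_contain("A" * k))              # almost certainly False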

Solution 5

If a Bloom filter reports that an item is a member of the set, there is a certain probability of a false positive. If only a single hash function were used to indicate membership, the probability of a false positive would be higher than with multiple hash functions (for the same amount of memory, up to the optimal number of hash functions).
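One way to see this concretely is with the standard approximation p ≈ (1 - e^(-kn/m))^k for the false-positive rate; the numbers below are just an illustration:

    import math

    def false_positive_rate(n, m, k):
        # Approximate FP rate after inserting n items into m bits with k hashes.
        return (1 - math.exp(-k * n / m)) ** k

    n, m = 1_000_000, 10_000_000             # one million items, ten million bits
    print(false_positive_rate(n, m, 1))      # ~0.095 with a single hash function
    print(false_positive_rate(n, m, 7))      # ~0.008 with the optimal k = 7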


Comments

  • headache almost 2 years

    I am reading up on bloom filters and they just seem silly. Anything you can accomplish with a bloom filter, you could accomplish in less space, more efficiently, using a single hash function rather than multiple, or that's what it seems. Why would you use a bloom filter and how is it useful?

  • headache over 13 years
    Sorry, maybe I am misunderstanding but how can they be more space efficient compared to a regular hash? The hash of a string is fixed length output, and you simply set that value to 0 or 1. This is also what bloom filters would do, but bloom filters would do it on multiple hash functions. Where am I misunderstanding?
  • headache over 13 years
    If it's clear, please answer. How could this possibly be more space efficient than a single hash function on the same sized set? This will simply create more collisions. You will bounce around searching on separate hash functions to ensure you have a 1 at all of the hash functions. I don't understand its advantage over using a single hash function.
  • Alex Budovski over 13 years
    A hash function is code, not data. What do you intend to use the hash function with? A hash table? In that case, your table will need to store keys, which could be of arbitrary size, unlike a bloom filter. The excerpt mentions this.
  • headache over 13 years
    Consider a bloom filter with only one hash function, rather than k. What is the advantage to adding more hash functions? This will simply create more collisions. Or am I wrong?
  • Alex Budovski over 13 years
    That is answered by the last paragraph in "Space and time advantages" in the Wikipedia article, and the "Probability of false positives" section.
  • Michael Burr over 13 years
    @headache: but you generally won't get collisions across all the hash functions.
  • headache over 13 years
    It just clicked. Thank you so much, this has bugged me for a while. It decreases the number of false positives because a false positive would need to either a) be a collision on all of your hash functions or b) all of the spaces have been filled by other values. Choosing the size must be a tricky process then, I guess. Correct me if I'm wrong, but I think I get it. Thanks everybody.
  • Alex Budovski over 13 years
    @headache, "Choosing the size must be a tricky process then, I guess" -- not really, the optimal formula is already given in the Wikipedia article for both k and m in terms of n.
  • Bret Fontecchio almost 10 years
    It's not much use to just store a single hash. Then it would have no way of handling hash collisions. Most hash table implementations have a way of dealing with this that incurs overhead. Python dictionaries for example store the key alongside the hash and start linearly probing upon collision. The bloom filter cuts that out and tries to minimize the damage inherent to doing so by using multiple hashes.
  • Pacerier almost 10 years
    @AlexBudovski, In your 2nd-last paragraph, you stated that we "would have to do [disk-reads] if it were a hash". However, why do we need a disk read if hash(item) doesn't even match any key in the hashtable?
  • Kyle almost 10 years
    @Pacerier I'm going to take a stab. If your hash table is stored on disk (like, say, a k/v database that uses mmap) you will need to hit disk in order to even see if the key exists at all. With a bloom filter, you load a smaller set of information about your elements into memory and do the look up with that. Am I correct, Alex?
  • Pacerier almost 10 years
    @Kyle, If we had a hash table with 65536 buckets, we only need 65536 bits—(that's a mere 64 kb of ram, how could a bloom filter beat that?)—to keep track of all the buckets' exist-or-not state. It's as simple as boolean bucket_exist = state[hash(item, 0, 65536)] which requires exactly zero disk operations.
  • seeker almost 10 years
    One of the best explanations of why need a bloom filter. Great answer!
  • user3467349 over 9 years
    @Pacerier That's one collision for every 65536 values - which is pretty non-trivial error. It's correct though that the values don't need to be stored in the hash-table, just a boolean field is fine.
  • Squiggs. almost 9 years
    Thanks for a use case scenario.
  • divinedragon almost 9 years
    I didn't get the part of hashing and associating a value of 0 or 1. If we are using an array, and store 0 and 1 in those, how do we look up for hash value of a url when we are performing the test?
  • Tarun almost 9 years
    So basically we use something called a hash function... which takes in the URL as a string... and gives out a number... we use this number and set the corresponding array index value to 1. A number of different hashing functions exist, but what is important is that every time the same URL is passed through a hashing function, it has to generate the same number. An example of a hashing function could be to add up the ascii values of all the characters in a URL. In bloom filters we use many hashing functions and set all those array index values to 1. Hope this cleared your doubt.
  • Giorgi Moniava almost 8 years
    Why not create a bloom filter with just one hash function? Maybe a "relatively large" hash function, but one instead of many.
  • Pacerier about 7 years
    @user3467349, An error of 1/65536 is 655.36 times less than the 1% error rate: What do you mean by "pretty non-trivial"?
  • Pacerier about 7 years
    @headache, AlexBudovski. Re "You don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present"... Note that this is exactly the same as a Hash table. So really, I have never understood this: How is a bloom-filter "table" any different from a multi-hash "table"? Aren't they exactly the same thing?
  • Pacerier about 7 years
    Need some serious elaboration on the meat of the answer: "the probability of a false positive would be higher than using multiple hash functions"...
  • Qwertie over 6 years
    A conventional hashtable such as C# HashSet<String> will use 16 bytes per element in the best-case scenario in which the hashtable is completely full: 4 bytes map from a "bucket" to an entry in an entries table (an array-packed singly-linked list), 4 bytes for the cached hashcode, 4 bytes for the "next" pointer, 4 bytes for a pointer to the key. And that's not counting the string sizes. In the worst case it's 40 bytes: half the entries are unused and 20 bytes per entry once the String pointer expands to 8 bytes for 64-bit architectures.
  • user1028741 about 5 years
    You don't have to save the String itself in the hash set. You can save the hash of it as the value, making the hashset much smaller. Then you can play with the hash size - the bigger it is, the smaller the false positive rate will be.