Opposite of Bloom filter?

13,439

Solution 1

Yes, a lossy hash table or a LRUCache is a data structure with fast O(1) lookup that will only give false negatives -- if you ask if "Have I run test X", it will tell you either "Yes, you definitely have", or "I can't remember".

Forgive the extremely crude pseudocode:

setup_test_table():
    create test_table( some large number of entries )
    clear each entry( test_table, NEVER )
    return test_table

has_test_been_run_before( new_test_details, test_table ):
    index = hash( test_details , test_table.length )
    old_details = test_table[index].detail
    // unconditionally overwrite old details with new details, LRU fashion.
    // perhaps some other collision resolution technique might be better.
    test_table[index].details = new_test_details
    if ( old_details === test_details ) return YES
    else if ( old_details === NEVER ) return NEVER
    else return PERHAPS    

main()
    test_table = setup_test_table();
    loop
        test_details = generate_random_test()
        status = has_test_been_run_before( test_details, test_table )
        case status of
           YES: do nothing;
           NEVER: run test (test_details);
           PERHAPS: if( rand()&1 ) run test (test_details);
    next loop
end.

Solution 2

The exact data structure that accomplishes this task is a Direct-mapped cache, and is commonly used in CPUs.

function set_member(set, item)
    set[hash(item) % set.length] = item

function is_member(set, item)
    return set[hash(item) % set.length] == item

Solution 3

Is it possible to store the tests that you did not run? This should inverse the filter's behavior.

Solution 4

Use a bit set, as mentioned above. If you know the no. of tests you are going to run beforehand, you will always get correct results (present, not-present) from the data structure.
Do you know what keys you will be hashing? If so, you should run an experiment to see the distribution of the keys in the BloomFilter so you can fine tune it to reproduce false positives, or what have you.
You might want to checkout HyperLogLog as well.

View more solutions

13,439

Author by

Admin

Updated on July 31, 2022

Comments

Admin almost 2 years

I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I already ran if I can avoid it efficiently.

So, I'm thinking about using a Bloom filter to store the tests which have been already ran. However, the Bloom filter errs on the unsafe side for me. It gives false positives. That is, it may report that I've ran a test which I haven't. Although this could be acceptable in the scenario I'm working on, I was wondering if there's an equivalent to a Bloom filter, but erring on the opposite side, that is, only giving false negatives.

I've skimmed through the literature without any luck.
Admin over 15 years

Thanks for your reply. I think it's still useful as I can always stop after a fixed amount of time. In fact I can keep on generating tests forever. But such a data structure will help me to ensure that most tests are in fact new ones without running out of memory fast.
RarrRarrRarr about 14 years

You can't do that because it's impossible to take elements out of a Bloom filter.
awdz9nld over 11 years

Also on the order of millions, you could use a simple bit array. :)
roam about 11 years

@MartinKällman You're correct if memory efficiency is not a requirement because all the solutions proposed above require store of the original items (set_member in your case). After memory limit is met, it would give both false negative and false positive results. Bloom Filter never gives false negative results even when the false positive rate is rather high due to too many inputs.
awdz9nld about 11 years

Yes, that is a requirement for any associative array
Pimin Konstantin Kefaloukos over 7 years

I would add that any answer that combines a memory model with an eviction strategy, such as MRU, LFU or ARC, is a valid answer to this question.
Viktor Mellgren over 7 years

or a Cuckoo filter cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf bdupras.github.io/filter-tutorial
awdz9nld about 7 years

while any lossy set with discrete membership could be said to be of the family of data structures considered "opposite" of sets with probabilistic membership, the LRU heuristic is a completely separate concern and has no direct relevance to the question. admittedly, so does my own answer (it assumes associativity 1), if we were to generalise. it is sufficient to say that there is a transform f(set, item) -> set' defined such that given a set and an item, a new set' is produced which may include item as a member, subject to cardinality constraints. the choice of f is irrelevant