Choosing between SimHash and MinHash for a production system

Solution 1

Simhash is faster (very fast) and typically requires less storage, but imposes a strict limitation on how dissimilar two documents can be and still be detected as duplicates. If you are using a 64-bit simhash (a common choice), then depending on how many permuted tables you are capable of storing, you might be limited to Hamming distances as low as 3 or possibly as high as 6 or 7. Those are small Hamming distances! You'll be limited to detecting documents that are mostly identical, and even then you may need to do some careful tuning of which features you choose to go into the simhash and what weightings you give to them.
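
To make the mechanics concrete, here is a minimal sketch of what a 64-bit simhash and the Hamming-distance check might look like in Python. The md5-based feature hashing, the toy features, and the weights are placeholder assumptions for illustration, not a production scheme:

```python
import hashlib

def simhash64(weighted_features):
    """weighted_features: iterable of (feature_string, weight) pairs."""
    v = [0.0] * 64
    for feature, weight in weighted_features:
        # Hash each feature to 64 bits; any decent 64-bit hash would do here.
        h = int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")
        for i in range(64):
            v[i] += weight if (h >> i) & 1 else -weight
    # Each output bit is the sign of the accumulated weighted votes.
    return sum(1 << i for i in range(64) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Two documents count as near-duplicates only if their simhashes differ in at
# most k bits, where k is the small threshold (3-7) discussed above.
doc_a = simhash64([("the", 1), ("quick", 2), ("brown", 2), ("fox", 3)])
doc_b = simhash64([("the", 1), ("quick", 2), ("brown", 2), ("dog", 3)])
print(hamming_distance(doc_a, doc_b))
```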

The generation of simhashes is patented by Google, though in practice they seem to allow at least non-commercial use.

Minhash uses more memory, since you'd typically be storing 50-400 hashes per document, and it's not as CPU-efficient as simhash, but it allows you to find quite distant similarities, e.g. as low as 5% estimated similarity, if you want. It's also a bit easier to understand than simhash, particularly in terms of how the tables work. It's quite straightforward to implement, typically using shingling, and doesn't need a lot of tuning to get good results. It's not (to my knowledge) patented.
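
For comparison, here is a rough sketch of a straightforward minhash over character shingles. The shingle size, the 128-hash signature length, and the seeded-md5 hash family are assumptions you would tune for your own data:

```python
import hashlib

NUM_HASHES = 128

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text):
    sig = []
    for seed in range(NUM_HASHES):
        # For each hash function, keep the minimum hash over all shingles.
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching positions estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature("the quick brown fox jumps over the lazy dog")
sig2 = minhash_signature("the quick brown fox jumped over a lazy dog")
print(estimated_jaccard(sig1, sig2))
```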

If you're dealing with big data, the most CPU-intensive part of the minhash approach will likely be after you've generated the minhashes for your document, when you're hunting through your table to find other documents that share some of its hashes. There may be tens or hundreds of thousands of documents that share at least one hash with it, and you've got to weed through all of these to find those few that share e.g. a minimum of half its hashes. Simhash is a lot quicker here.
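
As a sketch of that weeding stage, one simple structure is an inverted index from each (position, hash value) pair to the documents containing it; you then count how many hashes each candidate shares with the query. The index layout and the half-the-hashes threshold below are illustrative assumptions:

```python
from collections import defaultdict

index = defaultdict(set)   # (position, hash value) -> set of doc ids

def add_document(doc_id, signature):
    for pos, h in enumerate(signature):
        index[(pos, h)].add(doc_id)

def near_duplicates(signature, min_shared_fraction=0.5):
    # Count how many hash positions each candidate shares with the query;
    # this is the potentially large set that has to be weeded through.
    shared = defaultdict(int)
    for pos, h in enumerate(signature):
        for doc_id in index[(pos, h)]:
            shared[doc_id] += 1
    threshold = min_shared_fraction * len(signature)
    return [doc_id for doc_id, n in shared.items() if n >= threshold]
```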

As Otmar points out in his comment below, there are optimizations of minhash that allow you to achieve the same precision on your similarity estimates with fewer hashes per document. This can substantially reduce the amount of weeding you have to do.

Edit:

I have now tried superminhash. It's fairly fast, though my implementation of minhash using a single hash function plus bit-transformations to produce all the other hashes was faster for my purposes. It offers more accurate Jaccard estimates, about 15% better in some situations I tested (though almost no difference in others). This should mean you need about a third fewer hashes to achieve the same accuracy. Storing fewer hashes in your table means less "weeding" is needed to identify near duplicates, which delivers a significant speed-up. I'm not aware of any patent on superminhash. Thanks Otmar!
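
For what it's worth, one common way to realise that single-hash-plus-bit-transformations trick (not necessarily the exact transformation referred to above) is to hash each shingle once and derive the remaining hash values by XOR-ing with fixed random masks, roughly like this sketch:

```python
import hashlib
import random

NUM_HASHES = 128
random.seed(42)
MASKS = [random.getrandbits(64) for _ in range(NUM_HASHES)]

def minhash_signature_fast(shingle_set):
    mins = [float("inf")] * NUM_HASHES
    for s in shingle_set:
        # One real hash per shingle...
        base = int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
        for i, mask in enumerate(MASKS):
            # ...then cheap derived "hash functions" via bit transformations.
            h = base ^ mask
            if h < mins[i]:
                mins[i] = h
    return mins
```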

Solution 2

This paper might give you some ideas about the two algorithms.

http://jmlr.org/proceedings/papers/v33/shrivastava14.pdf


Comments

  • Brian Spiering
    Brian Spiering about 2 years

    I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance similarity over binary vectors. But I can't decide which one would be better to use.

    I am creating a backend system for a website to find near duplicates of semi-structured text data. For example, each record will have a title, location, and a brief text description (<500 words).

    Specific language implementation aside, which algorithm would be best for a greenfield production system?

  • otmar
    otmar almost 7 years
    There are some developments which made minwise hashing more compact (Li, Ping, and Christian König. "b-Bit minwise hashing." Proceedings of the 19th international conference on World wide web. ACM, 2010) and much faster ( Shrivastava, Anshumali. "Optimal densification for fast and accurate minwise hashing." arXiv preprint arXiv:1703.04664 (2017); Dahlgaard, Søren, et al. "Fast Similarity Sketching." arXiv preprint arXiv:1704.04370 (2017); Ertl, Otmar. "SuperMinHash-A New Minwise Hashing Algorithm for Jaccard Similarity Estimation." arXiv preprint arXiv:1706.05698 (2017)).
  • Ben Whitmore
    Ben Whitmore almost 7 years
    Thanks Otmar! I've edited my answer to explain more about different aspects of efficiency and especially mention your superminhash algorithm.
  • duhaime
    duhaime about 6 years
    @BenWhitmore is your superminhash code still around? If so, would you be willing to share a link? :)
  • Brian Spiering
    Brian Spiering about 5 years
    @BenWhitmore Thanks for the detailed answer. I'm not concerned with performance after hashing because I have a two-stage system. 1st stage is high recall. 2nd stage is high precision.
  • Ben Whitmore
    Ben Whitmore about 5 years
    @BrianSpiering If you're interested in high recall but not too bothered about speed or memory performance, consider starting with a straightforward minhash implementation: this should work well, and will also get you thinking in the right paradigm should you later want to retrofit some optimisations. Simhash will not give the high recall you're looking for, as it is very restrictive in terms of what duplicates it will find.
  • otmar
    otmar over 4 years
    I have recently published a paper about a new class of minwise hashing algorithms called ProbMinHash, arxiv.org/pdf/1911.00675.pdf. It contains charts comparing MinHash, SuperMinHash, and different versions of ProbMinHash over the input size. In particular, I recommend ProbMinHash3a as it is faster than MinHash for input sizes greater than 100 and also significantly faster than SuperMinHash. Code is available on GitHub: github.com/oertl/probminhash