Fastest way to check if a List<String> contains a unique String


Solution 1

Your best bet is to use a HashSet and check if a string exists in the set via the contains() method. HashSets are built for fast access via the use of Object methods hashCode() and equals(). The Javadoc for HashSet states:

This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.

HashSet stores objects in hash buckets, which is to say that the value returned by the hashCode() method determines which bucket an object is stored in. This way, the number of equality checks the HashSet has to perform via the equals() method is reduced to just the other objects in the same hash bucket.

To use HashSets and HashMaps effectively, you must conform to the equals() and hashCode() contract outlined in the Javadoc. In the case of java.lang.String, these methods have already been implemented correctly.
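As a minimal sketch of this approach (the sample strings are placeholders modeled on the user1_id:user2_id:day format mentioned in the comments below):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetLookupDemo {
    public static void main(String[] args) {
        // Hypothetical sample data standing in for the ~1,000,000 strings.
        List<String> strings = Arrays.asList("194324:2532241:29", "12:34:01", "56:78:02");

        // Build the set once, up front (O(n)); each contains() call is then ~O(1).
        Set<String> lookup = new HashSet<>(strings);

        System.out.println(lookup.contains("194324:2532241:29"));  // true
        System.out.println(lookup.contains("999:999:99"));         // false
    }
}
```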

Solution 2

In general, a HashSet will give you better performance, since it does not have to scan every element and compare, as an ArrayList does; it typically compares at most a few elements whose hash codes are equal.

However, for 1M strings, the performance of a HashSet may still not be optimal. A lot of cache misses will slow down searching the set. If all strings are equally likely to be requested, then this is unavoidable. However, if some strings are requested more often than others, then you can place the common strings into a small HashSet and check that first, before checking the larger set. The small HashSet should be sized to fit in cache (e.g. a few hundred KB at most). Hits to the small HashSet will then be very fast, while hits to the larger HashSet proceed at a speed limited by memory bandwidth.
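Below is a hypothetical sketch of that two-tier lookup; the class name and the hot/cold split are assumptions for illustration, and in a real system the hot set would be derived from observed request frequencies:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical two-tier lookup: a small cache-resident set of frequently
// requested strings in front of the full ~1M-entry set.
public class TieredLookup {
    private final Set<String> hot;   // small set, sized to stay cache-resident
    private final Set<String> cold;  // the remaining strings

    public TieredLookup(Set<String> hot, Set<String> cold) {
        this.hot = hot;
        this.cold = cold;
    }

    public boolean contains(String s) {
        // Check the small, cache-friendly set first; fall back to the big one.
        return hot.contains(s) || cold.contains(s);
    }

    public static void main(String[] args) {
        Set<String> hot = new HashSet<>(Set.of("common1", "common2"));
        Set<String> cold = new HashSet<>(Set.of("rare1", "rare2"));
        TieredLookup lookup = new TieredLookup(hot, cold);
        System.out.println(lookup.contains("common1"));  // true, found in the hot set
        System.out.println(lookup.contains("rare2"));    // true, found in the cold set
        System.out.println(lookup.contains("missing"));  // false
    }
}
```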

Solution 3

Before going further, please consider this: Why are you worried about performance? How often is this check called?

As for possible solutions:

  • If the list is already sorted, then you can use java.util.Collections.binarySearch, which offers the same performance characteristics as a java.util.TreeSet (see the sketch after this list).

  • Otherwise you can use a java.util.HashSet, which has a performance characteristic of O(1). Note that calculating the hash code for a string that doesn't have one cached yet is an O(m) operation, with m = string.length(). Also keep in mind that hash tables only work well until they reach a given load factor, i.e. hash tables will use more memory than plain lists. The default load factor used by HashSet is 0.75, meaning that internally a HashSet for 1e6 objects will use an array with about 1.3e6 entries.

  • If the HashSet does not work for you (e.g. because there are lots of hash collisions, because memory is tight, or because there are lots of insertions), then consider using a trie. Lookup in a trie has a worst-case complexity of O(m), where m = string.length(). A trie also has some extra benefits that might be useful for you: e.g., it can give you the closest match for a search string. But keep in mind that the best code is no code, so only roll your own trie implementation if the benefits outweigh the costs.

  • Consider using a database if you want more complex queries, e.g. match for a substring or a regular expression.
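As promised above, a minimal sketch of the binary-search option from the first bullet (the sample data is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedLookupDemo {
    public static void main(String[] args) {
        // Hypothetical sample data standing in for the real list.
        List<String> sorted = new ArrayList<>(Arrays.asList("cherry", "apple", "banana"));
        Collections.sort(sorted);  // binarySearch requires the list to be sorted

        // A non-negative return value means the key was found: O(log n) per lookup.
        boolean found = Collections.binarySearch(sorted, "banana") >= 0;
        System.out.println(found);  // prints: true
    }
}
```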

Solution 4

I'd use a Set; in most cases a HashSet is fine.

Solution 5

Perhaps it isn't required in your case, but I think it's worth mentioning that there are some space-efficient probabilistic data structures, for example the Bloom filter.
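For illustration, here is a minimal hand-rolled Bloom filter sketch in plain Java; the class name and the double-hashing scheme are assumptions for this example, and in practice a library implementation (such as Guava's BloomFilter) is the more sensible choice:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (illustrative only, not production code).
public class StringBloomFilter {
    private final BitSet bits;
    private final int size;  // number of bits in the filter
    private final int k;     // number of hash probes per string

    public StringBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th probe position via double hashing on String.hashCode().
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1;  // force the step to be odd
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String s) {
        for (int i = 0; i < k; i++) bits.set(index(s, i));
    }

    // false -> definitely absent; true -> possibly present (false positives occur)
    public boolean mightContain(String s) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(s, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        StringBloomFilter filter = new StringBloomFilter(8_000_000, 5);
        filter.add("194324:2532241:29");
        System.out.println(filter.mightContain("194324:2532241:29"));  // true
        System.out.println(filter.mightContain("999:999:99"));         // almost certainly false
    }
}
```

A negative answer from mightContain() means the string is definitely absent; a positive answer must still be confirmed against the authoritative set, since Bloom filters allow false positives.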

Comments

  • Ben
    Ben almost 2 years

    Basically I have about 1,000,000 strings; for each request I have to check whether a string belongs to the list or not.

    I'm worried about the performance, so what's the best method? ArrayList? Hash?

  • Rup
    Rup almost 14 years
    In general I'd agree with you, but he's worried about lookup performance - won't that add lots of overhead?
  • duffymo
    duffymo almost 14 years
    Network latency is added, but you have the full power of SQL at your disposal. Another consideration is memory - a million strings at 32 chars each means ~64MB of RAM. It's a classic CPU versus memory trade-off. I'd benchmark it and see.
  • Carl Smotricz
    Carl Smotricz almost 14 years
    @Rup: Absolutely. And lots of opportunities for errors. If the data fits in memory (and it must, as they've already crammed it in) then it should be sought in memory.
  • Carl Smotricz
    Carl Smotricz almost 14 years
    @duffymo: For a straight test on existence, nothing you can do in a DB server will ever approach the performance of a contains() in a hash.
  • oopbase
    oopbase almost 14 years
    @Carl Smotricz & Rup: I didn't know that. So thank you for your comments.
  • Andreas Dolk
    Andreas Dolk almost 14 years
    What else? It has O(1) for add and contains.
  • Carl Smotricz
    Carl Smotricz almost 14 years
    krock's answer is slightly better at nudging the OP to an optimal solution: A TreeSet has O(log2(N)) performance, while a HashSet ideally has O(1).
  • duffymo
    duffymo almost 14 years
    @Carl - no doubt, I couldn't agree more, but that's the trade-off. If one million strings becomes a billion the database choice starts looking better.
  • Pontus Gagge
    Pontus Gagge almost 14 years
    @Carl: I'd be worried about whether those one million strings really should be in memory, in the first place... what better uses could the system find for all that memory? Moving the data to a database might improve performance tremendously... or not. Before refactoring, Ben should get some measurements in place.
  • krock
    krock almost 14 years
    thanks @Andreas_D, I added the quote from the Javadoc which states it has constant time performance.
  • matbrgz
    matbrgz almost 14 years
    The fun part comes when the million strings won't fit in main memory anymore.
  • matbrgz
    matbrgz almost 14 years
    @Carl, assuming that both equals and hashCode() are O(1), i.e. not considering string lengths.
  • Stephen C
    Stephen C almost 14 years
    @Pontus - this is a clear case of where you trade off memory usage against search time. A back-of-the-envelope calculation based on the required request rate should tell you whether a database will work. If not, the in-memory approach is a good bet.
  • Ben
    Ben almost 14 years
    Thanks guys! I want to do this because this operation happens very frequently. It's a table in MySQL. I have to check for daily-based duplication before insert. Because it happens frequently, I designed a latency of 5 seconds for bulk insert, which means I cannot check duplication in real time to return feedback to the users. That's why I'm considering loading the daily data into memory and doing the check right in memory and in the queued data. --continue below
  • Ben
    Ben almost 14 years
    I have enough memory; the string is user1_id:user2_id:day, such as 194324:2532241:29. The DB server is on the local network, but it's kind of overloaded already. That's another reason why I use bulk insert with a transaction. So I think HashSet is the best choice so far, isn't it? :)
  • Lawrence Dol
    Lawrence Dol almost 14 years
    -1: He's worried about performance because he (a) has a huge data set, and (b) any half-way decent programmer worth his salt should always consider whether the performance characteristics of an algorithm or data structure are appropriate to the task.
  • Lawrence Dol
    Lawrence Dol almost 14 years
    +1: Although, it occurs to me that since strings are individually allocated, it may not be particularly relevant how many, in total, are in a specific hashmap, since a search is only going to hit a tiny percentage of them. More relevant might be the actual allocation pattern of the char arrays in the strings themselves, which the Java programmer has zero control over anyway (and that's a good thing).
  • mdma
    mdma almost 14 years
    @Software Monkey - the intent is that by putting the most frequently searched strings in its own map, there will be a high degree of hits for that map. A smaller hashmap with frequently used strings will have a higher cache hit rate than a larger map, since each cache line in the map's backing array will correspond to several frequently used strings. Of course, as you say, this doesn't help with the allocation of the strings themselves. If that is a problem, then allocating the most common strings first may give better cache use, since the VM may allocate from the same region of the heap.
  • Reishin
    Reishin about 2 years
    I guess that's why BigData exists... just use Spark :)