LINQ Performance for Large Collections

21,526

Solution 1

In your current code you don't make use of any of the special features of the Dictionary / SortedDictionary / HashSet collections, you are using them the same way that you would use a List. That is why you don't see any difference in performance.

If you use a dictionary as index where the first few characters of the string is the key and a list of strings is the value, you can from the search string pick out a small part of the entire collection of strings that has possible matches.

I wrote the class below to test this. If I populate it with a million strings and search with an eight character string it rips through all possible matches in about 3 ms. Searching with a one character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one character strings takes about 25 ms.

The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to select what indexes to create to optimise it for your conditions.

public class IndexedList {

    private class Index : Dictionary<string, List<string>> {

        private int _indexLength;

        public Index(int indexLength) {
            _indexLength = indexLength;
        }

        public void Add(string value) {
            if (value.Length >= _indexLength) {
                string key = value.Substring(0, _indexLength);
                List<string> list;
                if (!this.TryGetValue(key, out list)) {
                    Add(key, list = new List<string>());
                }
                list.Add(value);
            }
        }

        public IEnumerable<string> Find(string query, int limit) {
            return
                this[query.Substring(0, _indexLength)]
                .Where(s => s.Length > query.Length && s.StartsWith(query))
                .Take(limit);
        }

    }

    private Index _index1;
    private Index _index2;
    private Index _index4;
    private Index _index8;

    public IndexedList(IEnumerable<string> values) {
        _index1 = new Index(1);
        _index2 = new Index(2);
        _index4 = new Index(4);
        _index8 = new Index(8);
        foreach (string value in values) {
            _index1.Add(value);
            _index2.Add(value);
            _index4.Add(value);
            _index8.Add(value);
        }
    }

    public IEnumerable<string> Find(string query, int limit) {
        if (query.Length >= 8) return _index8.Find(query, limit);
        if (query.Length >= 4) return _index4.Find(query,limit);
        if (query.Length >= 2) return _index2.Find(query,limit);
        return _index1.Find(query, limit);
    }

}

Solution 2

I bet you have an index on the column so SQL server can do the comparison in O(log(n)) operations rather than O(n). To imitate the SQL server behavior, use a sorted collection and find all strings s such that s >= query and then look at values until you find a value that does not start with s and then do an additional filter on the values. This is what is called a range scan (Oracle) or an index seek (SQL server).

This is some example code which is very likely to go into infinite loops or have one-off errors because I didn't test it, but you should get the idea.

// Note, list must be sorted before being passed to this function
IEnumerable<string> FindStringsThatStartWith(List<string> list, string query) {
    int low = 0, high = list.Count - 1;
    while (high > low) {
        int mid = (low + high) / 2;
        if (list[mid] < query)
            low = mid + 1;
        else
            high = mid - 1;
    }

    while (low < list.Count && list[low].StartsWith(query) && list[low].Length > query.Length)
        yield return list[low];
        low++;
    }
}

Solution 3

If you're doing a "starts with", you only care about ordinal comparisons, and you can have the collection sorted (again in ordinal order) then I would suggest you have the values in a list. You can then binary search to find the first value which starts with the right prefix, then go down the list linearly yielding results until the first value which doesn't start with the right prefix.

In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".

None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.

Solution 4

If you are trying to optimize looking up a list of strings with a given prefix you might want to take a look at implementing a Trie (not to be mistaken with a regular tree) data structure in C#.

Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.

About LINQ to Objects in general. It's not unusual to have a speed reduction compared to SQL. The net is littered with articles analyzing its performance.

Share:
21,526
Peter J
Author by

Peter J

Updated on July 09, 2022

Comments

  • Peter J
    Peter J almost 2 years

    I have a large collection of strings (up to 1M) alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictionary. I am static caching the collection, it's up to 50MB in size, and I'm always calling the LINQ query against the cached collection. My problem is as follows:

    Regardless of collection type, performance is much poorer than SQL (up to 200ms). When doing a similar query against the underlying SQL tables, performance is much quicker ( 5-10ms). I have implemented my LINQ queries as follows:

    public static string ReturnSomething(string query, int limit)
    {
      StringBuilder sb = new StringBuilder();
      foreach (var stringitem in MyCollection.Where(
          x => x.StartsWith(query) && x.Length > q.Length).Take(limit))
      {
          sb.Append(stringitem);
      }
    
      return sb.ToString();
    }
    

    It is my understanding that the HashSet, Dictionary, etc. implement lookups using binary tree search instead of the standard enumeration. What are my options for high performance LINQ queries into the advanced collection types?

  • Peter J
    Peter J about 15 years
    Excellent! High performance and exactly what I was looking for. Would you recommend this method (modified of course) to query into properties on a collection of non-string objects?
  • Guffa
    Guffa about 15 years
    Yes, you could make the Index class generic and use a HashSet instead of a List, then you could create indexes for different properties and intersect the HashSets to narrow down the items to search.
  • Sam
    Sam over 12 years
    What about strings shorter than indexLength - Add() won't store them and Find() won't find them?
  • Guffa
    Guffa over 12 years
    @Sam: That is correct. An index for n characters will only contain the strings that are at least n characters. That's why it's good to create indexes for different lengths like in the example.
  • Guffa
    Guffa over 9 years
    Why the downvote? If you don't explain what it is that you think is wrong, it can't improve the answer.