Fastest way to search in a string collection

Solution 1

You could consider doing the filtering on a background thread, which would invoke a callback method when it is done, or simply restart filtering whenever the input changes.

The general idea is to be able to use it like this:

public partial class YourForm : Form
{
    private readonly BackgroundWordFilter _filter;

    public YourForm()
    {
        InitializeComponent();

        // set up the background filter to return no more than 10 items,
        // and to set ListBox.DataSource when results are ready

        _filter = new BackgroundWordFilter
        (
            items: GetDictionaryItems(),
            maxItemsToMatch: 10,
            callback: results => 
              this.Invoke(new Action(() => listBox_choices.DataSource = results))
        );
    }

    private void textBox_search_TextChanged(object sender, EventArgs e)
    {
        // this will update the background worker's "current entry"
        _filter.SetCurrentEntry(textBox_search.Text);
    }
}

A rough sketch would be something like:

public class BackgroundWordFilter : IDisposable
{
    private readonly List<string> _items;
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);
    private readonly Thread _workerThread;
    private readonly int _maxItemsToMatch;
    private readonly Action<List<string>> _callback;

    private volatile bool _shouldRun = true;
    private volatile string _currentEntry = null;

    public BackgroundWordFilter(
        List<string> items,
        int maxItemsToMatch,
        Action<List<string>> callback)
    {
        _items = items;
        _callback = callback;
        _maxItemsToMatch = maxItemsToMatch;

        // start the long-lived background thread
        _workerThread = new Thread(WorkerLoop)
        {
            IsBackground = true,
            Priority = ThreadPriority.BelowNormal
        };

        _workerThread.Start();
    }

    public void SetCurrentEntry(string currentEntry)
    {
        // set the current entry and signal the worker thread
        _currentEntry = currentEntry;
        _signal.Set();
    }

    void WorkerLoop()
    {
        while (_shouldRun)
        {
            // wait here until there is a new entry
            _signal.WaitOne();
            if (!_shouldRun)
                return;

            var entry = _currentEntry;
            var results = new List<string>();

            // if there is nothing to process,
            // return an empty list
            if (string.IsNullOrEmpty(entry))
            {
                _callback(results);
                continue;
            }

            // iterate explicitly (rather than using LINQ)
            // to allow early termination when the current
            // entry is changed on a different thread
            foreach (var i in _items)
            {
                // if matched, add to the list of results
                if (i.Contains(entry))
                    results.Add(i);

                // check if the current entry was updated in the meantime,
                // or we found enough items
                if (entry != _currentEntry || results.Count >= _maxItemsToMatch)
                    break;
            }

            if (entry == _currentEntry)
                _callback(results);
        }
    }

    public void Dispose()
    {
        // we are using AutoResetEvent and a background thread
        // and therefore must dispose it explicitly
        Dispose(true);
    }

    private void Dispose(bool disposing)
    {
        if (!disposing)
            return;

        // shutdown the thread
        if (_workerThread.IsAlive)
        {
            _shouldRun = false;
            _currentEntry = null;
            _signal.Set();
            _workerThread.Join();
        }

        // if targeting .NET 3.5 or older, we have to
        // use the explicit IDisposable implementation
        (_signal as IDisposable).Dispose();
    }
}

Also, you should dispose the _filter instance when the parent Form is disposed. This means you should open your Form's Dispose method (inside the YourForm.Designer.cs file) and edit it to look something like:

// inside "xxxxxx.Designer.cs"
protected override void Dispose(bool disposing)
{
    if (disposing)
    {
        if (_filter != null)
            _filter.Dispose();

        // this part is added by Visual Studio designer
        if (components != null)
            components.Dispose();
    }

    base.Dispose(disposing);
}

On my machine, it works pretty quickly, so you should test and profile this before going for a more complex solution.

That being said, a "more complex solution" would possibly be to cache the last couple of results, and then filter only those cached results whenever the new entry differs from the previous one by just an added first or last character.
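A minimal sketch of that idea (the class and member names here are illustrative, not from the answer above): if the new query merely extends the previous one, every new match must already be among the previous results, so we can filter that much smaller list. Note that this shortcut only holds if the previous result list was not truncated by a maxItemsToMatch-style limit.

```csharp
using System;
using System.Collections.Generic;

public class IncrementalFilter
{
    private readonly List<string> _items;
    private string _lastQuery;
    private List<string> _lastResults;

    public IncrementalFilter(List<string> items) { _items = items; }

    public List<string> Search(string query)
    {
        // if the new query contains the previous one (e.g. a character was
        // appended), it suffices to re-filter the previous, smaller result set
        var source = (_lastResults != null && query.Contains(_lastQuery))
            ? _lastResults
            : _items;

        var results = new List<string>();
        foreach (var item in source)
            if (item.Contains(query))
                results.Add(item);

        _lastQuery = query;
        _lastResults = results;
        return results;
    }
}
```

The second consecutive search then touches only the handful of previous matches instead of all 120,000 items.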

Solution 2

I've done some testing, and searching a list of 120,000 items and populating a new list with the matching entries takes a negligible amount of time (about 1/50th of a second, even if all strings are matched).

The problem you're seeing must therefore be coming from the populating of the data source, here:

listBox_choices.DataSource = ...

I suspect you are simply putting too many items into the listbox.

Perhaps you should try limiting it to the first 20 entries, like so:

listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text))
    .Take(20).ToList();

Also note (as others have pointed out) that you are accessing the TextBox.Text property for each item in allUsers. This can easily be fixed as follows:

string target = textBox_search.Text;
listBox_choices.DataSource = allUsers.Where(item => item.Contains(target))
    .Take(20).ToList();

However, I timed how long it takes to access TextBox.Text 500,000 times and it only took 0.7 seconds, far less than the 1 to 3 seconds mentioned in the OP. Still, this is a worthwhile optimisation.

Solution 3

Use a suffix tree as an index. Or rather, just build a sorted dictionary that associates every suffix of every name with the list of corresponding names.

For input:

Abraham
Barbara
Abram

The structure would look like:

a -> Barbara
abraham -> Abraham
abram -> Abram
aham -> Abraham
am -> Abraham, Abram
ara -> Barbara
arbara -> Barbara
bara -> Barbara
barbara -> Barbara
braham -> Abraham
bram -> Abram
ham -> Abraham
m -> Abraham, Abram
ra -> Barbara
raham -> Abraham
ram -> Abram
rbara -> Barbara

Search algorithm

Assume user input "bra".

  1. Bisect the sorted keys on the user input to find the input itself or the position where it would be inserted. This is the lower bound for "bra" - the first key not less than it, here "braham". The search takes logarithmic time.
  2. Iterate from the found key onwards while the keys still start with the user input. This yields "braham" -> Abraham and "bram" -> Abram.
  3. Concatenate the iteration results (Abraham, Abram) and output them.

Such structures are designed for quick substring lookup. Their lookup performance is close to O(log n). I believe this approach will work fast enough to be used by the GUI thread directly. Moreover, it will work faster than a threaded solution due to the absence of synchronization overhead.
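The build-and-search procedure described above can be sketched as follows (a rough illustration; the SuffixIndex name and the simple linear de-duplication are my own choices, not part of the answer):

```csharp
using System;
using System.Collections.Generic;

public static class SuffixIndex
{
    // Build: map every suffix of every name (lower-cased) to the
    // list of names that end with that suffix.
    public static SortedList<string, List<string>> Build(IEnumerable<string> names)
    {
        var index = new SortedList<string, List<string>>(StringComparer.Ordinal);
        foreach (var name in names)
        {
            var lower = name.ToLowerInvariant();
            for (int i = 0; i < lower.Length; i++)
            {
                var suffix = lower.Substring(i);
                List<string> bucket;
                if (!index.TryGetValue(suffix, out bucket))
                    index.Add(suffix, bucket = new List<string>());
                bucket.Add(name);
            }
        }
        return index;
    }

    // Search: binary-search the sorted keys for the lower bound of the
    // query, then scan forward while keys still start with the query.
    public static List<string> Search(SortedList<string, List<string>> index, string query)
    {
        query = query.ToLowerInvariant();
        IList<string> keys = index.Keys;
        int lo = 0, hi = keys.Count;
        while (lo < hi)                       // find the first key >= query
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(keys[mid], query) < 0) lo = mid + 1;
            else hi = mid;
        }
        var results = new List<string>();
        for (int i = lo; i < keys.Count && keys[i].StartsWith(query, StringComparison.Ordinal); i++)
            foreach (var name in index.Values[i])
                if (!results.Contains(name))  // a name can match via several suffixes
                    results.Add(name);
        return results;
    }
}
```

The trade-off is memory: a name of length k contributes k index entries, so 120,000 short names produce roughly a million keys, which is still manageable.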

Solution 4

You need either a text search engine (like Lucene.Net) or a database (you may consider an embedded one like SQL CE, SQLite, etc.). In other words, you need an indexed search. Hash-based search isn't applicable here, because you are searching for a substring, while hash-based search only works for exact values.

Otherwise it will be an iterative search that loops through the whole collection.

Solution 5

It might also be useful to have a "debounce" type of event. This differs from throttling in that it waits for a quiet period (for example, 200 ms) after the last change before firing the event.

See Debounce and Throttle: a visual explanation for more information about debouncing. I appreciate that this article is JavaScript focused, instead of C#, but the principle applies.

The advantage of this is that it doesn't search while you're still entering your query, and it stops the application from trying to perform two searches at once.
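A minimal debounce sketch, assuming a System.Threading.Timer and an illustrative quiet period (the Debouncer name and interval are my own; in WinForms, the callback fires on a thread-pool thread, so marshal back to the UI with Control.Invoke):

```csharp
using System;
using System.Threading;

public class Debouncer : IDisposable
{
    private readonly Timer _timer;
    private readonly int _delayMs;
    private Action _action;

    public Debouncer(int delayMs)
    {
        _delayMs = delayMs;
        // one-shot timer; runs the most recently supplied action
        _timer = new Timer(_ => { var a = _action; if (a != null) a(); });
    }

    // Each call restarts the countdown; the action only runs after
    // delayMs of "quiet time" with no further calls.
    public void Debounce(Action action)
    {
        _action = action;
        _timer.Change(_delayMs, Timeout.Infinite);
    }

    public void Dispose() { _timer.Dispose(); }
}
```

In the TextChanged handler you would then write something like `_debouncer.Debounce(() => RunSearch(textBox_search.Text));` so that rapid keystrokes trigger only one search.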



Author: etaiso

Updated on July 05, 2022

Comments

  • etaiso
    etaiso almost 2 years

    Problem:

    I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection.

    The search method will run every time the user changes the text of the TextBox, and the result should be the strings that contain the text in the TextBox.

    I don't have to change the list, just pull the results and put them in a ListBox.

    What I've tried so far:

    I tried with two different collections/containers, which I'm dumping the string entries from an external text file (once, of course):

    1. List<string> allUsers;
    2. HashSet<string> allUsers;

    With the following LINQ query:

    allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();

    My search event (fires when the user changes the search text):

    private void textBox_search_TextChanged(object sender, EventArgs e)
    {
        if (textBox_search.Text.Length > 2)
        {
            listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();
        }
        else
        {
            listBox_choices.DataSource = null;
        }
    }
    

    Results:

    Both gave me a poor response time (around 1-3 seconds between each key press).

    Question:

    Where do you think my bottleneck is? The collection I've used? The search method? Both?

    How can I get better performance and more fluent functionality?

    • Dennis
      Dennis over 10 years
      HashSet<T> won't help you here, because you're searching for part of a string.
    • WhileTrueSleep
      WhileTrueSleep over 10 years
      You may try a performance check while using parallel linq stackoverflow.com/questions/9138540/…
    • Adriano Repetti
      Adriano Repetti over 10 years
      Moreover you don't need ToList(). At least you do not need to create a true list. I would wrap IEnumerable within a class that implement only needed parts of IList.
    • Adriano Repetti
      Adriano Repetti over 10 years
      @WhileTrueSleep partitioning input and locking output collection may quickly become hard to tune.
    • WhileTrueSleep
      WhileTrueSleep over 10 years
      i wonder why you have these problems - i tested this on my machine and im coming up with results like 30ms (linq) or 8ms (using plinq)
    • Frames Catherine White
      Frames Catherine White over 10 years
      Do, you actually mean Contain? Or do you mean starts with? (Or perhaps, wordwise starts-with) If you mean starts-with, a treebased approch might work really well. In particular a Trie, or a variation there of. Scrolling down I see Groo suggested this also.
    • CodesInChaos
      CodesInChaos over 10 years
      Check out Suffix arrays.
    • Eric Lippert
      Eric Lippert over 10 years
      Don't ask "what's the fastest way to", because that would take literally weeks to years of research. Rather, say "I need a solution that runs in less than 30 ms", or whatever your performance goal is. You don't need the fastest device, you need a fast enough device.
    • Eric Lippert
      Eric Lippert over 10 years
      Also, get a profiler. Don't guess about where the slow part is; such guesses are often wrong. The bottleneck might be somewhere surprising.
    • Basilevs
      Basilevs over 10 years
      @EricLippert O(N) searches are always slow on large enough inputs. That's why theory wins. Sometimes.
    • rob
      rob over 10 years
      As @CodesInChaos mentioned, there are much better data structures for your use-case, such as a suffix array.
    • Andris
      Andris over 10 years
      @Basilevs Actually the bottleneck is the repeated call to textBox_search.Text which makes native API calls to get the content of the textbox. So practice wins yet again. See my answer below for details.
    • Eric Lippert
      Eric Lippert over 10 years
      @Basilevs: I once wrote a lovely O(1) hash table that was extremely slow in practice. I profiled it to find out why and discovered that on every search it was calling a method that -- no kidding -- ended up asking the registry "are we in Thailand right now?". Not caching whether the user is in Thailand was the bottleneck in that O(1) code. The location of the bottleneck can be deeply counterintuitive. Use a profiler.
    • SolutionYogi
      SolutionYogi over 10 years
      +100 for profiling. In my current product, we were doing millions of calculations over a list. It was not fast enough and we thought that we need to implement some sort of caching mechanism for our calculations. When we profiled the code, we found that the LINQ Sum method was a hot spot. Diving deeper, we learned that SUM method was getting called millions of times and was generating GC pressure to dispose off all the enumerators. We added ListSum method which indexed over the list using for loop and we saw 50% speed bump with reduced memory usage.
    • SolutionYogi
      SolutionYogi over 10 years
      Not in million years I would have guessed that LINQ SUM method is slowing us down. Please listen to experts and do not even attempt to guess which part is slow. Profile the code, look at the hot spot, eliminate it. Keep repeating till you achieve desired performance. I have used both ANTS profiler and dotTrace profiler in production and I highly recommend dotTrace profiler for the 'subsystems' feature (youtube.com/watch?v=iEU_SU6oV1c). It makes it very easy to identify bottlenecks in your system.
  • etaiso
    etaiso over 10 years
    I know it's a possibility. But my question here is if and how can I shorten this process?
  • animaonline
    animaonline over 10 years
    @etaiso it shouldn't really be a problem unless you're developing on some really low end hardware, make sure you're not running the debugger, CTRL + F5
  • Adriano Repetti
    Adriano Repetti over 10 years
    Are you sure the parallel overhead isn't much higher than String.Contains() itself? A parallel execution is possible but not so trivial.
  • Tim Schmelter
    Tim Schmelter over 10 years
    This is not a good candidate for PLINQ since the method String.Contains is not expensive. msdn.microsoft.com/en-us/library/dd997399.aspx
  • animaonline
    animaonline over 10 years
    @TimSchmelter when we're talking about tons of strings, it is!
  • animaonline
    animaonline over 10 years
    @TimSchmelter I'm not saying that it will change the implementation of that method, but we're talking about an IEnumerable with lots of strings...
  • Tarec
    Tarec over 10 years
    Assuming he's filtering his records with a prefix.
  • animaonline
    animaonline over 10 years
    @TimSchmelter I have no idea what you're trying to prove, using the code I provided will most likely increase the performance for the OP, and here's a benchmark that demonstrates how it works: pastebin.com/ATYa2BGt --- Period ---
  • Tarec
    Tarec over 10 years
    Notice he's using String.Contains() method instead of String.StartsWith(), so it may not be exactly what we're looking for. Still - your idea is doubtlessly better, than ordinary filtering with StartsWith() extension in prefix scenario.
  • Tim Schmelter
    Tim Schmelter over 10 years
    @animaonline: I stay corrected. The AsParallel seems to be really faster with those sample strings although that contradicts MSDN.
  • animaonline
    animaonline over 10 years
    @TimSchmelter have a nice day
  • Tarec
    Tarec over 10 years
    It does not contradict MSDN. The main factor of using PLINQ is "a cost of each delegate multiplied by the number of elements in the source collection". Processing 120k elements with string comparison is definitely a big enough task to justify parallelization.
  • animaonline
    animaonline over 10 years
    @Tarec please... let's end this conversation, we got it all figured out XD
  • Tarec
    Tarec over 10 years
    It is important to know not only that it is faster, but most of all WHY is it faster. EOT.
  • Adriano Repetti
    Adriano Repetti over 10 years
    @Tarec just FYI, I'm not here to discuss when it's better to make something parallel, tested in release (animaonline benchmark code) and parallel version is slightly slower. Tested in debug and parallel version is pretty slower.
  • Adriano Repetti
    Adriano Repetti over 10 years
    I don't agree to make it parallel but I absolutely agree with the other two points. It may even be enough to satisfy UI requirements.
  • animaonline
    animaonline over 10 years
    @Adriano don't forget that it also depends on your hardware, number of cores and the available memory, etc.
  • etaiso
    etaiso over 10 years
    Thanks Matthew. I tried your solution but I don't think the problem is with the population of the ListBox. I think that I need a better approach as this kind of filtering is very naive (for e.g - search for "abc" returns 0 results, then I should not even look for "abcX" and so on..)
  • Adriano Repetti
    Adriano Repetti over 10 years
    @animaonline yes, that's my point. Performance are pretty similar (in my case little bit worse) but very hw-dependent so I wouldn't suggest to make it parallel as solution because of that.
  • Adriano Repetti
    Adriano Repetti over 10 years
    @etaiso right (even if Matthew's solution may work great if you do not really need to preset all matches), that's why I suggested as second step to refine search instead of performing a full search each time.
  • Matthew Watson
    Matthew Watson over 10 years
    @etaiso Well, the searching time is negligible like I said. I tried it with 120,000 strings and searching for a very long string that gave no matches and a very short string that gave many matches both took under 1/50th second.
  • Groo
    Groo over 10 years
    As some have already pointed out, OP does not wants to limit the results to prefixes only (i.e. he uses Contains, not StartsWith). On a side note, it's usually better to use the generic ContainsKey method when searching for a key to avoid boxing, and even better to use TryGetValue to avoid two lookups.
  • etaiso
    etaiso over 10 years
    Forgot to mention that but I'm using .NET 3.5 so AsParallel is not an option.
  • Adriano Repetti
    Adriano Repetti over 10 years
    @Groo you're right, as I said it's for illustration purposes only. The point of that code isn't a working solution but a hint: if you tried everything else - avoid copies, refine search, move it to another thread - and it isn't enough then you have to change data structure you're using. Example is for beginning of string just to stay simple.
  • etaiso
    etaiso over 10 years
    @Adriano thanks for clear and detailed answer! I agree with most of things you mentioned but as Groo said, the last part of keeping the data organized is not applicable in my case. But I think maybe to hold a similar dictionary with keys as the contained letter (though there will still be duplicates)
  • etaiso
    etaiso over 10 years
    after a quick check and calculation the "contained letter" idea is not good for only one character (and if we go for combinations of two or more we will end up with a very large hash table)
  • Adriano Repetti
    Adriano Repetti over 10 years
    @etaiso yes, you may keep a list of two letters (to quickly reduce sub-lists) but a true tree may work better (each letter is linked to its successors, no matter where it is inside the string, so for "HOME" you have "H->O", "O->M" and "M->E". If you're searching for "om" you'll quickly find it). The problem is that it becomes quite a bit more complicated and may be too much for your scenario (IMO).
  • etaiso
    etaiso over 10 years
    I just tested your solution and it works perfectly! Nicely done. The only problem I have is that I can't make the _signal.Dispose(); to compile (error about protection-level).
  • Groo
    Groo over 10 years
    @etaiso: that's weird, where exactly are you calling _signal.Dispose()? Is it somewhere outside the BackgroundWordFilter class?
  • etaiso
    etaiso over 10 years
    No. Exactly as in your snippet.
  • etaiso
    etaiso over 10 years
    I don't think this will improve anything as the BeginUpdate and EndUpdate are intended to use when adding items individually or when using AddRange().
  • Andris
    Andris over 10 years
    It depends on how the DataSource property is implemented. It might be worth a try.
  • Groo
    Groo over 10 years
    That's weird. AutoResetEvent inherits from WaitHandle, which implements IDisposable.Dispose(), meaning that Dispose must be a public method. Is this running on Mono, perhaps? I really haven't encountered anything like that before.
  • Matthew Watson
    Matthew Watson over 10 years
    @Groo It's explicit implementation, meaning you can't call it directly. You're supposed to use either a using block, or call WaitHandle.Close()
  • Frames Catherine White
    Frames Catherine White over 10 years
    If he does mean starts with, then the Trie can be combined with the background worker approach for improved performance
  • Groo
    Groo over 10 years
    @MatthewWatson: Can you back that up with a link of some sort? MSDN says it's a public method, and I've been using it that way for ages (and it works on my machine for this concrete snippet too). I don't see why they would choose to hide it, and then also provide an explicit implementation.
  • Alex
    Alex over 10 years
    @Groo your own link to AutoResetEvent backs it up. Scroll down to "Explicit interface implementations"
  • Groo
    Groo over 10 years
    @AlexG: that only means it provides an explicit implementation. The Dispose method is public. Are you saying that new AutoResetEvent(true).Dispose(); won't compile on your machine?
  • Groo
    Groo over 10 years
    Ok, now it makes sense, the method was made public in .NET 4. The MSDN page for .NET 4 lists it under public methods, while the page for .NET 3.5 shows it under protected ones. That also explains why there is a conditional define in the Mono source for WaitHandle.
  • Groo
    Groo over 10 years
    @etaiso: ok, it turns out that the method was protected prior to .NET 4 (and I presume you're using .NET 3.5 since you didn't complain about the lambda. This means you can simply change that line to (_signal as IDisposable).Dispose();, unless you decide to switch to .NET 4.
  • Matthew Watson
    Matthew Watson over 10 years
    @Groo Sorry, I should have mentioned I was talking about an older version of .Net - sorry about the confusion! However note that he doesn't need to cast - he can call .Close() instead, which itself calls .Dispose().
  • Groo
    Groo over 10 years
    @Matthew: sure, that's an even better idea, I guess that's how they planned for it to be used. I wonder if static code analyzer would complain about pulling that off, though. :-D
  • Adriano Repetti
    Adriano Repetti over 10 years
    He's not working with beginning of string (that's why he's using String.Contains()). With Contains() a sorted list doesn't change performance.
  • CodesInChaos
    CodesInChaos over 10 years
    From what I know suffix arrays are usually a better choice than suffix trees. Easier to implement and lower memory usage.
  • Basilevs
    Basilevs over 10 years
    I propose SortedList which is very easy to build and maintain at cost of memory overhead which can be minimized by providing lists capacities.
  • Basilevs
    Basilevs over 10 years
    Also, it seems arrays (and original ST) are designed to handle large texts, while here we have large amount of short chunks which is different task.
  • hardsky
    hardsky over 10 years
    Yes, with 'Contains' it's useless. I like suggestion with sufix tree stackoverflow.com/a/21383731/994849 There is a lot interesting answers in the thread, but it depends how much time he can spend on this task.
  • Gabe
    Gabe over 10 years
    Does textBox_search.Text contribute a measurable amount to the time? Getting the Text property on a text box once for each of 120k strings probably sends 120k messages to the edit control window.
  • Gabe
    Gabe over 10 years
    +1 while everybody else was trying to optimize the search that only takes 30ms, you're the only person who recognized that the problem is actually in filling the list box.
  • Andris
    Andris over 10 years
    @Gabe Yes it does. See my answer for details.
  • Matthew Watson
    Matthew Watson over 10 years
    @Gabe Good point! I'll add that to the answer for completeness.
  • etaiso
    etaiso over 10 years
    Hashmap with what key? I want to be able to find keywords that contained in the strings.
  • OrangeDog
    OrangeDog over 10 years
    +1 for the good approach, but I'd use a hash map or actual search tree rather than manually searching a list.
  • OrangeDog
    OrangeDog over 10 years
    Indexing is a hash-based search. You just add all sub-strings as keys instead of just the value.
  • Dennis
    Dennis over 10 years
    @OrangeDog: disagree. indexed search can be implemented as hash-based search by index keys, but it isn't necessary, and it isn't a hash-based search by the string value itself.
  • jnovacho
    jnovacho over 10 years
    Is there any advantage of using suffix tree instead of prefix tree?
  • Gabe
    Gabe over 10 years
    Your profiling results are very different from mine. I was able to search 120k strings in 30ms, but adding them to the listbox took 4500ms. It sounds like you're adding 2.5M strings to the listbox in under 600ms. How is that possible?
  • Basilevs
    Basilevs over 10 years
    @OrangeDog, how do use hash map to search for prefix? I could not find a collection with built-in bisection in .NET could you propose one to use as search tree?
  • Basilevs
    Basilevs over 10 years
    @jnovacho, there is no advantage except for ability to search for a substring
  • OrangeDog
    OrangeDog over 10 years
    @Basilevs you add all the prefixes as hash keys. A trie is better for prefix-search though.
  • Basilevs
    Basilevs over 10 years
    Linking to all substrings would increase collection size proportionally to the word length (while trie might save up on string length we still have to deal with references). As it is already pretty large I would not try that.
  • user
    user over 10 years
    @Dennis Agree. +1 to cancel the ghost -1.
  • OrangeDog
    OrangeDog over 10 years
    The question is only concerned with access time, not memory usage. A search tree is preferred to a hash map because hash collision is going to erode the O(1) complexity advantage the hash map would otherwise have.
  • Andris
    Andris over 10 years
    @Gabe While profiling I used an input where the filter text eliminated a big part of the original list. If I use an input where the filter text removes nothing from the list I get similar results to yours. I will update my response to clarify what I have measured.
  • Caramiriel
    Caramiriel over 10 years
    +1 because implementations like a text search engine have smart(er) optimizations than string.Contains. Ie. searching for ba in bcaaaabaa will result in a (indexed) skip list. The first b is considered, but doesn't match because the next is a c, so it will skip to the next b.
  • dada
    dada over 10 years
    For the key you can put the number in the list; if you want something more complicated you can add the number plus the name, the choice is yours.
  • dada
    dada over 10 years
    For the rest, either I didn't read everything or there was a bad explanation (probably both ;) ). [quote] I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection. [/quote] I thought it was just a string search.
  • Jennifer Miles
    Jennifer Miles over 9 years
    See for a C# implementation of an event throttler the EventThrotler class in the Algorithmia library: github.com/SolutionsDesign/Algorithmia/blob/master/…