Fastest way to search in a string collection

Solution 1

You could consider doing the filtering on a background thread, which would invoke a callback method when it is done, or simply restart filtering whenever the input changes.

The general idea is to be able to use it like this:

public partial class YourForm : Form
{
    private readonly BackgroundWordFilter _filter;

    public YourForm()
    {
        InitializeComponent();

        // set up the background filter to return no more than 10 items,
        // and to set ListBox.DataSource when results are ready

        _filter = new BackgroundWordFilter
        (
            items: GetDictionaryItems(),
            maxItemsToMatch: 10,
            callback: results => 
              this.Invoke(new Action(() => listBox_choices.DataSource = results))
        );
    }

    private void textBox_search_TextChanged(object sender, EventArgs e)
    {
        // this will update the background worker's "current entry"
        _filter.SetCurrentEntry(textBox_search.Text);
    }
}

A rough sketch would be something like:

public class BackgroundWordFilter : IDisposable
{
    private readonly List<string> _items;
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);
    private readonly Thread _workerThread;
    private readonly int _maxItemsToMatch;
    private readonly Action<List<string>> _callback;

    private volatile bool _shouldRun = true;
    private volatile string _currentEntry = null;

    public BackgroundWordFilter(
        List<string> items,
        int maxItemsToMatch,
        Action<List<string>> callback)
    {
        _items = items;
        _callback = callback;
        _maxItemsToMatch = maxItemsToMatch;

        // start the long-lived background thread
        _workerThread = new Thread(WorkerLoop)
        {
            IsBackground = true,
            Priority = ThreadPriority.BelowNormal
        };

        _workerThread.Start();
    }

    public void SetCurrentEntry(string currentEntry)
    {
        // set the current entry and signal the worker thread
        _currentEntry = currentEntry;
        _signal.Set();
    }

    void WorkerLoop()
    {
        while (_shouldRun)
        {
            // wait here until there is a new entry
            _signal.WaitOne();
            if (!_shouldRun)
                return;

            var entry = _currentEntry;
            var results = new List<string>();

            // if there is nothing to process,
            // return an empty list
            if (string.IsNullOrEmpty(entry))
            {
                _callback(results);
                continue;
            }

            // iterate explicitly (rather than using LINQ)
            // to allow early termination when the current
            // entry is changed on a different thread
            foreach (var i in _items)
            {
                // if matched, add to the list of results
                if (i.Contains(entry))
                    results.Add(i);

                // check if the current entry was updated in the meantime,
                // or we found enough items
                if (entry != _currentEntry || results.Count >= _maxItemsToMatch)
                    break;
            }

            if (entry == _currentEntry)
                _callback(results);
        }
    }

    public void Dispose()
    {
        // we are using AutoResetEvent and a background thread
        // and therefore must dispose it explicitly
        Dispose(true);
    }

    private void Dispose(bool disposing)
    {
        if (!disposing)
            return;

        // shutdown the thread
        if (_workerThread.IsAlive)
        {
            _shouldRun = false;
            _currentEntry = null;
            _signal.Set();
            _workerThread.Join();
        }

        // if targeting .NET 3.5 or older, we have to
        // use the explicit IDisposable implementation
        (_signal as IDisposable).Dispose();
    }
}

Also, you should dispose the _filter instance when the parent Form is disposed. This means you should open your Form's Dispose method (inside the YourForm.Designer.cs file) and edit it to look something like:

// inside "xxxxxx.Designer.cs"
protected override void Dispose(bool disposing)
{
    if (disposing)
    {
        if (_filter != null)
            _filter.Dispose();

        // this part is added by Visual Studio designer
        if (components != null)
            components.Dispose();
    }

    base.Dispose(disposing);
}

On my machine, it works pretty quickly, so you should test and profile this before going for a more complex solution.

That being said, a "more complex solution" would possibly be to cache the last couple of results, and then filter only those cached results whenever the new entry differs from the previous one by just an added first or last character.
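A minimal sketch of that idea (the class and member names here are illustrative, not from the answer above): if the new query merely extends the previous one, every new match must already be among the previous results, so we can filter that much smaller list. Note that this shortcut only holds if the previous result list was not truncated by a maxItemsToMatch-style limit.

```csharp
using System;
using System.Collections.Generic;

public class IncrementalFilter
{
    private readonly List<string> _items;
    private string _lastQuery;
    private List<string> _lastResults;

    public IncrementalFilter(List<string> items) { _items = items; }

    public List<string> Search(string query)
    {
        // if the new query contains the previous one (e.g. a character was
        // appended), it suffices to re-filter the previous, smaller result set
        var source = (_lastResults != null && query.Contains(_lastQuery))
            ? _lastResults
            : _items;

        var results = new List<string>();
        foreach (var item in source)
            if (item.Contains(query))
                results.Add(item);

        _lastQuery = query;
        _lastResults = results;
        return results;
    }
}
```

The second consecutive search then touches only the handful of previous matches instead of all 120,000 items.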

Solution 2

I've done some testing, and searching a list of 120,000 items and populating a new list with the matching entries takes a negligible amount of time (about 1/50th of a second, even if all strings are matched).

The problem you're seeing must therefore be coming from the populating of the data source, here:

listBox_choices.DataSource = ...

I suspect you are simply putting too many items into the listbox.

Perhaps you should try limiting it to the first 20 entries, like so:

listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text))
    .Take(20).ToList();

Also note (as others have pointed out) that you are accessing the TextBox.Text property for each item in allUsers. This can easily be fixed as follows:

string target = textBox_search.Text;
listBox_choices.DataSource = allUsers.Where(item => item.Contains(target))
    .Take(20).ToList();

However, I timed how long it takes to access TextBox.Text 500,000 times and it only took 0.7 seconds, far less than the 1 to 3 seconds mentioned in the OP. Still, this is a worthwhile optimisation.

Solution 3

Use a suffix tree as an index. Or rather, just build a sorted dictionary that associates every suffix of every name with the list of corresponding names.

For input:

Abraham
Barbara
Abram

The structure would look like:

a -> Barbara
abraham -> Abraham
abram -> Abram
aham -> Abraham
am -> Abraham, Abram
ara -> Barbara
arbara -> Barbara
bara -> Barbara
barbara -> Barbara
braham -> Abraham
bram -> Abram
ham -> Abraham
m -> Abraham, Abram
ra -> Barbara
raham -> Abraham
ram -> Abram
rbara -> Barbara

Search algorithm

Assume user input "bra".

  1. Bisect the sorted keys on the user input to find the input itself or the position where it would be inserted. This is the lower bound for "bra" - the first key not less than it, here "braham". The search takes logarithmic time.
  2. Iterate from the found key onwards while the keys still start with the user input. This yields "braham" -> Abraham and "bram" -> Abram.
  3. Concatenate the iteration results (Abraham, Abram) and output them.

Such structures are designed for quick substring lookup. Their lookup performance is close to O(log n). I believe this approach will work fast enough to be used by the GUI thread directly. Moreover, it will work faster than a threaded solution due to the absence of synchronization overhead.
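The build-and-search procedure described above can be sketched as follows (a rough illustration; the SuffixIndex name and the simple linear de-duplication are my own choices, not part of the answer):

```csharp
using System;
using System.Collections.Generic;

public static class SuffixIndex
{
    // Build: map every suffix of every name (lower-cased) to the
    // list of names that end with that suffix.
    public static SortedList<string, List<string>> Build(IEnumerable<string> names)
    {
        var index = new SortedList<string, List<string>>(StringComparer.Ordinal);
        foreach (var name in names)
        {
            var lower = name.ToLowerInvariant();
            for (int i = 0; i < lower.Length; i++)
            {
                var suffix = lower.Substring(i);
                List<string> bucket;
                if (!index.TryGetValue(suffix, out bucket))
                    index.Add(suffix, bucket = new List<string>());
                bucket.Add(name);
            }
        }
        return index;
    }

    // Search: binary-search the sorted keys for the lower bound of the
    // query, then scan forward while keys still start with the query.
    public static List<string> Search(SortedList<string, List<string>> index, string query)
    {
        query = query.ToLowerInvariant();
        IList<string> keys = index.Keys;
        int lo = 0, hi = keys.Count;
        while (lo < hi)                       // find the first key >= query
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(keys[mid], query) < 0) lo = mid + 1;
            else hi = mid;
        }
        var results = new List<string>();
        for (int i = lo; i < keys.Count && keys[i].StartsWith(query, StringComparison.Ordinal); i++)
            foreach (var name in index.Values[i])
                if (!results.Contains(name))  // a name can match via several suffixes
                    results.Add(name);
        return results;
    }
}
```

The trade-off is memory: a name of length k contributes k index entries, so 120,000 short names produce roughly a million keys, which is still manageable.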

Solution 4

You need either a text search engine (like Lucene.Net) or a database (you may consider an embedded one like SQL CE, SQLite, etc.). In other words, you need an indexed search. Hash-based search isn't applicable here, because you are searching for a substring, while hash-based search only works for exact values.

Otherwise it will be an iterative search that loops through the whole collection.

Solution 5

It might also be useful to have a "debounce" type of event. This differs from throttling in that it waits for a quiet period (for example, 200 ms) after the last change before firing the event.

See Debounce and Throttle: a visual explanation for more information about debouncing. I appreciate that this article is JavaScript focused, instead of C#, but the principle applies.

The advantage of this is that it doesn't search while you're still entering your query, and it stops the application from trying to perform two searches at once.
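A minimal debounce sketch, assuming a System.Threading.Timer and an illustrative quiet period (the Debouncer name and interval are my own; in WinForms, the callback fires on a thread-pool thread, so marshal back to the UI with Control.Invoke):

```csharp
using System;
using System.Threading;

public class Debouncer : IDisposable
{
    private readonly Timer _timer;
    private readonly int _delayMs;
    private Action _action;

    public Debouncer(int delayMs)
    {
        _delayMs = delayMs;
        // one-shot timer; runs the most recently supplied action
        _timer = new Timer(_ => { var a = _action; if (a != null) a(); });
    }

    // Each call restarts the countdown; the action only runs after
    // delayMs of "quiet time" with no further calls.
    public void Debounce(Action action)
    {
        _action = action;
        _timer.Change(_delayMs, Timeout.Infinite);
    }

    public void Dispose() { _timer.Dispose(); }
}
```

In the TextChanged handler you would then write something like `_debouncer.Debounce(() => RunSearch(textBox_search.Text));` so that rapid keystrokes trigger only one search.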



Author: etaiso

Updated on July 05, 2022

Comments

  • etaiso
    etaiso almost 2 years

    Problem:

    I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection.

    The search method will run every time the user changes the text of the TextBox, and the result should be the strings that contain the text in the TextBox.

    I don't have to change the list, just pull the results and put them in a ListBox.

    What I've tried so far:

    I tried with two different collections/containers, which I'm dumping the string entries from an external text file (once, of course):

    1. List<string> allUsers;
    2. HashSet<string> allUsers;

    With the following LINQ query:

    allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();

    My search event (fires when the user changes the search text):

    private void textBox_search_TextChanged(object sender, EventArgs e)
    {
        if (textBox_search.Text.Length > 2)
        {
            listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();
        }
        else
        {
            listBox_choices.DataSource = null;
        }
    }
    

    Results:

    Both gave me a poor response time (around 1-3 seconds between each key press).

    Question:

    Where do you think my bottleneck is? The collection I've used? The search method? Both?

    How can I get better performance and more fluent functionality?

    • Dennis
      Dennis over 10 years
      HashSet<T> won't help you here, because you're searching for part of a string.
    • WhileTrueSleep
      WhileTrueSleep over 10 years
      You may try a performance check while using parallel linq stackoverflow.com/questions/9138540/…
    • Adriano Repetti
      Adriano Repetti over 10 years
      Moreover you don't need ToList(). At least you do not need to create a true list. I would wrap IEnumerable within a class that implement only needed parts of IList.
    • Adriano Repetti
      Adriano Repetti over 10 years
      @WhileTrueSleep partitioning input and locking output collection may quickly become hard to tune.
    • WhileTrueSleep
      WhileTrueSleep over 10 years
      i wonder why you have these problems - i tested this on my machine and im coming up with results like 30ms (linq) or 8ms (using plinq)
    • Frames Catherine White
      Frames Catherine White over 10 years
      Do, you actually mean Contain? Or do you mean starts with? (Or perhaps, wordwise starts-with) If you mean starts-with, a treebased approch might work really well. In particular a Trie, or a variation there of. Scrolling down I see Groo suggested this also.
    • CodesInChaos
      CodesInChaos over 10 years
      Check out Suffix arrays.
    • Eric Lippert
      Eric Lippert over 10 years
      Don't ask "what's the fastest way to", because that would take literally weeks to years of research. Rather, say "I need a solution that runs in less than 30 ms", or whatever your performance goal is. You don't need the fastest device, you need a fast enough device.
    • Eric Lippert
      Eric Lippert over 10 years
      Also, get a profiler. Don't guess about where the slow part is; such guesses are often wrong. The bottleneck might be somewhere surprising.
    • Basilevs
      Basilevs over 10 years
      @EricLippert O(N) searches are always slow on large enough inputs. That's why theory wins. Sometimes.
    • rob
      rob over 10 years
      As @CodesInChaos mentioned, there are much better data structures for your use-case, such as a suffix array.
    • Andris
      Andris over 10 years
      @Basilevs Actually the bottleneck is the repeated call to textBox_search.Text which makes native API calls to get the content of the textbox. So practice wins yet again. See my answer below for details.
    • Eric Lippert
      Eric Lippert over 10 years
      @Basilevs: I once wrote a lovely O(1) hash table that was extremely slow in practice. I profiled it to find out why and discovered that on every search it was calling a method that -- no kidding -- ended up asking the registry "are we in Thailand right now?". Not caching whether the user is in Thailand was the bottleneck in that O(1) code. The location of the bottleneck can be deeply counterintuitive. Use a profiler.
    • SolutionYogi
      SolutionYogi over 10 years
      +100 for profiling. In my current product, we were doing millions of calculations over a list. It was not fast enough and we thought that we need to implement some sort of caching mechanism for our calculations. When we profiled the code, we found that the LINQ Sum method was a hot spot. Diving deeper, we learned that SUM method was getting called millions of times and was generating GC pressure to dispose off all the enumerators. We added ListSum method which indexed over the list using for loop and we saw 50% speed bump with reduced memory usage.
    • SolutionYogi
      SolutionYogi over 10 years
      Not in million years I would have guessed that LINQ SUM method is slowing us down. Please listen to experts and do not even attempt to guess which part is slow. Profile the code, look at the hot spot, eliminate it. Keep repeating till you achieve desired performance. I have used both ANTS profiler and dotTrace profiler in production and I highly recommend dotTrace profiler for the 'subsystems' feature (youtube.com/watch?v=iEU_SU6oV1c). It makes it very easy to identify bottlenecks in your system.
  • etaiso
    etaiso over 10 years
    I know it's a possibility. But my question here is if and how can I shorten this process?
  • animaonline
    animaonline over 10 years
    @etaiso it shouldn't really be a problem unless you're developing on some really low end hardware, make sure you're not running the debugger, CTRL + F5
  • Adriano Repetti
    Adriano Repetti over 10 years
    Are you sure the parallel overhead isn't much higher than String.Contains() itself? A parallel execution is possible but not so trivial.
  • Tim Schmelter
    Tim Schmelter over 10 years
    This is not a good candidate for PLINQ since the method String.Contains is not expensive. msdn.microsoft.com/en-us/library/dd997399.aspx
  • animaonline
    animaonline over 10 years
    @TimSchmelter when we're talking about tons of strings, it is!
  • animaonline
    animaonline over 10 years
    @TimSchmelter I'm not saying that it will change the implementation of that method, but we're talking about an IEnumerable with lots of strings...
  • Tarec
    Tarec over 10 years
    Assuming he's filtering his records with a prefix.
  • animaonline
    animaonline over 10 years
    @TimSchmelter I have no idea what you're trying to prove, using the code I provided will most likely increase the performance for the OP, and here's a benchmark that demonstrates how it works: pastebin.com/ATYa2BGt --- Period ---
  • Tarec
    Tarec over 10 years
    Notice he's using String.Contains() method instead of String.StartsWith(), so it may not be exactly what we're looking for. Still - your idea is doubtlessly better, than ordinary filtering with StartsWith() extension in prefix scenario.
  • Tim Schmelter
    Tim Schmelter over 10 years
    @animaonline: I stay corrected. The AsParallel seems to be really faster with those sample strings although that contradicts MSDN.
  • animaonline
    animaonline over 10 years
    @TimSchmelter have a nice day
  • Tarec
    Tarec over 10 years
    It does not contradict MSDN. The main factor of using PLINQ is "a cost of each delegate multiplied by the number of elements in the source collection". Processing 120k elements with string comparison is definitely a big enough task to justify parallelization.
  • animaonline
    animaonline over 10 years
    @Tarec please... let's end this conversation, we got it all figured out XD
  • Tarec
    Tarec over 10 years
    It is important to know not only that it is faster, but most of all WHY is it faster. EOT.
  • Adriano Repetti
    Adriano Repetti over 10 years
    @Tarec just FYI, I'm not here to discuss when it's better to make something parallel, tested in release (animaonline benchmark code) and parallel version is slightly slower. Tested in debug and parallel version is pretty slower.
  • Adriano Repetti
    Adriano Repetti over 10 years
    I don't agree to make it parallel but I absolutely agree with the other two points. It may even be enough to satisfy UI requirements.
  • animaonline
    animaonline over 10 years
    @Adriano don't forget that it also depends on your hardware, number of cores and the available memory, etc.
  • etaiso
    etaiso over 10 years
    Thanks Matthew. I tried your solution but I don't think the problem is with the population of the ListBox. I think that I need a better approach as this kind of filtering is very naive (for e.g - search for "abc" returns 0 results, then I should not even look for "abcX" and so on..)
  • Adriano Repetti
    Adriano Repetti over 10 years
    @animaonline yes, that's my point. Performance are pretty similar (in my case little bit worse) but very hw-dependent so I wouldn't suggest to make it parallel as solution because of that.
  • Adriano Repetti
    Adriano Repetti over 10 years
    @etaiso right (even if Matthew's solution may work great if you do not really need to preset all matches), that's why I suggested as second step to refine search instead of performing a full search each time.
  • Matthew Watson
    Matthew Watson over 10 years
    @etaiso Well, the searching time is negligible like I said. I tried it with 120,000 strings and searching for a very long string that gave no matches and a very short string that gave many matches both took under 1/50th second.
  • Groo
    Groo over 10 years
    As some have already pointed out, OP does not wants to limit the results to prefixes only (i.e. he uses Contains, not StartsWith). On a side note, it's usually better to use the generic ContainsKey method when searching for a key to avoid boxing, and even better to use TryGetValue to avoid two lookups.
  • etaiso
    etaiso over 10 years
    Forgot to mention that but I'm using .NET 3.5 so AsParallel is not an option.
  • Adriano Repetti
    Adriano Repetti over 10 years
    @Groo you're right, as I said it's for illustration purposes only. The point of that code isn't a working solution but a hint: if you tried everything else - avoid copies, refine search, move it to another thread - and it isn't enough then you have to change data structure you're using. Example is for beginning of string just to stay simple.
  • etaiso
    etaiso over 10 years
    @Adriano thanks for clear and detailed answer! I agree with most of things you mentioned but as Groo said, the last part of keeping the data organized is not applicable in my case. But I think maybe to hold a similar dictionary with keys as the contained letter (though there will still be duplicates)
  • etaiso
    etaiso over 10 years
    after a quick check and calculation the "contained letter" idea is not good for only one character (and if we go for combinations of two or more we will end up with a very large hash table)
  • Adriano Repetti
    Adriano Repetti over 10 years
    @etaiso yes, you may keep a list of two letters (to quickly reduce sub-lists) but a true tree may work better (each letter is linked to its successors, no matter where it is inside the string, so for "HOME" you have "H->O", "O->M" and "M->E". If you're searching for "om" you'll quickly find it). The problem is that it becomes quite a bit more complicated and may be too much for your scenario (IMO).
  • etaiso
    etaiso over 10 years
    I just tested your solution and it works perfectly! Nicely done. The only problem I have is that I can't make the _signal.Dispose(); to compile (error about protection-level).
  • Groo
    Groo over 10 years
    @etaiso: that's weird, where exactly are you calling _signal.Dispose()? Is it somewhere outside the BackgroundWordFilter class?
  • etaiso
    etaiso over 10 years
    No. Exactly as in your snippet.
  • etaiso
    etaiso over 10 years
    I don't think this will improve anything as the BeginUpdate and EndUpdate are intended to use when adding items individually or when using AddRange().
  • Andris
    Andris over 10 years
    It depends on how the DataSource property is implemented. It might be worth a try.
  • Groo
    Groo over 10 years
    That's weird. AutoResetEvent inherits from WaitHandle, which implements IDisposable.Dispose(), meaning that Dispose must be a public method. Is this running on Mono, perhaps? I really haven't encountered anything like that before.
  • Matthew Watson
    Matthew Watson over 10 years
    @Groo It's explicit implementation, meaning you can't call it directly. You're supposed to use either a using block, or call WaitHandle.Close()
  • Frames Catherine White
    Frames Catherine White over 10 years
    If he does mean starts with, then the Trie can be combined with the background worker approach for improved performance
  • Groo
    Groo over 10 years
    @MatthewWatson: Can you back that up with a link of some sort? MSDN says it's a public method, and I've been using it that way for ages (and it works on my machine for this concrete snippet too). I don't see why they would choose to hide it, and then also provide an explicit implementation.
  • Alex
    Alex over 10 years
    @Groo your own link to AutoResetEvent backs it up. Scroll down to "Explicit interface implementations"
  • Groo
    Groo over 10 years
    @AlexG: that only means it provides an explicit implementation. The Dispose method is public. Are you saying that new AutoResetEvent(true).Dispose(); won't compile on your machine?
  • Groo
    Groo over 10 years
    Ok, now it makes sense, the method was made public in .NET 4. The MSDN page for .NET 4 lists it under public methods, while the page for .NET 3.5 shows it under protected ones. That also explains why there is a conditional define in the Mono source for WaitHandle.
  • Groo
    Groo over 10 years
    @etaiso: ok, it turns out that the method was protected prior to .NET 4 (and I presume you're using .NET 3.5 since you didn't complain about the lambda. This means you can simply change that line to (_signal as IDisposable).Dispose();, unless you decide to switch to .NET 4.
  • Matthew Watson
    Matthew Watson over 10 years
    @Groo Sorry, I should have mentioned I was talking about an older version of .Net - sorry about the confusion! However note that he doesn't need to cast - he can call .Close() instead, which itself calls .Dispose().
  • Groo
    Groo over 10 years
    @Matthew: sure, that's an even better idea, I guess that's how they planned for it to be used. I wonder if static code analyzer would complain about pulling that off, though. :-D
  • Adriano Repetti
    Adriano Repetti over 10 years
    He's not working with beginning of string (that's why he's using String.Contains()). With Contains() a sorted list doesn't change performance.
  • CodesInChaos
    CodesInChaos over 10 years
    From what I know suffix arrays are usually a better choice than suffix trees. Easier to implement and lower memory usage.
  • Basilevs
    Basilevs over 10 years
    I propose SortedList which is very easy to build and maintain at cost of memory overhead which can be minimized by providing lists capacities.
  • Basilevs
    Basilevs over 10 years
    Also, it seems arrays (and original ST) are designed to handle large texts, while here we have large amount of short chunks which is different task.
  • hardsky
    hardsky over 10 years
    Yes, with 'Contains' it's useless. I like suggestion with sufix tree stackoverflow.com/a/21383731/994849 There is a lot interesting answers in the thread, but it depends how much time he can spend on this task.
  • Gabe
    Gabe over 10 years
    Does textBox_search.Text contribute a measurable amount to the time? Getting the Text property on a text box once for each of 120k strings probably sends 120k messages to the edit control window.
  • Gabe
    Gabe over 10 years
    +1 while everybody else was trying to optimize the search that only takes 30ms, you're the only person who recognized that the problem is actually in filling the list box.
  • Andris
    Andris over 10 years
    @Gabe Yes it does. See my answer for details.
  • Matthew Watson
    Matthew Watson over 10 years
    @Gabe Good point! I'll add that to the answer for completeness.
  • etaiso
    etaiso over 10 years
    Hashmap with what key? I want to be able to find keywords that contained in the strings.
  • OrangeDog
    OrangeDog over 10 years
    +1 for the good approach, but I'd use a hash map or actual search tree rather than manually searching a list.
  • OrangeDog
    OrangeDog over 10 years
    Indexing is a hash-based search. You just add all sub-strings as keys instead of just the value.
  • Dennis
    Dennis over 10 years
    @OrangeDog: disagree. indexed search can be implemented as hash-based search by index keys, but it isn't necessary, and it isn't a hash-based search by the string value itself.
  • jnovacho
    jnovacho over 10 years
    Is there any advantage of using suffix tree instead of prefix tree?
  • Gabe
    Gabe over 10 years
    Your profiling results are very different from mine. I was able to search 120k strings in 30ms, but adding them to the listbox took 4500ms. It sounds like you're adding 2.5M strings to the listbox in under 600ms. How is that possible?
  • Basilevs
    Basilevs over 10 years
    @OrangeDog, how do use hash map to search for prefix? I could not find a collection with built-in bisection in .NET could you propose one to use as search tree?
  • Basilevs
    Basilevs over 10 years
    @jnovacho, there is no advantage except for ability to search for a substring
  • OrangeDog
    OrangeDog over 10 years
    @Basilevs you add all the prefixes as hash keys. A trie is better for prefix-search though.
  • Basilevs
    Basilevs over 10 years
    Linking to all substrings would increase collection size proportionally to the word length (while trie might save up on string length we still have to deal with references). As it is already pretty large I would not try that.
  • user
    user over 10 years
    @Dennis Agree. +1 to cancel the ghost -1.
  • OrangeDog
    OrangeDog over 10 years
    The question is only concerned with access time, not memory usage. A search tree is preferred to a hash map because hash collision is going to erode the O(1) complexity advantage the hash map would otherwise have.
  • Andris
    Andris over 10 years
    @Gabe While profiling I used an input where the filter text eliminated a big part of the original list. If I use an input where the filter text removes nothing from the list I get similar results to yours. I will update my response to clarify what I have measured.
  • Caramiriel
    Caramiriel over 10 years
    +1 because implementations like a text search engine have smart(er) optimizations than string.Contains. Ie. searching for ba in bcaaaabaa will result in a (indexed) skip list. The first b is considered, but doesn't match because the next is a c, so it will skip to the next b.
  • dada
    dada over 10 years
    For the key you can put the number in the list; if you want something more complicated you can add the number plus the name, the choice is yours.
  • dada
    dada over 10 years
    For the rest, either I didn't read everything or there was a bad explanation (probably both ;) ). [quote] I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection. [/quote] I thought it was just a string search.
  • Jennifer Miles
    Jennifer Miles over 9 years
    See for a C# implementation of an event throttler the EventThrotler class in the Algorithmia library: github.com/SolutionsDesign/Algorithmia/blob/master/…