Finding all positions of substring in a larger string in C#

120,123

Solution 1

Here's an example extension method for it:

public static List<int> AllIndexesOf(this string str, string value) {
    if (String.IsNullOrEmpty(value))
        throw new ArgumentException("the string to find may not be empty", "value");
    List<int> indexes = new List<int>();
    for (int index = 0;; index += value.Length) {
        index = str.IndexOf(value, index);
        if (index == -1)
            return indexes;
        indexes.Add(index);
    }
}

If you put this into a static class and import the namespace with using, it appears as a method on any string, and you can just do:

List<int> indexes = "fooStringfooBar".AllIndexesOf("foo");

For more information on extension methods, http://msdn.microsoft.com/en-us/library/bb383977.aspx

Also the same using an iterator:

public static IEnumerable<int> AllIndexesOf(this string str, string value) {
    if (String.IsNullOrEmpty(value))
        throw new ArgumentException("the string to find may not be empty", "value");
    for (int index = 0;; index += value.Length) {
        index = str.IndexOf(value, index);
        if (index == -1)
            break;
        yield return index;
    }
}

Solution 2

Why don't you use the built in RegEx class:

public static IEnumerable<int> GetAllIndexes(this string source, string matchString)
{
   matchString = Regex.Escape(matchString);
   foreach (Match match in Regex.Matches(source, matchString))
   {
      yield return match.Index;
   }
}

If you do need to reuse the expression then compile it and cache it somewhere. Change the matchString param to a Regex matchExpression in another overload for the reuse case.

Solution 3

using LINQ

public static IEnumerable<int> IndexOfAll(this string sourceString, string subString)
{
    return Regex.Matches(sourceString, subString).Cast<Match>().Select(m => m.Index);
}

Solution 4

Polished version + case ignoring support:

public static int[] AllIndexesOf(string str, string substr, bool ignoreCase = false)
{
    if (string.IsNullOrWhiteSpace(str) ||
        string.IsNullOrWhiteSpace(substr))
    {
        throw new ArgumentException("String or substring is not specified.");
    }

    var indexes = new List<int>();
    int index = 0;

    while ((index = str.IndexOf(substr, index, ignoreCase ? StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal)) != -1)
    {
        indexes.Add(index++);
    }

    return indexes.ToArray();
}

Solution 5

It could be done in efficient time complexity using KMP algorithm in O(N + M) where N is the length of text and M is the length of the pattern.

This is the implementation and usage:

static class StringExtensions
{
    public static IEnumerable<int> AllIndicesOf(this string text, string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
        {
            throw new ArgumentNullException(nameof(pattern));
        }
        return Kmp(text, pattern);
    }

    private static IEnumerable<int> Kmp(string text, string pattern)
    {
        int M = pattern.Length;
        int N = text.Length;

        int[] lps = LongestPrefixSuffix(pattern);
        int i = 0, j = 0; 

        while (i < N)
        {
            if (pattern[j] == text[i])
            {
                j++;
                i++;
            }
            if (j == M)
            {
                yield return i - j;
                j = lps[j - 1];
            }

            else if (i < N && pattern[j] != text[i])
            {
                if (j != 0)
                {
                    j = lps[j - 1];
                }
                else
                {
                    i++;
                }
            }
        }
    }

    private static int[] LongestPrefixSuffix(string pattern)
    {
        int[] lps = new int[pattern.Length];
        int length = 0;
        int i = 1;

        while (i < pattern.Length)
        {
            if (pattern[i] == pattern[length])
            {
                length++;
                lps[i] = length;
                i++;
            }
            else
            {
                if (length != 0)
                {
                    length = lps[length - 1];
                }
                else
                {
                    lps[i] = length;
                    i++;
                }
            }
        }
        return lps;
    }

and this is an example of how to use it:

static void Main(string[] args)
    {
        string text = "this is a test";
        string pattern = "is";
        foreach (var index in text.AllIndicesOf(pattern))
        {
            Console.WriteLine(index); // 2 5
        }
    }
Share:
120,123
caesay
Author by

caesay

I'm in a committed relationship with good code, so if we go out it will only be as friends. Feel free to contact me by email at mk.stackexchange&lt;a&gt;caesay.com

Updated on August 09, 2021

Comments

  • caesay
    caesay almost 3 years

    I have a large string I need to parse, and I need to find all the instances of extract"(me,i-have lots. of]punctuation, and store the index of each to a list.

    So say this piece of string was in the beginning and middle of the larger string, both of them would be found, and their indexes would be added to the List. and the List would contain 0 and the other index whatever it would be.

    I've been playing around, and the string.IndexOf does almost what I'm looking for, and I've written some code - but it's not working and I've been unable to figure out exactly what is wrong:

    List<int> inst = new List<int>();
    int index = 0;
    while (index < source.LastIndexOf("extract\"(me,i-have lots. of]punctuation", 0) + 39)
    {
        int src = source.IndexOf("extract\"(me,i-have lots. of]punctuation", index);
        inst.Add(src);
        index = src + 40;
    }
    
    • inst = The list
    • source = The large string

    Any better ideas?

  • m0sa
    m0sa about 14 years
    Why not use IEnumerable<int> and yield return index instead of the indexes list?
  • csaam
    csaam about 14 years
    You forgot to escape the subString though.
  • Matti Virkkunen
    Matti Virkkunen about 14 years
    @m0sa: Good point. Added another version just for the fun of it.
  • caesay
    caesay over 11 years
    if his code was incorrect you could've edited his post to correct it
  • arame3333
    arame3333 over 11 years
    I hadn't noticed that. I have to admit to being reluctant to do that, just in case I am wrong, although I don't think I am.
  • PedroC88
    PedroC88 over 10 years
    Does the use of yield have any incidence in the performance? Does the retrieval of the objects becomes async or lazy? Os is it just syntactic sugar and both codes are the same?
  • Matti Virkkunen
    Matti Virkkunen over 10 years
    @PedroC88: Using yield will make the code "lazy". It won't collect all the indexes into an in-memory list within the method. What kind of practical effect that has on performance depends on a lot of factors.
  • Anshul
    Anshul about 8 years
    This doesn't compile
  • Saggio
    Saggio about 8 years
    what is indexes? It's not defined anywhere.
  • csaam
    csaam about 8 years
    My bad it's a remnant. Delete that line.
  • Marek Woźniak
    Marek Woźniak almost 8 years
    that's not good idea to use regex for large string. The approach takes a lot of memory.
  • Pranay Deep
    Pranay Deep over 7 years
    public static List<int> AllIndexesOf(this string str, string value) { if (String.IsNullOrEmpty(value)) throw new ArgumentException("the string to find may not be empty", "value"); List<int> indexes = new List<int>(); for (int index = 0;; index += value.Length) { index = str.IndexOf(value, index); if (index == -1) return indexes; indexes.Add(index); index--; } } This covers tests cases like AOOAOOA Output 0 and 3
  • Paul
    Paul over 7 years
    I'm confused by the exception text provided. You're telling the user that the string may not be empty if the string is null or empty?
  • Matti Virkkunen
    Matti Virkkunen over 7 years
    @Paul: "May not" as in "must not". If you don't like the wording you can always suggest an edit, but I don't think it's that hard to understand.
  • Paul
    Paul over 7 years
    Aha. Completely missed the meaning of the message. Thank you.
  • Christoph Meißner
    Christoph Meißner about 7 years
    Attention! Due to adding value.Length you may miss nested matches! Example: "This is a NestedNestedNested match test!" with matching for "NestedNested" will find only one index, but not the nested one. To fix this just add +=1 in loop instead of +=value.Length.
  • user280498
    user280498 over 6 years
    Beware that this method has the same flaw as the accepted answer. If your source string is "ccc" and pattern is "cc" then it will return only one occurrence.
  • ScubaSteve
    ScubaSteve over 5 years
    @MattiVirkkunen I like this answer. I would change one thing. Instead of throwing an exception, just return null or empty list as this would show that nothing was found without run time exceptions being thrown.
  • caesay
    caesay about 5 years
    Hi Mark, I see that you're new to stackoverflow. This answer doesn't add anything to this old question, there are already much better answers. If answering a question like this in the future, please try to explain why your answer contains some information or value which does not already exist in other answers.
  • Denny Jacob
    Denny Jacob about 4 years
    This is preferable to the accepted solution because of its lower cyclomatic complexity.
  • M.Khooryani
    M.Khooryani almost 4 years
    Your solution is not correct! for finding "aa" in "aaa", the output should be [0, 1] but your algorithm returns [0]
  • Matti Virkkunen
    Matti Virkkunen almost 4 years
    @M.Khooryani Depends on the definition of correct. This finds non-overlapping substrings, which seemed to be what OP was looking for.
  • caesay
    caesay almost 4 years
    What is the performance of this compared to the optimal IndexOf implementation, where the search start index is set to the end of the previous match on each iteration?
  • M.Khooryani
    M.Khooryani almost 4 years
    Comparing IndexOf with AllIndicesOf is incorrect since their output is different. Using the IndexOf method in each iteration, enormously increases the time complexity to O(N^2 M), while optimal complexity is O(N+M). KMP doesn't work similarly to the naive approach, it uses a precomputed array(LPS) to avoid searching from the start. Recommend you read the KMP algorithm. The last paragraphs of the "Background" section in Wikipedia explain how it works in O(N).
  • b-pos465
    b-pos465 over 3 years
    There are two problems here: First, you always add -1 to your result list, which is not a valid result. Second, the code does not terminate due to indexOf returning -1 and the index++. I would use a while (true) with a break; if the result of IndexOf is -1.
  • DAG
    DAG over 3 years
    You should throw an exception
  • Kylo Ren
    Kylo Ren over 3 years
    seems not working, for "aaa" and "aa" inputs.
  • Kiquenet
    Kiquenet over 3 years
    Other string-searching algorithms in C# ?
  • ehosca
    ehosca almost 3 years
    that is the expected behavior... when you match "aa" in "aaa" what remains is "xxa" that does not match "aa"
  • Gert Arnold
    Gert Arnold almost 3 years
    Please don't aks questions in answers. Explain instead. In this case: explain why your answer improves all other answers given here.
  • drdwilcox
    drdwilcox over 2 years
    I do not understand why more people have not accepted this answer. It performs better on long input strings and with long search strings. It is more complex to read, but taking it as a black box and using it as the go-to for solving this problem with long input strings is the way to go.
  • Mr. Squirrel.Downy
    Mr. Squirrel.Downy about 2 years
    @user280498 This is not necessarily a "flaw" depending on the usage scenario...
  • martinoss
    martinoss about 2 years
    Warning: IndexOutOfRangeException when string contains German "ß" character. Example: "Vertrag abschließen", search "abschliessen". To fix it, go with the suggestion from @ChristophMeißner to replace +=value.Length with +=1
  • M.Khooryani
    M.Khooryani almost 2 years
    @drdwilcox because this answer was posted 10 years after the verified answer.