Finding all positions of substring in a larger string in C#
Solution 1
Here's an example extension method for it:
public static List<int> AllIndexesOf(this string str, string value) {
if (String.IsNullOrEmpty(value))
throw new ArgumentException("the string to find may not be empty", "value");
List<int> indexes = new List<int>();
for (int index = 0;; index += value.Length) {
index = str.IndexOf(value, index);
if (index == -1)
return indexes;
indexes.Add(index);
}
}
If you put this into a static class and import the namespace with using
, it appears as a method on any string, and you can just do:
List<int> indexes = "fooStringfooBar".AllIndexesOf("foo");
For more information on extension methods, http://msdn.microsoft.com/en-us/library/bb383977.aspx
Also the same using an iterator:
public static IEnumerable<int> AllIndexesOf(this string str, string value) {
if (String.IsNullOrEmpty(value))
throw new ArgumentException("the string to find may not be empty", "value");
for (int index = 0;; index += value.Length) {
index = str.IndexOf(value, index);
if (index == -1)
break;
yield return index;
}
}
Solution 2
Why don't you use the built in RegEx class:
public static IEnumerable<int> GetAllIndexes(this string source, string matchString)
{
matchString = Regex.Escape(matchString);
foreach (Match match in Regex.Matches(source, matchString))
{
yield return match.Index;
}
}
If you do need to reuse the expression then compile it and cache it somewhere. Change the matchString param to a Regex matchExpression in another overload for the reuse case.
Solution 3
using LINQ
public static IEnumerable<int> IndexOfAll(this string sourceString, string subString)
{
return Regex.Matches(sourceString, subString).Cast<Match>().Select(m => m.Index);
}
Solution 4
Polished version + case ignoring support:
public static int[] AllIndexesOf(string str, string substr, bool ignoreCase = false)
{
if (string.IsNullOrWhiteSpace(str) ||
string.IsNullOrWhiteSpace(substr))
{
throw new ArgumentException("String or substring is not specified.");
}
var indexes = new List<int>();
int index = 0;
while ((index = str.IndexOf(substr, index, ignoreCase ? StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal)) != -1)
{
indexes.Add(index++);
}
return indexes.ToArray();
}
Solution 5
It could be done in efficient time complexity using KMP algorithm in O(N + M) where N is the length of text
and M is the length of the pattern
.
This is the implementation and usage:
static class StringExtensions
{
public static IEnumerable<int> AllIndicesOf(this string text, string pattern)
{
if (string.IsNullOrEmpty(pattern))
{
throw new ArgumentNullException(nameof(pattern));
}
return Kmp(text, pattern);
}
private static IEnumerable<int> Kmp(string text, string pattern)
{
int M = pattern.Length;
int N = text.Length;
int[] lps = LongestPrefixSuffix(pattern);
int i = 0, j = 0;
while (i < N)
{
if (pattern[j] == text[i])
{
j++;
i++;
}
if (j == M)
{
yield return i - j;
j = lps[j - 1];
}
else if (i < N && pattern[j] != text[i])
{
if (j != 0)
{
j = lps[j - 1];
}
else
{
i++;
}
}
}
}
private static int[] LongestPrefixSuffix(string pattern)
{
int[] lps = new int[pattern.Length];
int length = 0;
int i = 1;
while (i < pattern.Length)
{
if (pattern[i] == pattern[length])
{
length++;
lps[i] = length;
i++;
}
else
{
if (length != 0)
{
length = lps[length - 1];
}
else
{
lps[i] = length;
i++;
}
}
}
return lps;
}
and this is an example of how to use it:
static void Main(string[] args)
{
string text = "this is a test";
string pattern = "is";
foreach (var index in text.AllIndicesOf(pattern))
{
Console.WriteLine(index); // 2 5
}
}
caesay
I'm in a committed relationship with good code, so if we go out it will only be as friends. Feel free to contact me by email at mk.stackexchange<a>caesay.com
Updated on August 09, 2021Comments
-
caesay almost 3 years
I have a large string I need to parse, and I need to find all the instances of
extract"(me,i-have lots. of]punctuation
, and store the index of each to a list.So say this piece of string was in the beginning and middle of the larger string, both of them would be found, and their indexes would be added to the
List
. and theList
would contain0
and the other index whatever it would be.I've been playing around, and the
string.IndexOf
does almost what I'm looking for, and I've written some code - but it's not working and I've been unable to figure out exactly what is wrong:List<int> inst = new List<int>(); int index = 0; while (index < source.LastIndexOf("extract\"(me,i-have lots. of]punctuation", 0) + 39) { int src = source.IndexOf("extract\"(me,i-have lots. of]punctuation", index); inst.Add(src); index = src + 40; }
inst
= The listsource
= The large string
Any better ideas?
-
m0sa about 14 yearsWhy not use IEnumerable<int> and yield return index instead of the indexes list?
-
csaam about 14 yearsYou forgot to escape the subString though.
-
Matti Virkkunen about 14 years@m0sa: Good point. Added another version just for the fun of it.
-
caesay over 11 yearsif his code was incorrect you could've edited his post to correct it
-
arame3333 over 11 yearsI hadn't noticed that. I have to admit to being reluctant to do that, just in case I am wrong, although I don't think I am.
-
PedroC88 over 10 yearsDoes the use of yield have any incidence in the performance? Does the retrieval of the objects becomes async or lazy? Os is it just syntactic sugar and both codes are the same?
-
Matti Virkkunen over 10 years@PedroC88: Using
yield
will make the code "lazy". It won't collect all the indexes into an in-memory list within the method. What kind of practical effect that has on performance depends on a lot of factors. -
Anshul about 8 yearsThis doesn't compile
-
Saggio about 8 yearswhat is
indexes
? It's not defined anywhere. -
csaam about 8 yearsMy bad it's a remnant. Delete that line.
-
Marek Woźniak almost 8 yearsthat's not good idea to use regex for large string. The approach takes a lot of memory.
-
Pranay Deep over 7 yearspublic static List<int> AllIndexesOf(this string str, string value) { if (String.IsNullOrEmpty(value)) throw new ArgumentException("the string to find may not be empty", "value"); List<int> indexes = new List<int>(); for (int index = 0;; index += value.Length) { index = str.IndexOf(value, index); if (index == -1) return indexes; indexes.Add(index); index--; } } This covers tests cases like AOOAOOA Output 0 and 3
-
Paul over 7 yearsI'm confused by the exception text provided. You're telling the user that the string may not be empty if the string is null or empty?
-
Matti Virkkunen over 7 years@Paul: "May not" as in "must not". If you don't like the wording you can always suggest an edit, but I don't think it's that hard to understand.
-
Paul over 7 yearsAha. Completely missed the meaning of the message. Thank you.
-
Christoph Meißner about 7 yearsAttention! Due to adding
value.Length
you may miss nested matches! Example: "This is a NestedNestedNested match test!" with matching for "NestedNested" will find only one index, but not the nested one. To fix this just add+=1
in loop instead of+=value.Length
. -
user280498 over 6 yearsBeware that this method has the same flaw as the accepted answer. If your source string is "ccc" and pattern is "cc" then it will return only one occurrence.
-
ScubaSteve over 5 years@MattiVirkkunen I like this answer. I would change one thing. Instead of throwing an exception, just return null or empty list as this would show that nothing was found without run time exceptions being thrown.
-
caesay about 5 yearsHi Mark, I see that you're new to stackoverflow. This answer doesn't add anything to this old question, there are already much better answers. If answering a question like this in the future, please try to explain why your answer contains some information or value which does not already exist in other answers.
-
Denny Jacob about 4 yearsThis is preferable to the accepted solution because of its lower cyclomatic complexity.
-
M.Khooryani almost 4 yearsYour solution is not correct! for finding "aa" in "aaa", the output should be [0, 1] but your algorithm returns [0]
-
Matti Virkkunen almost 4 years@M.Khooryani Depends on the definition of correct. This finds non-overlapping substrings, which seemed to be what OP was looking for.
-
caesay almost 4 yearsWhat is the performance of this compared to the optimal IndexOf implementation, where the search start index is set to the end of the previous match on each iteration?
-
M.Khooryani almost 4 yearsComparing IndexOf with AllIndicesOf is incorrect since their output is different. Using the IndexOf method in each iteration, enormously increases the time complexity to O(N^2 M), while optimal complexity is O(N+M). KMP doesn't work similarly to the naive approach, it uses a precomputed array(LPS) to avoid searching from the start. Recommend you read the KMP algorithm. The last paragraphs of the "Background" section in Wikipedia explain how it works in O(N).
-
b-pos465 over 3 yearsThere are two problems here: First, you always add -1 to your result list, which is not a valid result. Second, the code does not terminate due to
indexOf
returning -1 and theindex++
. I would use awhile (true)
with abreak;
if the result ofIndexOf
is -1. -
DAG over 3 yearsYou should throw an exception
-
Kylo Ren over 3 yearsseems not working, for "aaa" and "aa" inputs.
-
Kiquenet over 3 yearsOther string-searching algorithms in C# ?
-
ehosca almost 3 yearsthat is the expected behavior... when you match "aa" in "aaa" what remains is "xxa" that does not match "aa"
-
Gert Arnold almost 3 yearsPlease don't aks questions in answers. Explain instead. In this case: explain why your answer improves all other answers given here.
-
drdwilcox over 2 yearsI do not understand why more people have not accepted this answer. It performs better on long input strings and with long search strings. It is more complex to read, but taking it as a black box and using it as the go-to for solving this problem with long input strings is the way to go.
-
Mr. Squirrel.Downy about 2 years@user280498 This is not necessarily a "flaw" depending on the usage scenario...
-
martinoss about 2 yearsWarning: IndexOutOfRangeException when string contains German "ß" character. Example: "Vertrag abschließen", search "abschliessen". To fix it, go with the suggestion from @ChristophMeißner to replace
+=value.Length
with+=1
-
M.Khooryani almost 2 years@drdwilcox because this answer was posted 10 years after the verified answer.