Fastest way to replace multiple strings in a huge string

30,840

Solution 1

Using unsafe and compiled as x64

result:

Implementation       | Exec   | GC
#1 Simple            | 4706ms |  0ms
#2 Simple parallel   | 2265ms |  0ms
#3 ParallelSubstring |  800ms | 21ms
#4 Fredou unsafe     |  432ms | 15ms

take the code of Erti-Chris Eelmaa and replace my previous one with this.

I don't think I will do another iteration but i did learn a few thing with unsafe which is a good thing :-)

    private unsafe static void FredouImplementation(string input, int inputLength, string replace, string[] replaceBy)
    {
        var indexes = new List<int>();

        //input = "ABCDABCABCDABCABCDABCABCDABCD";
        //inputLength = input.Length;
        //replaceBy = new string[] { "AA", "BB", "CC", "DD", "EE" };

        //my own string.indexof to save a few ms
        int len = inputLength;

        fixed (char* i = input, r = replace)
        {
            int replaceValAsInt = *((int*)r);

            while (--len > -1)
            {
                if (replaceValAsInt == *((int*)&i[len]))
                {
                    indexes.Add(len--);
                }
            }                
        }

        var idx = indexes.ToArray();
        len = indexes.Count;

        Parallel.For(0, replaceBy.Length, l =>
            Process(input, inputLength, replaceBy[l], idx, len)
        );
    }

    private unsafe static void Process(string input, int len, string replaceBy, int[] idx, int idxLen)
    {
        var output = new char[len];

        fixed (char* o = output, i = input, r = replaceBy)
        {
            int replaceByValAsInt = *((int*)r);

            //direct copy, simulate string.copy
            while (--len > -1)
            {
                o[len] = i[len];
            }

            while (--idxLen > -1)
            {
                ((int*)&o[idx[idxLen]])[0] = replaceByValAsInt;
            }
        }

        //Console.WriteLine(output);
    }

Solution 2

I had a similar issue on a project and I've implemented a Regex solution to perform multiple and case insensitive replacements on a file.

For efficiency purposes, I set criteria to go through the original string only once.

I've published a simple console app to test some strategies on https://github.com/nmcc/Spikes/tree/master/StringMultipleReplacements

The code for the Regex solution is similar to this:

Dictionary<string, string> replacements = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    // Fill the dictionary with the proper replacements:

        StringBuilder patternBuilder = new StringBuilder();
                patternBuilder.Append('(');

                bool firstReplacement = true;

                foreach (var replacement in replacements.Keys)
                {
                    if (!firstReplacement)
                        patternBuilder.Append('|');
                    else
                        firstReplacement = false;

                    patternBuilder.Append('(');
                    patternBuilder.Append(Regex.Escape(replacement));
                    patternBuilder.Append(')');
                }
                patternBuilder.Append(')');

                var regex = new Regex(patternBuilder.ToString(), RegexOptions.IgnoreCase);

                return regex.Replace(sourceContent, new MatchEvaluator(match => replacements[match.Groups[1].Value]));

EDIT: The execution times running the test application on my computer are:

  • Looping through the replacements calling string.Substring() (CASE SENSITIVE): 2ms
  • Single pass using Regex with multiple replacements at once (Case insensitive): 8ms
  • Looping through replacements using a ReplaceIgnoreCase Extension (Case insensitive): 55ms

Solution 3

It sounds like you are tokenising the string? I would look at producing a buffer and indexing your tokens. Or using a templating engine

As a naive example you could use code generation to make the following method

public string Produce(string tokenValue){

    var builder = new StringBuilder();
    builder.Append("A");
    builder.Append(tokenValue);
    builder.Append("D");

    return builder.ToString();

}

If your running the iterations enough times, the time to build the template will pay for itself. You can then also call that method in parallel with no side effects. Also look at interning your strings

Solution 4

You probably won't get anything faster than String.Replace (unless you go native) because iirc String.Replace is implemented in CLR itself for maximum performance. If you want 100% performance, you can conveniently interface with native ASM code via C++/CLI and go from there.

Solution 5

I made a variation on Fredou's code that requires less compares as it works on int* instead of char*. It still requires n iterations for a string of n length, it just has to do less comparing. You could have n/2 iterations if the string is neatly aligned by 2 (so the string to replace can only occur at indexes 0, 2, 4, 6, 8, etc) or even n/4 if it's aligned by 4 (you'd use long*). I'm not very good at bit fiddling like this, so someone might be able to find some obvious flaw in my code that could be more efficient. I verified that the result of my variation is the same as that of the simple string.Replace.

Additionally, I expect that some gains could be made in the 500x string.Copy that it does, but haven't looked into that yet.

My results (Fredou II):

IMPLEMENTATION       |  EXEC MS | GC MS
#1 Simple            |     6816 |     0
#2 Simple parallel   |     4202 |     0
#3 ParallelSubstring |    27839 |     4
#4 Fredou I          |     2103 |   106
#5 Fredou II         |     1334 |    91

So about 2/3 of the time (x86, but x64 was about the same).

For this code:

private unsafe struct TwoCharStringChunk
{
  public fixed char chars[2];
}

private unsafe static void FredouImplementation_Variation1(string input, int inputLength, string replace, TwoCharStringChunk[] replaceBy)
{
  var output = new string[replaceBy.Length];

  for (var i = 0; i < replaceBy.Length; ++i)
    output[i] = string.Copy(input);

  var r = new TwoCharStringChunk();
  r.chars[0] = replace[0];
  r.chars[1] = replace[1];

  _staticToReplace = r;

  Parallel.For(0, replaceBy.Length, l => Process_Variation1(output[l], input, inputLength, replaceBy[l]));
}

private static TwoCharStringChunk _staticToReplace ;

private static unsafe void Process_Variation1(string output, string input, int len, TwoCharStringChunk replaceBy)
{
  int n = 0;
  int m = len - 1;

  fixed (char* i = input, o = output, chars = _staticToReplace .chars)
  {
    var replaceValAsInt = *((int*)chars);
    var replaceByValAsInt = *((int*)replaceBy.chars);

    while (n < m)
    {
      var compareInput = *((int*)&i[n]);

      if (compareInput == replaceValAsInt)
      {
        ((int*)&o[n])[0] = replaceByValAsInt;
        n += 2;
      }
      else
      {
        ++n;
      }
    }
  }
}

The struct with the fixed buffer is not strictly necessary here and could have been replaced with a simple int field, but expand the char[2] to char[3] and this code can be made to work with three letter strings as well, which wouldn't be possible if it was an int field.

It required some changes to the Program.cs as well, so here's the full gist:

https://gist.github.com/JulianR/7763857

EDIT: I'm not sure why my ParallelSubstring is so slow. I'm running .NET 4 in Release mode, no debugger, in either x86 or x64.

Share:
30,840
Yannis
Author by

Yannis

Updated on April 21, 2021

Comments

  • Yannis
    Yannis about 3 years

    I'm looking for the fastest way to replace multiple (~500) substrings of a big (~1mb) string. Whatever I have tried it seems that String.Replace is the fastest way of doing it.

    I just care about the fastest possible way. Not code readability, maintainability etc. I don't care if I need to use unsafe code or pre-process the original string either.

    Each replace iteration will replace ABC on the string with some other string (different per replace iteration). The string to replace will ALWAYS be the same - ABC will always be ABC. Never ABD. So if there are 400.000 thousands replace iterations. The same string - ABC - will be replaced with some other (different) string each time.

    I can be in control of what ABC is. I can make it super-short or super-long as long as it doesn't affect the results. Clearly ABC can't be hello cause hello will exist as a word in most of the input strings.

    Example input: ABCDABCABCDABCABCDABCABCDABCD

    Example replace from string: BC

    Example replace with strings: AA, BB, CC, DD, EE (5 iterations)

    Example outputs:

    AAADAAAAAADAAAAAADAAAAAADAAAD
    ABBDABBABBDABBABBDABBABBDABBD
    ACCDACCACCDACCACCDACCACCDACCD
    ADDDADDADDDADDADDDADDADDDADDD
    AEEDAEEAEEDAEEAEEDAEEAEEDAEED
    

    Average case: Input string is 100-200kb with 40.000 replace iterations. Worst case: Input string is 1-2mb with 400.000 replace iterations.

    I can do ANYTHING. Do it in parallel, do it unsafe, etc. It doesn't matter how I do it. What matters is that it needs to be as fast as it gets.