Efficient way to remove ALL whitespace from String?

704,153

Solution 1

This is fastest way I know of, even though you said you didn't want to use regular expressions:

Regex.Replace(XML, @"\s+", "");

Crediting @hypehuman in the comments, if you plan to do this more than once, create and store a Regex instance. This will save the overhead of constructing it every time, which is more expensive than you might think.

private static readonly Regex sWhitespace = new Regex(@"\s+");
public static string ReplaceWhitespace(string input, string replacement) 
{
    return sWhitespace.Replace(input, replacement);
}

Solution 2

I have an alternative way without regexp, and it seems to perform pretty good. It is a continuation on Brandon Moretz answer:

 public static string RemoveWhitespace(this string input)
 {
    return new string(input.ToCharArray()
        .Where(c => !Char.IsWhiteSpace(c))
        .ToArray());
 }

I tested it in a simple unit test:

[Test]
[TestCase("123 123 1adc \n 222", "1231231adc222")]
public void RemoveWhiteSpace1(string input, string expected)
{
    string s = null;
    for (int i = 0; i < 1000000; i++)
    {
        s = input.RemoveWhitespace();
    }
    Assert.AreEqual(expected, s);
}

[Test]
[TestCase("123 123 1adc \n 222", "1231231adc222")]
public void RemoveWhiteSpace2(string input, string expected)
{
    string s = null;
    for (int i = 0; i < 1000000; i++)
    {
        s = Regex.Replace(input, @"\s+", "");
    }
    Assert.AreEqual(expected, s);
}

For 1,000,000 attempts the first option (without regexp) runs in less then a second (700 ms on my machine), and the second takes 3.5 seconds.

Solution 3

Try the replace method of the string in C#.

XML.Replace(" ", string.Empty);

Solution 4

My solution is to use Split and Join and it is surprisingly fast, in fact the fastest of the top answers here.

str = string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));

Timings for 10,000 loop on simple string with whitespace inc new lines and tabs

  • split/join = 60 milliseconds
  • linq chararray = 94 milliseconds
  • regex = 437 milliseconds

Improve this by wrapping it up in method to give it meaning, and also make it an extension method while we are at it ...

public static string RemoveWhitespace(this string str) {
    return string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
}

Solution 5

Building on Henks answer I have created some test methods with his answer and some added, more optimized, methods. I found the results differ based on the size of the input string. Therefore, I have tested with two result sets. In the fastest method, the linked source has a even faster way. But, since it is characterized as unsafe I have left this out.

Long input string results:

  1. InPlaceCharArray: 2021 ms (Sunsetquest's answer) - (Original source)
  2. String split then join: 4277ms (Kernowcode's answer)
  3. String reader: 6082 ms
  4. LINQ using native char.IsWhitespace: 7357 ms
  5. LINQ: 7746 ms (Henk's answer)
  6. ForLoop: 32320 ms
  7. RegexCompiled: 37157 ms
  8. Regex: 42940 ms

Short input string results:

  1. InPlaceCharArray: 108 ms (Sunsetquest's answer) - (Original source)
  2. String split then join: 294 ms (Kernowcode's answer)
  3. String reader: 327 ms
  4. ForLoop: 343 ms
  5. LINQ using native char.IsWhitespace: 624 ms
  6. LINQ: 645ms (Henk's answer)
  7. RegexCompiled: 1671 ms
  8. Regex: 2599 ms

Code:

public class RemoveWhitespace
{
    public static string RemoveStringReader(string input)
    {
        var s = new StringBuilder(input.Length); // (input.Length);
        using (var reader = new StringReader(input))
        {
            int i = 0;
            char c;
            for (; i < input.Length; i++)
            {
                c = (char)reader.Read();
                if (!char.IsWhiteSpace(c))
                {
                    s.Append(c);
                }
            }
        }

        return s.ToString();
    }

    public static string RemoveLinqNativeCharIsWhitespace(string input)
    {
        return new string(input.ToCharArray()
            .Where(c => !char.IsWhiteSpace(c))
            .ToArray());
    }

    public static string RemoveLinq(string input)
    {
        return new string(input.ToCharArray()
            .Where(c => !Char.IsWhiteSpace(c))
            .ToArray());
    }

    public static string RemoveRegex(string input)
    {
        return Regex.Replace(input, @"\s+", "");
    }

    private static Regex compiled = new Regex(@"\s+", RegexOptions.Compiled);
    public static string RemoveRegexCompiled(string input)
    {
        return compiled.Replace(input, "");
    }

    public static string RemoveForLoop(string input)
    {
        for (int i = input.Length - 1; i >= 0; i--)
        {
            if (char.IsWhiteSpace(input[i]))
            {
                input = input.Remove(i, 1);
            }
        }
        return input;
    }

    public static string StringSplitThenJoin(this string str)
    {
        return string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
    }

    public static string RemoveInPlaceCharArray(string input)
    {
        var len = input.Length;
        var src = input.ToCharArray();
        int dstIdx = 0;
        for (int i = 0; i < len; i++)
        {
            var ch = src[i];
            switch (ch)
            {
                case '\u0020':
                case '\u00A0':
                case '\u1680':
                case '\u2000':
                case '\u2001':
                case '\u2002':
                case '\u2003':
                case '\u2004':
                case '\u2005':
                case '\u2006':
                case '\u2007':
                case '\u2008':
                case '\u2009':
                case '\u200A':
                case '\u202F':
                case '\u205F':
                case '\u3000':
                case '\u2028':
                case '\u2029':
                case '\u0009':
                case '\u000A':
                case '\u000B':
                case '\u000C':
                case '\u000D':
                case '\u0085':
                    continue;
                default:
                    src[dstIdx++] = ch;
                    break;
            }
        }
        return new string(src, 0, dstIdx);
    }
}

Tests:

[TestFixture]
public class Test
{
    // Short input
    //private const string input = "123 123 \t 1adc \n 222";
    //private const string expected = "1231231adc222";

    // Long input
    private const string input = "123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222";
    private const string expected = "1231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc222";

    private const int iterations = 1000000;

    [Test]
    public void RemoveInPlaceCharArray()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveInPlaceCharArray(input);
        }

        stopwatch.Stop();
        Console.WriteLine("InPlaceCharArray: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveStringReader()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveStringReader(input);
        }

        stopwatch.Stop();
        Console.WriteLine("String reader: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveLinqNativeCharIsWhitespace()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveLinqNativeCharIsWhitespace(input);
        }

        stopwatch.Stop();
        Console.WriteLine("LINQ using native char.IsWhitespace: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveLinq()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveLinq(input);
        }

        stopwatch.Stop();
        Console.WriteLine("LINQ: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveRegex()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveRegex(input);
        }

        stopwatch.Stop();
        Console.WriteLine("Regex: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveRegexCompiled()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveRegexCompiled(input);
        }

        stopwatch.Stop();
        Console.WriteLine("RegexCompiled: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveForLoop()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveForLoop(input);
        }

        stopwatch.Stop();
        Console.WriteLine("ForLoop: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }

    [TestMethod]
    public void StringSplitThenJoin()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.StringSplitThenJoin(input);
        }

        stopwatch.Stop();
        Console.WriteLine("StringSplitThenJoin: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }
}

Edit: Tested a nice one liner from Kernowcode.

Share:
704,153
Corey Ogburn
Author by

Corey Ogburn

Updated on January 13, 2022

Comments

  • Corey Ogburn
    Corey Ogburn over 2 years

    I'm calling a REST API and am receiving an XML response back. It returns a list of a workspace names, and I'm writing a quick IsExistingWorkspace() method. Since all workspaces consist of contiguous characters with no whitespace, I'm assuming the easiest way to find out if a particular workspace is in the list is to remove all whitespace (including newlines) and doing this (XML is the string received from the web request):

    XML.Contains("<name>" + workspaceName + "</name>");
    

    I know it's case-sensitive, and I'm relying on that. I just need a way to remove all whitespace in a string efficiently. I know RegEx and LINQ can do it, but I'm open to other ideas. I am mostly just concerned about speed.