How to replace multiple white spaces with one white space

Solution 1

string cleanedString = System.Text.RegularExpressions.Regex.Replace(dirtyString,@"\s+"," ");

Solution 2

This question isn't as simple as other posters have made it out to be (and as I originally believed it to be), because the question isn't quite as precise as it needs to be.

There's a difference between "space" and "whitespace". If you only mean spaces, then you should use a regex of " {2,}". If you mean any whitespace, that's a different matter. Should all whitespace be converted to spaces? What should happen to space at the start and end?
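To make the distinction concrete, here is a minimal sketch contrasting the two patterns. The class and method names are made up for illustration; only `Regex.Replace` and the two patterns come from the answers on this page.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class SpaceVersusWhitespace
{
    // Collapses runs of two or more U+0020 spaces; tabs and newlines are left alone.
    public static string CollapseSpaces(string input) =>
        Regex.Replace(input, " {2,}", " ");

    // Replaces every run of whitespace (spaces, tabs, newlines, ...) with one space,
    // so a single tab also becomes a space.
    public static string CollapseWhitespace(string input) =>
        Regex.Replace(input, @"\s+", " ");
}
```

For the input `"a  b\t\tc"`, `CollapseSpaces` returns `"a b\t\tc"` (the tabs survive), while `CollapseWhitespace` returns `"a b c"`.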

For the benchmark below, I've assumed that you only care about spaces, and you don't want to do anything to single spaces, even at the start and end.

Note that correctness is almost always more important than performance. The Split/Join solution removes any leading/trailing whitespace (even just single spaces), which is incorrect according to your specified requirements (which may be incomplete, of course).
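The difference in behaviour at the ends of the string can be seen with a small sketch. The class and method names here are hypothetical; the bodies mirror the two approaches compared in the benchmark.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LeadingTrailingDemo
{
    // Split on spaces, drop empty entries, rejoin: leading and trailing spaces vanish.
    public static string WithSplitAndJoin(string input) =>
        string.Join(" ", input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));

    // Only runs of two or more spaces are touched; single spaces stay where they are.
    public static string WithRegex(string input) =>
        Regex.Replace(input, " {2,}", " ");
}
```

Given `" a  b "`, `WithSplitAndJoin` returns `"a b"`, whereas `WithRegex` returns `" a b "` and keeps the single leading and trailing space.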

The benchmark uses MiniBench.

using System;
using System.Text.RegularExpressions;
using MiniBench;

internal class Program
{
    public static void Main(string[] args)
    {

        int size = int.Parse(args[0]);
        int gapBetweenExtraSpaces = int.Parse(args[1]);

        char[] chars = new char[size];
        for (int i=0; i < size/2; i += 2)
        {
            // Make sure there actually *is* something to do
            chars[i*2] = (i % gapBetweenExtraSpaces == 1) ? ' ' : 'x';
            chars[i*2 + 1] = ' ';
        }
        // Just to make sure we don't have a \0 at the end
        // for odd sizes
        chars[chars.Length-1] = 'y';

        string bigString = new string(chars);
        // Assume that one form works :)
        string normalized = NormalizeWithSplitAndJoin(bigString);


        var suite = new TestSuite<string, string>("Normalize")
            .Plus(NormalizeWithSplitAndJoin)
            .Plus(NormalizeWithRegex)
            .RunTests(bigString, normalized);

        suite.Display(ResultColumns.All, suite.FindBest());
    }

    private static readonly Regex MultipleSpaces = 
        new Regex(@" {2,}", RegexOptions.Compiled);

    static string NormalizeWithRegex(string input)
    {
        return MultipleSpaces.Replace(input, " ");
    }

    // Guessing as the post doesn't specify what to use
    private static readonly char[] Whitespace =
        new char[] { ' ' };

    static string NormalizeWithSplitAndJoin(string input)
    {
        string[] split = input.Split
            (Whitespace, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", split);
    }
}

A few test runs:

c:\Users\Jon\Test>test 1000 50
============ Normalize ============
NormalizeWithSplitAndJoin  1159091 0:30.258 22.93
NormalizeWithRegex        26378882 0:30.025  1.00

c:\Users\Jon\Test>test 1000 5
============ Normalize ============
NormalizeWithSplitAndJoin  947540 0:30.013 1.07
NormalizeWithRegex        1003862 0:29.610 1.00


c:\Users\Jon\Test>test 1000 1001
============ Normalize ============
NormalizeWithSplitAndJoin  1156299 0:29.898 21.99
NormalizeWithRegex        23243802 0:27.335  1.00

Here the first number is the number of iterations, the second is the time taken, and the third is a scaled score with 1.0 being the best.

That shows that in at least some cases (including this one) a regular expression can outperform the Split/Join solution, sometimes by a very significant margin.

However, if you change to an "all whitespace" requirement, then Split/Join does appear to win. As is so often the case, the devil is in the detail...
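For reference, an "all whitespace" Split/Join could look roughly like the sketch below. It relies on the documented behaviour that a null separator array makes `string.Split` treat whitespace characters as delimiters; the class and method names are assumptions, and this is not the exact code that was benchmarked.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class AllWhitespaceSplitJoin
{
    // A null separator array tells Split to use whitespace characters as delimiters.
    // As with the space-only version, leading and trailing whitespace is dropped.
    public static string Normalize(string input) =>
        string.Join(" ", input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
}
```

For example, `Normalize("  a\t\tb\n c ")` returns `"a b c"`.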

Solution 3

A regular expression would be the easiest way. If you write the regex the correct way, you won't need multiple calls.

Change it to this:

// s is the existing string variable from the question; assign the result back to it.
s = System.Text.RegularExpressions.Regex.Replace(s, @"\s{2,}", " ");

Solution 4

While the existing answers are fine, I'd like to point out one approach which doesn't work:

public static string DontUseThisToCollapseSpaces(string text)
{
    while (text.IndexOf("  ") != -1)
    {
        text = text.Replace("  ", " ");
    }
    return text;
}

This can loop forever. Anyone care to guess why? (I only came across this when it was asked as a newsgroup question a few years ago... someone actually ran into it as a problem.)
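Comments on this page attribute the hang to `IndexOf(string)` using a culture-sensitive comparison, which can ignore characters such as the zero-width non-joiner, while `Replace(string, string)` compares ordinally. Assuming that explanation is right, a sketch of a loop that cannot disagree with itself passes `StringComparison.Ordinal` to the search. The class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class OrdinalCollapse
{
    public static string CollapseSpacesOrdinal(string text)
    {
        // An ordinal search compares raw char values, which matches what
        // string.Replace(string, string) does, so the loop condition and the
        // replacement always agree on whether a double space is present.
        while (text.IndexOf("  ", StringComparison.Ordinal) != -1)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }
}
```

With this version, spaces separated by an ignorable character are not treated as adjacent, so the loop simply ends. A single regex replace avoids the repeated passes entirely.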

Solution 5

A fast extra whitespace remover by Felipe Machado. (Modified by RW for multi-space removal)

static string DuplicateWhiteSpaceRemover(string str)
{
    var len = str.Length;
    var src = str.ToCharArray();
    int dstIdx = 0;
    bool lastWasWS = false; //Added line
    for (int i = 0; i < len; i++)
    {
        var ch = src[i];
        switch (ch)
        {
            case '\u0020': //SPACE
            case '\u00A0': //NO-BREAK SPACE
            case '\u1680': //OGHAM SPACE MARK
            case '\u2000': // EN QUAD
            case '\u2001': //EM QUAD
            case '\u2002': //EN SPACE
            case '\u2003': //EM SPACE
            case '\u2004': //THREE-PER-EM SPACE
            case '\u2005': //FOUR-PER-EM SPACE
            case '\u2006': //SIX-PER-EM SPACE
            case '\u2007': //FIGURE SPACE
            case '\u2008': //PUNCTUATION SPACE
            case '\u2009': //THIN SPACE
            case '\u200A': //HAIR SPACE
            case '\u202F': //NARROW NO-BREAK SPACE
            case '\u205F': //MEDIUM MATHEMATICAL SPACE
            case '\u3000': //IDEOGRAPHIC SPACE
            case '\u2028': //LINE SEPARATOR
            case '\u2029': //PARAGRAPH SEPARATOR
            case '\u0009': //[ASCII Tab]
            case '\u000A': //[ASCII Line Feed]
            case '\u000B': //[ASCII Vertical Tab]
            case '\u000C': //[ASCII Form Feed]
            case '\u000D': //[ASCII Carriage Return]
            case '\u0085': //NEXT LINE
                if (lastWasWS == false) //Added line
                {
                    src[dstIdx++] = ' '; // Updated by Ryan
                    lastWasWS = true; //Added line
                }
                continue;
            default:
                lastWasWS = false; //Added line 
                src[dstIdx++] = ch;
                break;
        }
    }
    return new string(src, 0, dstIdx);
}
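A shorter variant of the same idea, as one commenter suggests, is to let `char.IsWhiteSpace` decide what counts as whitespace instead of listing every code point. The sketch below is an assumption-laden alternative, not the benchmarked function: the names are hypothetical, it uses a `StringBuilder` rather than the in-place char array, and it has not been timed.

```csharp
using System.Text;

// Hypothetical helper class, for illustration only.
public static class WhitespaceCollapser
{
    public static string CollapseWithIsWhiteSpace(string str)
    {
        var sb = new StringBuilder(str.Length);
        bool lastWasWhitespace = false;
        foreach (char ch in str)
        {
            if (char.IsWhiteSpace(ch))
            {
                // Emit a single space for the first character of a whitespace run only.
                if (!lastWasWhitespace)
                {
                    sb.Append(' ');
                }
                lastWasWhitespace = true;
            }
            else
            {
                sb.Append(ch);
                lastWasWhitespace = false;
            }
        }
        return sb.ToString();
    }
}
```

For example, `CollapseWithIsWhiteSpace("a \t\r\n b")` returns `"a b"`. The explicit switch above is likely faster, which is the point of that answer; this version trades speed for brevity.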

The benchmarks...

|                           | Time  |   TEST 1    |   TEST 2    |   TEST 3    |   TEST 4    |   TEST 5    |
| Function Name             |(ticks)| dup. spaces | spaces+tabs | spaces+CR/LF| " " -> " "  | "  " -> " " |
|---------------------------|-------|-------------|-------------|-------------|-------------|-------------|
| SwitchStmtBuildSpaceOnly  |   5.2 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| InPlaceCharArraySpaceOnly |   5.6 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| DuplicateWhiteSpaceRemover|   7.0 |    PASS     |    PASS     |    PASS     |    PASS     |    PASS     |
| SingleSpacedTrim          |  11.8 |    PASS     |    PASS     |    PASS     |    FAIL     |    FAIL     |
| Fubo(StringBuilder)       |    13 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| User214147                |    19 |    PASS     |    PASS     |    PASS     |    FAIL     |    FAIL     | 
| RegExWithCompile          |    28 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| SwitchStmtBuild           |    34 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| SplitAndJoinOnSpace       |    55 |    PASS     |    FAIL     |    FAIL     |    FAIL     |    FAIL     |
| RegExNoCompile            |   120 |    PASS     |    PASS     |    PASS     |    PASS     |    PASS     |
| RegExBrandon              |   137 |    PASS     |    FAIL     |    PASS     |    PASS     |    PASS     |

Benchmark notes: Release mode, no debugger attached, i7 processor, average of 4 runs, only short strings tested

SwitchStmtBuildSpaceOnly by Felipe Machado 2015 and modified by Sunsetquest

InPlaceCharArraySpaceOnly by Felipe Machado 2015 and modified by Sunsetquest

SwitchStmtBuild by Felipe Machado 2015 and modified by Sunsetquest

SwitchStmtBuild2 by Felipe Machado 2015 and modified by Sunsetquest

SingleSpacedTrim by David S 2013

Fubo(StringBuilder) by fubo 2014

SplitAndJoinOnSpace by Jon Skeet 2009

RegExWithCompile by Jon Skeet 2009

User214147 by user214147

RegExBrandon by Brandon

RegExNoCompile by Tim Hoolihan

Benchmark code is on Github

Author: Matt

Updated on July 08, 2022

Comments

  • Matt
    Matt almost 2 years

    Let's say I have a string such as:

    "Hello     how are   you           doing?"
    

    I would like a function that turns multiple spaces into one space.

    So I would get:

    "Hello how are you doing?"
    

    I know I could use regex or call

    string s = "Hello     how are   you           doing?".Replace("  "," ");
    

    But I would have to call it multiple times to make sure all sequential whitespaces are replaced with only one.

    Is there already a built in method for this?

    • Jon Skeet
      Jon Skeet almost 15 years
      Could you clarify: are you only dealing with spaces, or "all" whitespace?
    • Jon Skeet
      Jon Skeet almost 15 years
      And do you want any non-space whitespace to be converted into spaces?
    • smirkingman
      smirkingman about 5 years
      2 things to consider: 1. char.IsWhiteSpace includes carriage-return, linefeed etc. 2. 'whitespace' is probably more accurately tested with Char.GetUnicodeCategory(ch) = Globalization.UnicodeCategory.SpaceSeparator
  • Scott Dorman
    Scott Dorman almost 15 years
    Using a regular expression introduces a lot of overhead that isn't necessary.
  • Jon Skeet
    Jon Skeet almost 15 years
    If the regular expression is compiled and cached, I'm not sure that has more overhead than splitting and joining, which could create loads of intermediate garbage strings. Have you done careful benchmarks of both approaches before assuming that your way is faster?
  • Tim Hoolihan
    Tim Hoolihan almost 15 years
    imo, avoiding regex if you're comfortable with them is premature optimization
  • Tim Hoolihan
    Tim Hoolihan almost 15 years
    whitespace is undeclared here
  • 赵君君
    赵君君 almost 15 years
    I think I remember this question being asked awhile back on SO. IndexOf ignores certain characters that Replace doesn't. So the double space was always there, just never removed.
  • Jon Skeet
    Jon Skeet almost 15 years
    Speaking of overhead, why on earth are you calling source.ToCharArray() and then throwing away the result?
  • Daniel
    Daniel almost 15 years
    If your application isn't time critical, it can afford the 1 microsecond of processing overhead.
  • Jon Skeet
    Jon Skeet almost 15 years
    And calling ToCharArray() on the result of string.Join, only to create a new string... wow, for that to be in a post complaining of overhead is just remarkable. -1.
  • Jon Skeet
    Jon Skeet almost 15 years
    Oh, and assuming whitespace is new char[] { ' ' }, this will give the wrong result if the input string starts or ends with a space.
  • Scott Dorman
    Scott Dorman almost 15 years
    No, I've not done benchmarks, but I know there is higher overhead for RegEx compared to the Split and Join. From what it looks like Split and Join either use character buffers, treat the string as an array of characters or go through unsafe code to do pointer manipulations.
  • Scott Dorman
    Scott Dorman almost 15 years
    grrr...copied from a larger example...updated to reflect the comments.
  • Jon Skeet
    Jon Skeet almost 15 years
    "Knowing" there's a higher overhead for regexes isn't nearly as good as proving it with benchmarks. I'm running benchmarks now, and will post results soon.
  • Bart Kiers
    Bart Kiers almost 15 years
    Note that '\s' not only replaces white spaces, but also new line characters.
  • Tim Hoolihan
    Tim Hoolihan almost 15 years
    good catch, if you just want spaces switch the pattern to "[ ]+"
  • ahawker
    ahawker almost 15 years
    It is because IndexOf ignores some Unicode characters, the specific culprit in this case being some Asian character iirc. Hmm, zero-width non-joiner according to Google.
  • Scott Dorman
    Scott Dorman almost 15 years
    Great analysis. So it appears that we were both correct to varying degrees. The code in my answer was taken from a larger function which has the ability to normalize all whitespace and/or control characters from within a string and from the beginning and end.
  • Jon Skeet
    Jon Skeet almost 15 years
    With just the whitespace characters you specified, in most of my tests the regex and Split/Join were about equal - S/J had a tiny, tiny benefit, at the cost of correctness and complexity. For those reasons, I'd normally prefer the regex. Don't get me wrong - I'm far from a regex fanboy, but I don't like writing more complex code for the sake of performance without really testing the performance first.
  • angularsen
    angularsen over 12 years
    Shouldn't you use '{2,}' instead of '+' to avoid replacing single whitespaces?
  • Ian Ringrose
    Ian Ringrose over 10 years
    NormalizeWithSplitAndJoin will create a lot more garbage; it is hard to tell whether a real program would be hit by more GC time than the benchmark.
  • Antonio Bakula
    Antonio Bakula almost 9 years
    I learned that the hard way :( stackoverflow.com/questions/9260693/…
  • Herman
    Herman over 8 years
    This replaces all non-word characters with space. So it would also replace things like brackets and quotes etc, which might not be what you want.
  • Efrain
    Efrain over 7 years
    Exactly! To me this is the most elegant approach, too. So for the record, in C# that would be: string.Join(" ", myString.Split(' ').Where(s => s != "").ToArray())
  • David
    David over 7 years
    Minor improvement on the Split to catch all whitespace and remove the Where clause: myString.Split(null as char[], StringSplitOptions.RemoveEmptyEntries)
  • Dronz
    Dronz about 6 years
    @IanRingrose What sort of garbage can be created?
  • David Specht
    David Specht over 5 years
    @angularsen My one issue with @"\s{2,}" is that it fails to replace single tabs and other Unicode space characters with a space. @"\s+" will do that for you.
  • David Specht
    David Specht over 5 years
    My one issue with @"\s{2,}" is that it fails to replace single tabs and other Unicode space characters with a space. If you are going to replace 2 tabs with a space, then you should probably replace 1 tab with a space. @"\s+" will do that for you.
  • Loudenvier
    Loudenvier about 5 years
    Nice to see my article referenced here! (I'm Felipe Machado) I'm about to update it using a proper benchmark tool called BenchmarkDotNet! I'll try to setup runs in all runtimes (now that we have DOT NET CORE and the likes...
  • SunsetQuest
    SunsetQuest about 5 years
    @Loudenvier - Nice work on this. Yours was the quickest by almost 400%! .Net Core is like a free 150-200% performance boost. It's getting closer to c++ performance but much easier to code. Thanks for the comment.
  • Martin Brabec
    Martin Brabec over 4 years
    I learned the hard way. Especially with two Zero Width Non Joiners (\u200C\u200C). IndexOf returns the index of this "double space", but Replace does not replace it. I think it is because for IndexOf, you need to specify StringComparison (Ordinal) to behave the same as Replace. This way, neither of these two will locate "double spaces". More about StringComparison: docs.microsoft.com/en-us/dotnet/api/…
  • Evil Pigeon
    Evil Pigeon almost 4 years
    This only does spaces, not other white space characters. Maybe you want char.IsWhiteSpace(ch) instead of src[i] == '\u0020'. I notice this has been edited by the community. Did they bork it up?