How to replace multiple white spaces with one white space
Solution 1
string cleanedString = System.Text.RegularExpressions.Regex.Replace(dirtyString,@"\s+"," ");
Solution 2
This question isn't as simple as other posters have made it out to be (and as I originally believed it to be), because the question isn't quite as precise as it needs to be.
There's a difference between "space" and "whitespace". If you only mean spaces, then you should use a regex of " {2,}". If you mean any whitespace, that's a different matter. Should all whitespace be converted to spaces? What should happen to spaces at the start and end?
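To make the distinction concrete, here is a minimal sketch of the two interpretations. The class and method names are hypothetical and chosen only for illustration; the patterns are the " {2,}" mentioned above and the \s+ used in Solution 1.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper names, for illustration only.
public static class WhitespaceSketch
{
    // Collapses runs of two or more U+0020 spaces.
    // Tabs, newlines and single spaces are left untouched.
    public static string CollapseSpaces(string input)
    {
        return Regex.Replace(input, " {2,}", " ");
    }

    // Replaces every run of whitespace (even a single tab or newline)
    // with exactly one space.
    public static string CollapseAllWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }
}
```

For the input "a  b\t\tc\n d", CollapseSpaces returns "a b\t\tc\n d" while CollapseAllWhitespace returns "a b c d". Neither version trims: " a " comes back as " a " from both.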
For the benchmark below, I've assumed that you only care about spaces, and you don't want to do anything to single spaces, even at the start and end.
Note that correctness is almost always more important than performance. The Split/Join solution removes any leading/trailing whitespace (even just single spaces), which is incorrect as far as your specified requirements go (though those may be incomplete, of course).
The benchmark uses MiniBench.
using System;
using System.Text.RegularExpressions;
using MiniBench;

internal class Program
{
    public static void Main(string[] args)
    {
        int size = int.Parse(args[0]);
        int gapBetweenExtraSpaces = int.Parse(args[1]);

        char[] chars = new char[size];
        for (int i=0; i < size/2; i += 2)
        {
            // Make sure there actually *is* something to do
            chars[i*2] = (i % gapBetweenExtraSpaces == 1) ? ' ' : 'x';
            chars[i*2 + 1] = ' ';
        }
        // Just to make sure we don't have a \0 at the end
        // for odd sizes
        chars[chars.Length-1] = 'y';

        string bigString = new string(chars);
        // Assume that one form works :)
        string normalized = NormalizeWithSplitAndJoin(bigString);

        var suite = new TestSuite<string, string>("Normalize")
            .Plus(NormalizeWithSplitAndJoin)
            .Plus(NormalizeWithRegex)
            .RunTests(bigString, normalized);

        suite.Display(ResultColumns.All, suite.FindBest());
    }

    private static readonly Regex MultipleSpaces =
        new Regex(@" {2,}", RegexOptions.Compiled);

    static string NormalizeWithRegex(string input)
    {
        return MultipleSpaces.Replace(input, " ");
    }

    // Guessing as the post doesn't specify what to use
    private static readonly char[] Whitespace =
        new char[] { ' ' };

    static string NormalizeWithSplitAndJoin(string input)
    {
        string[] split = input.Split
            (Whitespace, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", split);
    }
}
A few test runs:
c:\Users\Jon\Test>test 1000 50
============ Normalize ============
NormalizeWithSplitAndJoin 1159091 0:30.258 22.93
NormalizeWithRegex 26378882 0:30.025 1.00
c:\Users\Jon\Test>test 1000 5
============ Normalize ============
NormalizeWithSplitAndJoin 947540 0:30.013 1.07
NormalizeWithRegex 1003862 0:29.610 1.00
c:\Users\Jon\Test>test 1000 1001
============ Normalize ============
NormalizeWithSplitAndJoin 1156299 0:29.898 21.99
NormalizeWithRegex 23243802 0:27.335 1.00
Here the first number is the number of iterations, the second is the time taken, and the third is a scaled score with 1.0 being the best.
That shows that in at least some cases (including this one) a regular expression can outperform the Split/Join solution, sometimes by a very significant margin.
However, if you change to an "all whitespace" requirement, then Split/Join does appear to win. As is so often the case, the devil is in the detail...
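For reference, an "all whitespace" Split/Join could look like the sketch below. It is not the code that was benchmarked; the class and method names are hypothetical. It relies on the documented behaviour of string.Split, which treats whitespace characters as the delimiters when the separator array is null.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SplitJoinSketch
{
    public static string NormalizeAllWhitespace(string input)
    {
        // A null separator array makes Split use whitespace characters
        // as delimiters; RemoveEmptyEntries drops the empty strings
        // produced by consecutive delimiters.
        string[] parts = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts);
    }
}
```

Note that this still trims: "  a \t b\r\nc " becomes "a b c", with the leading and trailing whitespace removed. Whether that is acceptable depends on the requirements discussed above.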
Solution 3
A regular expression would be the easiest way. If you write the regex the correct way, you won't need multiple calls.
Change it to this:
s = System.Text.RegularExpressions.Regex.Replace(s, @"\s{2,}", " ");
Solution 4
While the existing answers are fine, I'd like to point out one approach which doesn't work:
public static string DontUseThisToCollapseSpaces(string text)
{
    while (text.IndexOf("  ") != -1)
    {
        text = text.Replace("  ", " ");
    }
    return text;
}
This can loop forever. Anyone care to guess why? (I only came across this when it was asked as a newsgroup question a few years ago... someone actually ran into it as a problem.)
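The comments on this answer point to the cause: IndexOf(string) performs a culture-sensitive search that can ignore characters such as the zero-width non-joiner (U+200C), whereas Replace(string, string) compares ordinally. So IndexOf can report a "double space" that Replace never finds. A minimal sketch of a loop that avoids the mismatch, assuming you only want to collapse literal U+0020 pairs (the names are hypothetical, and I have not checked the behaviour on every runtime):

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class CollapseLoopSketch
{
    public static string CollapseSpacesOrdinal(string text)
    {
        // StringComparison.Ordinal compares raw UTF-16 code units, which is
        // the same notion of equality that Replace(string, string) uses.
        // Each pass that finds a pair also removes at least one character,
        // so the loop terminates.
        while (text.IndexOf("  ", StringComparison.Ordinal) != -1)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }
}
```

With this version "a    b" becomes "a b", and a string such as "a \u200C b" is returned unchanged, because its two spaces are not adjacent in ordinal terms.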
Solution 5
A fast extra whitespace remover by Felipe Machado. (Modified by RW for multi-space removal)
static string DuplicateWhiteSpaceRemover(string str)
{
    var len = str.Length;
    var src = str.ToCharArray();
    int dstIdx = 0;
    bool lastWasWS = false; //Added line
    for (int i = 0; i < len; i++)
    {
        var ch = src[i];
        switch (ch)
        {
            case '\u0020': //SPACE
            case '\u00A0': //NO-BREAK SPACE
            case '\u1680': //OGHAM SPACE MARK
            case '\u2000': //EN QUAD
            case '\u2001': //EM QUAD
            case '\u2002': //EN SPACE
            case '\u2003': //EM SPACE
            case '\u2004': //THREE-PER-EM SPACE
            case '\u2005': //FOUR-PER-EM SPACE
            case '\u2006': //SIX-PER-EM SPACE
            case '\u2007': //FIGURE SPACE
            case '\u2008': //PUNCTUATION SPACE
            case '\u2009': //THIN SPACE
            case '\u200A': //HAIR SPACE
            case '\u202F': //NARROW NO-BREAK SPACE
            case '\u205F': //MEDIUM MATHEMATICAL SPACE
            case '\u3000': //IDEOGRAPHIC SPACE
            case '\u2028': //LINE SEPARATOR
            case '\u2029': //PARAGRAPH SEPARATOR
            case '\u0009': //[ASCII Tab]
            case '\u000A': //[ASCII Line Feed]
            case '\u000B': //[ASCII Vertical Tab]
            case '\u000C': //[ASCII Form Feed]
            case '\u000D': //[ASCII Carriage Return]
            case '\u0085': //NEXT LINE
                if (lastWasWS == false) //Added line
                {
                    src[dstIdx++] = ' '; // Updated by Ryan
                    lastWasWS = true; //Added line
                }
                continue;
            default:
                lastWasWS = false; //Added line
                src[dstIdx++] = ch;
                break;
        }
    }
    return new string(src, 0, dstIdx);
}
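As a point of comparison, the same in-place technique can be written with char.IsWhiteSpace instead of the explicit switch. This is only a sketch with hypothetical names. It is not one of the benchmarked functions, it may well be slower than the switch, and the exact set of characters char.IsWhiteSpace accepts can differ slightly between framework versions.

```csharp
// Hypothetical variant, for illustration only; not part of the benchmark.
public static class IsWhiteSpaceSketch
{
    public static string CollapseWhitespace(string str)
    {
        // Work on a copy of the characters; the write index never
        // overtakes the read position, so the copy can be reused as output.
        char[] buffer = str.ToCharArray();
        int dstIdx = 0;
        bool lastWasWhitespace = false;

        foreach (char ch in str)
        {
            if (char.IsWhiteSpace(ch))
            {
                // Emit one space for the first whitespace character of a run
                // and skip the rest of the run.
                if (!lastWasWhitespace)
                {
                    buffer[dstIdx++] = ' ';
                    lastWasWhitespace = true;
                }
            }
            else
            {
                buffer[dstIdx++] = ch;
                lastWasWhitespace = false;
            }
        }
        return new string(buffer, 0, dstIdx);
    }
}
```

For example, "a \t\r\n b  c" becomes "a b c", and " a " stays " a " because leading and trailing whitespace is collapsed but not trimmed.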
The benchmarks...
| | Time | TEST 1 | TEST 2 | TEST 3 | TEST 4 | TEST 5 |
| Function Name |(ticks)| dup. spaces | spaces+tabs | spaces+CR/LF| " " -> " " | " " -> " " |
|---------------------------|-------|-------------|-------------|-------------|-------------|-------------|
| SwitchStmtBuildSpaceOnly | 5.2 | PASS | FAIL | FAIL | PASS | PASS |
| InPlaceCharArraySpaceOnly | 5.6 | PASS | FAIL | FAIL | PASS | PASS |
| DuplicateWhiteSpaceRemover| 7.0 | PASS | PASS | PASS | PASS | PASS |
| SingleSpacedTrim | 11.8 | PASS | PASS | PASS | FAIL | FAIL |
| Fubo(StringBuilder) | 13 | PASS | FAIL | FAIL | PASS | PASS |
| User214147 | 19 | PASS | PASS | PASS | FAIL | FAIL |
| RegExWithCompile | 28 | PASS | FAIL | FAIL | PASS | PASS |
| SwitchStmtBuild | 34 | PASS | FAIL | FAIL | PASS | PASS |
| SplitAndJoinOnSpace | 55 | PASS | FAIL | FAIL | FAIL | FAIL |
| RegExNoCompile | 120 | PASS | PASS | PASS | PASS | PASS |
| RegExBrandon | 137 | PASS | FAIL | PASS | PASS | PASS |
Benchmark notes: Release Mode, no-debugger attached, i7 processor, avg of 4 runs, only short strings tested
SwitchStmtBuildSpaceOnly by Felipe Machado 2015 and modified by Sunsetquest
InPlaceCharArraySpaceOnly by Felipe Machado 2015 and modified by Sunsetquest
SwitchStmtBuild by Felipe Machado 2015 and modified by Sunsetquest
SwitchStmtBuild2 by Felipe Machado 2015 and modified by Sunsetquest
SingleSpacedTrim by David S 2013
Fubo(StringBuilder) by fubo 2014
SplitAndJoinOnSpace by Jon Skeet 2009
RegExWithCompile by Jon Skeet 2009
User214147 by user214147
RegExBrandon by Brandon
RegExNoCompile by Tim Hoolihan
Comments
-
Matt almost 2 years
Let's say I have a string such as:
"Hello   how are   you      doing?"
I would like a function that turns multiple spaces into one space.
So I would get:
"Hello how are you doing?"
I know I could use regex or call
string s = "Hello   how are   you      doing?".replace("  "," ");
But I would have to call it multiple times to make sure all sequential whitespaces are replaced with only one.
Is there already a built in method for this?
-
Jon Skeet almost 15 years
Could you clarify: are you only dealing with spaces, or "all" whitespace?
-
Jon Skeet almost 15 years
And do you want any non-space whitespace to be converted into spaces?
-
Michael Freidgeim almost 11 years
Possible duplicate of stackoverflow.com/questions/206717/…
-
smirkingman about 5 years
2 things to consider: 1. char.IsWhiteSpace includes carriage-return, linefeed etc. 2. 'whitespace' is probably more accurately tested with Char.GetUnicodeCategory(ch) = Globalization.UnicodeCategory.SpaceSeparator
-
Scott Dorman almost 15 years
Using a regular expression introduces a lot of overhead that isn't necessary.
-
Jon Skeet almost 15 years
If the regular expression is compiled and cached, I'm not sure that has more overhead than splitting and joining, which could create loads of intermediate garbage strings. Have you done careful benchmarks of both approaches before assuming that your way is faster?
-
Tim Hoolihan almost 15 years
imo, avoiding regex if you're comfortable with them is premature optimization
-
Tim Hoolihan almost 15 years
whitespace is undeclared here
-
赵君君 almost 15 years
I think I remember this question being asked a while back on SO. IndexOf ignores certain characters that Replace doesn't. So the double space was always there, just never removed.
-
Jon Skeet almost 15 years
Speaking of overhead, why on earth are you calling source.ToCharArray() and then throwing away the result?
-
Daniel almost 15 years
If your application isn't time critical, it can afford the 1 microsecond of processing overhead.
-
Jon Skeet almost 15 years
And calling ToCharArray() on the result of string.Join, only to create a new string... wow, for that to be in a post complaining of overhead is just remarkable. -1.
-
Jon Skeet almost 15 years
Oh, and assuming whitespace is new char[] { ' ' }, this will give the wrong result if the input string starts or ends with a space.
-
Scott Dorman almost 15 years
No, I've not done benchmarks, but I know there is higher overhead for RegEx compared to the Split and Join. From what it looks like Split and Join either use character buffers, treat the string as an array of characters or go through unsafe code to do pointer manipulations.
-
Scott Dorman almost 15 years
grrr...copied from a larger example...updated to reflect the comments.
-
Jon Skeet almost 15 years
"Knowing" there's a higher overhead for regexes isn't nearly as good as proving it with benchmarks. I'm running benchmarks now, and will post results soon.
-
Bart Kiers almost 15 years
Note that '\s' not only replaces white spaces, but also new line characters.
-
Tim Hoolihan almost 15 years
good catch, if you just want spaces switch the pattern to "[ ]+"
-
ahawker almost 15 years
It is because IndexOf ignores some Unicode characters, the specific culprit in this case being some Asian character iirc. Hmm, zero-width non-joiner according to the Google.
-
Scott Dorman almost 15 years
Great analysis. So it appears that we were both correct to varying degrees. The code in my answer was taken from a larger function which has the ability to normalize all whitespace and/or control characters from within a string and from the beginning and end.
-
Jon Skeet almost 15 years
With just the whitespace characters you specified, in most of my tests the regex and Split/Join were about equal - S/J had a tiny, tiny benefit, at the cost of correctness and complexity. For those reasons, I'd normally prefer the regex. Don't get me wrong - I'm far from a regex fanboy, but I don't like writing more complex code for the sake of performance without really testing the performance first.
-
angularsen over 12 years
Shouldn't you use '{2,}' instead of '+' to avoid replacing single whitespaces?
-
Ian Ringrose over 10 years
NormalizeWithSplitAndJoin will create a lot more garbage; it is hard to tell if a real problem will get hit with more GC time than the benchmark.
-
Antonio Bakula almost 9 years
I learned that the hard way :( stackoverflow.com/questions/9260693/…
-
Herman over 8 years
This replaces all non-word characters with space. So it would also replace things like brackets and quotes etc, which might not be what you want.
-
Efrain over 7 years
Exactly! To me this is the most elegant approach, too. So for the record, in C# that would be:
string.Join(" ", myString.Split(' ').Where(s => s != "").ToArray())
-
David over 7 years
Minor improvement on the Split to catch all whitespace and remove the Where clause: myString.Split(null as char[], StringSplitOptions.RemoveEmptyEntries)
-
Dronz about 6 years
@IanRingrose What sort of garbage can be created?
-
David Specht over 5 years
@angularsen My one issue with @"\s{2,}" is that it fails to replace single tabs and other Unicode space characters with a space. @"\s+" will do that for you.
-
David Specht over 5 years
My one issue with @"\s{2,}" is that it fails to replace single tabs and other Unicode space characters with a space. If you are going to replace 2 tabs with a space, then you should probably replace 1 tab with a space. @"\s+" will do that for you.
-
Loudenvier about 5 years
Nice to see my article referenced here! (I'm Felipe Machado) I'm about to update it using a proper benchmark tool called BenchmarkDotNet! I'll try to set up runs in all runtimes (now that we have DOT NET CORE and the likes)...
-
SunsetQuest about 5 years
@Loudenvier - Nice work on this. Yours was the quickest by almost 400%! .Net Core is like a free 150-200% performance boost. It's getting closer to c++ performance but much easier to code. Thanks for the comment.
-
Martin Brabec over 4 years
I learned the hard way. Especially with two Zero Width Non Joiners (\u200C\u200C). IndexOf returns the index of this "double space", but Replace does not replace it. I think it is because for IndexOf, you need to specify StringComparison (Ordinal) to behave the same as Replace. This way, neither of these two will locate "double spaces". More about StringComparison: docs.microsoft.com/en-us/dotnet/api/…
-
Evil Pigeon almost 4 years
This only does spaces, not other white space characters. Maybe you want char.IsWhiteSpace(ch) instead of src[i] == '\u0020'. I notice this has been edited by the community. Did they bork it up?