Best way to split string into lines

c# string syntax multiline

194,807

Solution 1

If it looks ugly, just remove the unnecessary ToCharArray call.
If you want to split by either \n or \r, you've got two options:
- Use an array literal – but this will give you empty lines for Windows-style line endings \r\n:
```
var result = text.Split(new [] { '\r', '\n' });
```
- Use a regular expression, as indicated by Bart:
```
var result = Regex.Split(text, "\r\n|\r|\n");
```
If you want to preserve empty lines, why do you explicitly tell C# to throw them away (via the StringSplitOptions.RemoveEmptyEntries argument)? Use StringSplitOptions.None instead.
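
As a rough illustration of the two options above, the following sketch runs both on the same Windows-style text. The sample text and variable names are made up for the example, and the snippet assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample text with Windows-style line endings and one genuinely empty line.
string text = "first\r\nsecond\r\n\r\nfourth";

// Splitting on individual characters treats '\r' and '\n' as separate delimiters,
// so every "\r\n" pair produces an extra empty entry.
string[] byChars = text.Split(new[] { '\r', '\n' });

// The regular expression tries "\r\n" first, so each pair counts as one line break.
string[] byRegex = Regex.Split(text, "\r\n|\r|\n");

Console.WriteLine(byChars.Length); // 7
Console.WriteLine(byRegex.Length); // 4
```

Only the regular expression keeps the single empty line without adding spurious ones.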

Solution 2

```
using (StringReader sr = new StringReader(text)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // do something
    }
}
```
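
A minimal sketch of how the loop above might be used to collect lines from text with mixed line endings. The sample text and the `lines` list are assumptions for illustration; the snippet assumes C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Made-up input mixing "\r\n", "\n" and "\r" line endings.
string text = "alpha\r\nbeta\n\ngamma\rdelta";

var lines = new List<string>();
using (var reader = new StringReader(text))
{
    string? line;
    // ReadLine treats "\r\n", "\n" and "\r" each as one line break.
    while ((line = reader.ReadLine()) != null)
    {
        lines.Add(line);
    }
}

Console.WriteLine(string.Join("|", lines)); // alpha|beta||gamma|delta
```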

Solution 3

Update: See here for an alternative/async solution.

This works great and is faster than Regex:

```
input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
```

It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:

```
Regex.Split(input, "\r\n|\r|\n")

Regex.Split(input, "\r?\n|\r")
```

Except that Regex turns out to be roughly 8 times slower in my test (about 32 s versus 3.9 s for 100,000 iterations). Here's the test:

```
Action<Action> measure = (Action func) => {
    var start = DateTime.Now;
    for (int i = 0; i < 100000; i++) {
        func();
    }
    var duration = DateTime.Now - start;
    Console.WriteLine(duration);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
);

measure(() =>
    Regex.Split(input, "\r\n|\r|\n")
);

measure(() =>
    Regex.Split(input, "\r?\n|\r")
);
```

Output:

```
00:00:03.8527616
00:00:31.8017726
00:00:32.5557128
```

and here's the Extension Method:

```
public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        return str.Split(new[] { "\r\n", "\r", "\n" },
            removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
    }
}
```

Usage:

```
input.GetLines()      // keeps empty lines
input.GetLines(true)  // removes empty lines
```
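
To illustrate why "\r\n" has to come first in the separator array, here is a small sketch comparing two orderings on the same input. The input string and variable names are made up; the snippet assumes C# 9 top-level statements.

```csharp
using System;

// A single Windows-style line break between two lines.
string input = "a\r\nb";

// "\r\n" listed first: the pair is consumed as one separator.
string[] pairFirst = input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

// "\r" listed first: it matches before "\r\n" is tried at the same position,
// and the leftover "\n" then counts as a second separator.
string[] pairLast = input.Split(new[] { "\r", "\n", "\r\n" }, StringSplitOptions.None);

Console.WriteLine(pairFirst.Length); // 2
Console.WriteLine(pairLast.Length);  // 3
```

With the second ordering, every "\r\n" in the input yields a spurious empty line.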

Solution 4

You could use Regex.Split:

```
string[] tokens = Regex.Split(input, @"\r?\n|\r");
```

Edit: added |\r to account for (older) Mac line terminators.

Solution 5

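If you want to keep empty lines, just remove the StringSplitOptions argument. Note that ToCharArray turns Environment.NewLine into the separate delimiters '\r' and '\n' on Windows, so each "\r\n" in the input still yields one extra empty entry.

```
var result = input.Split(System.Environment.NewLine.ToCharArray());
```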


Konstantin Spirin

Updated on May 06, 2021

Comments

Konstantin Spirin about 3 years
How do you split multi-line string into lines?

I know this way
```
var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
```
looks a bit ugly and loses empty lines. Is there a better solution?
- Robin Bennett almost 5 years
  
  Possible duplicate of Easiest way to split a string on newlines in .NET?
- Caius Jard over 2 years
  
  Yes, you use the exact line delimiter present in the file, e.g. just "\r\n" or just "\n" rather than using either \r or \n and ending up with a load of blank lines on windows-created files. What system uses LFCR line endings, btw?
Konrad Rudolph over 14 years

This won’t work on OS X style text files though, since these use only \r as line ending.
Bart Kiers over 14 years

@Konrad Rudolph: AFAIK, '\r' was used on very old MacOS systems and is almost never encountered anymore. But if the OP needs to account for it (or if I'm mistaken), then the regex can easily be extended to account for it of course: \r?\n|\r
Konstantin Spirin over 14 years

Removing ToCharArray will make code platform-specific (NewLine can be '\n')
Konstantin Spirin over 14 years

NewLine can be '\n' and input text can contain "\n\r".
Konrad Rudolph over 14 years

@Bart: I don’t think you’re mistaken but I have repeatedly encountered all possible line endings in my career as a programmer.
Bart Kiers over 14 years

@Konrad, you're probably right. Better safe than sorry, I guess.
Admin over 13 years

@Kon you should use Environment.NewLine if that is your concern. Or do you mean the origin of the text, rather than the location of execution?
Konrad Rudolph over 13 years

@Will: on the off chance that you were referring to me instead of Konstantin: I believe (strongly) that parsing code should strive to work on all platforms (i.e. it should also read text files that were encoded on different platforms than the executing platform). So for parsing, Environment.NewLine is a no-go as far as I’m concerned. In fact, of all the possible solutions I prefer the one using regular expressions since only that handles all source platforms correctly.
Admin over 13 years

lol didn't notice the name similarity. I agree completely in this case.
Konrad Rudolph over 12 years

@Hamish Well just look at the documentation of the enum, or look in the original question! It’s StringSplitOptions.RemoveEmptyEntries.
Hamish Grubijan over 12 years

Ah I see, my bad, I was looking within RegexOptions; have not had my coffee yet.
username about 12 years

How about the text that contains '\r\n\r\n'. string.Split will return 4 empty lines, however with '\r\n' it should give 2. It gets worse if '\r\n' and '\r' are mixed in one file.
Konrad Rudolph about 12 years

@SurikovPavel Use the regular expression. That is definitely the preferred variant, as it works correctly with any combination of line endings.
primo over 10 years

This is the cleanest approach, in my subjective opinion.
Mohit Jain almost 10 years

Please add some more details to make your answer more useful for readers.
orad over 9 years

Done. Also added a test to compare its performance with Regex solution.
ΩmegaMan about 9 years

Less backtracking and same functionality with [\r\n]{1,2}
ΩmegaMan about 9 years

Somewhat faster pattern due to less backtracking with the same functionality if one uses [\r\n]{1,2}
orad about 9 years

@OmegaMan That has some different behavior. It will match \n\r or \n\n as single line-break which is not correct.
ΩmegaMan about 9 years

@orad I won't argue with you, but if the data has line feeds in multiple numbers...there most likely is something wrong with the data; let us call it an edge case.
Brandin over 8 years

@OmegaMan How is Hello\n\nworld\n\n an edge case? It is clearly one line with text, followed by an empty line, followed by another line with text, followed by an empty line.
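
The difference discussed in this exchange can be checked with a short sketch. It uses the example string from the comment above; the variable names are made up, and the snippet assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Hello\n\nworld\n\n";

// [\r\n]{1,2} swallows "\n\n" as a single separator, so the empty lines disappear.
string[] merged = Regex.Split(input, "[\r\n]{1,2}");

// \r?\n|\r treats every "\n" as its own line break, so the empty lines survive.
string[] exact = Regex.Split(input, "\r?\n|\r");

Console.WriteLine(merged.Length); // 3: "Hello", "world", ""
Console.WriteLine(exact.Length);  // 5: "Hello", "", "world", "", ""
```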
James Holwell over 6 years

I do wonder if this is because you aren't actually inspecting the results of the enumerator, and therefore it isn't getting executed. Unfortunately, I'm too lazy to check.
JCH2k over 6 years

Yes, it actually is!! When you add .ToList() to both the calls, the StringReader solution is actually slower! On my machine it is 6.74s vs. 5.10s
orad over 6 years

That makes sense. I still prefer this method because it lets me get lines asynchronously.
JCH2k over 6 years

Maybe you should remove the "better solution" header on your other answer and edit this one...
Ken Clement over 6 years

A minor point - I usually go with the verbatim string literal in the second argument to Regex.Split, i.e. - var result = Regex.Split(text, @"\r\n|\r|\n"); In this case it works either way because the C# compiler interprets \n and \r in the same way that the regular expression parser does. In the general case though it might cause problems.
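
A small sketch of the point made in this comment. The line-break case comes from the comment itself; the "\b" case is an added illustration of the "general case", chosen because C# and the regex engine read that escape differently. The snippet assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "one\r\ntwo\nthree";

// For \r and \n, the regular and verbatim literals produce the same split.
string[] regular = Regex.Split(text, "\r\n|\r|\n");
string[] verbatim = Regex.Split(text, @"\r\n|\r|\n");
bool same = string.Join("|", regular) == string.Join("|", verbatim);
Console.WriteLine(same); // True

// "\b" is a backspace character to the C# compiler, whereas @"\b"
// reaches the regex engine as a word-boundary assertion.
bool backspace = Regex.IsMatch("a b", "\b");
bool boundary = Regex.IsMatch("a b", @"\b");
Console.WriteLine(backspace); // False
Console.WriteLine(boundary);  // True
```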
Mark A. Donohoe over 5 years

Just adding my 2c worth. Since the OP wants to keep blank lines, you can't write a parser that works for any type of environment and/or handles mixed cases (i.e. the RegEx), because if you have '\n\r' how do you know it's one 'break' instead of two that are just encoded wrong? If it's the latter, it would be two blank lines, but if it's the former, it would be only one. You have to ask what is the source of the encodings. If the source is on the same platform as the parser (regardless of what platform it is) then you can use Environment.NewLine as the source is known.
Konrad Rudolph over 5 years

@MarqueIV There are different possible answers to this, all valid. One is to expect and require consistent text files. Another one is to not accept "\r" on its own as a line separator (because, let’s face it, no system has used this convention in well over a decade): the only actually used conventions are "\r\n" and "\n". In fact, your example ("\n\r") has never been a valid line break anywhere. Either read it as two line breaks or throw an error, but certainly don’t treat it as a single line break.
Mark A. Donohoe over 5 years

First things first, my text was a typo. Use '\r\n' and my point is still the same: you can't write a universal parser on a system if you're required to keep blank lines. Note that by adding the restriction that you're not accepting '\r' by itself, and you only want to use '\n' to detect new lines, you no longer have a universal parser, essentially proving my point that without such limitations, it can't (easily*) be done, and chances are it doesn't need to be in the first place. (*It can, by playing with RegEx ordering and such, but that just makes it much slower.)
Konrad Rudolph over 5 years

@MarqueIV I think you misread my comment: since "\r" is never used as a delimiter, you can easily write a universal parser that accepts all actually used delimiters. It’s done by simply splitting on "\r\n|\n". There’s no need for anything more fancy than that. But, honestly, in practice there’s nothing wrong with the regex code shown in my answer, and it will work just fine with a file that mixes different styles of line breaks, including the obsolete "\r".
Mark A. Donohoe over 5 years

If you have input that has mixed styles like you said, there's no way to differentiate between '\n\r' and '\n' and '\r' without making the assumption that there will never be an '\r', and when you make that assumption, then you've removed the condition that I just mentioned that causes the ambiguity. Plus, you can't make that assumption anyway as there are plenty of embedded hardware systems that use '\r'. That's why terminals give you three choices for line breaks. You need to know your input up front. I guess we'll just have to disagree and each use what works for us.
Konrad Rudolph over 5 years

@MarqueIV That’s why my previous comment says “in practice” it works. You’re arguing from a pretty unlikely case. Yes, obviously such cases are ambiguous but I contend that they are not relevant enough to care, and these ambiguities are fundamentally unresolvable, anyway: no parsing strategy will work since the ambiguity is then in the data itself, not in the parsing process.
Mark A. Donohoe over 5 years

But I believe you just made my point for me. That's exactly why I just use Environment.NewLine by default, and only use something like the RegEx solution if you venture outside the realm of the more-likely scenarios. It happens, but as they say, a giant time-killer is implementing solutions for things that might happen, rather than things that do. Sure, plan for the future of course (i.e. don't design yourself into a corner where you can't make the change later), but don't actually implement a future until you actually need to. In other words, I don't think our points are that far off.
Konrad Rudolph over 5 years

@MarqueIV “That's exactly why I just use Environment.NewLine” — but that’s the worst thing you can do because now you start breaking lots of actual files, whereas my solution breaks approximately zero actually existing files. Check out how many modern text editors use only the system’s newline for line breaks (hint: none do).
Mark A. Donohoe over 5 years

Nothing is broken if you're never planning on getting anything that doesn't match your platform's encoding. If you know that (just like you know there may never be a '\r') then you're optimizing your results, not wasting time running things through a RegEx engine that don't need to be, which can kill a time-critical application. If you will have multiple encodings, then use the RegEx. You just can't do universal. Again, I don't think we're arguing the same point. You've made yours and I've made a different one. Tangential, but not in contradiction.
Konrad Rudolph over 5 years

@MarqueIV I honestly have trouble understanding your use-case: You don’t need to go beyond your current platform to encounter text files that use different line ending conventions. I know for a fact that my current system contains files with different conventions (I edited one just yesterday, and I only know about the diverging line endings because diff flagged them). This isn’t “planning for the future”, it’s making code robust for the here and now.
Mark A. Donohoe over 5 years

Plus, taking a step back, one could argue that if you do need blank lines but don't enforce a standard for line encodings, then you're just asking for trouble anyway. After all, if you skip blank lines, you can write a universal parser, rendering this entire convo thread obsolete! :)
Mark A. Donohoe over 5 years

And in your case, I'd argue the 'platform' is you using editing tools that may have differing line endings, hence you getting your diff. But if you're using a known format, for instance from another system, and not something manually edited, then there's no need to plan for that case and you can increase processing throughput by not doing so. Again, we're not arguing the same point! Time and place. If you're taking in user-editable files, then I 100% agree with you. But if you're taking in system-generated files from a known system on the same platform, then I stand by my original statement. :)
Konrad Rudolph over 5 years

@MarqueIV No, nothing was mangled. The files have different (but internally consistent) line endings because they were created by different people, on different platforms. Yet they end up on my machine. And I want to emphasise that we are very much arguing the same point, because I’m fundamentally not understanding where your potential use-case exists. I simply don’t see when it would be more useful, and produce fewer problems, to split on a platform hard-coded newline rather than using my heuristic, which I (and clearly many others) have found to work in 100% of real files.
Mark A. Donohoe over 5 years

"Created by different people, on different platforms". That is a different use-case than something say from a web service where the line endings are predictable and consistent. And if that system is on the same platform, then you can use Environment.NewLine and crush the performance of RegEx. Again, time and place. I plan for, but don't implement solutions for things until they happen. Just like the code, developer productivity is also increased.
Mark A. Donohoe over 5 years

To hopefully appease you, if you're saying you need a system that has to detect blank lines, and you are taking files created on platforms with differing line endings, and you're guaranteeing you will never get '\r' by itself and/or your line endings will be consistent in the same file (which you can't if it's edited on machines with two different line endings and all line endings aren't updated), then I agree... the regex works. But I'm saying if you can't make those guarantees, it won't because you then won't be able to differentiate between '\n\r' and '\n' and '\r'. Make sense?
Mark A. Donohoe over 5 years

In fairness, nothing will work in that case, not just RegEx because there is no standard for the line endings on the parser, which brings me back to one of my earlier points, if you are saying blank lines are important to you, then you must define what represents a blank line or you can't answer the above question (without those other guarantees that is.)
Mic over 5 years

More precision might help: it is not possible to write a parser to handle a combination of all cases, the RE here will handle combinations of any two cases in one file.
Uwe Keim over 5 years

Any idea in terms of performance (compared to string.Split or Regex.Split)?
Mike Rosoft about 5 years

@ΩmegaMan: That will lose empty lines, e.g. \n\n.
Konstantin Spirin over 3 years

Interesting! Should it implement IEnumerable<>?
Alielson Piffer over 2 years

I like this solution a lot, but I found a minor problem: when the last line is empty, it's ignored (only the last one). So, "example" and "example\r\n" will both produce only one line while "example\r\n\r\n" will produce two lines. This behavior is discussed here: github.com/dotnet/runtime/issues/27715
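
The trailing-line behavior described in this comment can be reproduced with a short sketch that compares the StringReader loop with the Split approach. The helper names `CountReaderLines` and `CountSplitLines` are hypothetical, and the snippet assumes C# 9 top-level statements.

```csharp
using System;
using System.IO;

// Hypothetical helper: counts the lines that StringReader.ReadLine reports.
int CountReaderLines(string text)
{
    int count = 0;
    using var reader = new StringReader(text);
    while (reader.ReadLine() != null)
    {
        count++;
    }
    return count;
}

// Hypothetical helper: counts the entries produced by the Split approach.
int CountSplitLines(string text) =>
    text.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None).Length;

foreach (var sample in new[] { "example", "example\r\n", "example\r\n\r\n" })
{
    Console.WriteLine($"{CountReaderLines(sample)} vs {CountSplitLines(sample)}");
}
// Prints:
// 1 vs 1
// 1 vs 2
// 2 vs 3
```

Split reports the empty entry after a final line break, while ReadLine does not, so the two approaches differ by one line whenever the text ends with a line break.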