Best way to split string into lines

194,807

Solution 1

  • If it looks ugly, just remove the unnecessary ToCharArray call.

  • If you want to split by either \n or \r, you've got two options:

    • Use an array literal – but this will give you empty lines for Windows-style line endings \r\n:

      var result = text.Split(new [] { '\r', '\n' });
      
    • Use a regular expression, as indicated by Bart:

      var result = Regex.Split(text, "\r\n|\r|\n");
      
  • If you want to preserve empty lines, why do you explicitly tell C# to throw them away? (StringSplitOptions parameter) – use StringSplitOptions.None instead.
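The difference between the two options above can be seen in a minimal sketch. The sample text is a hypothetical input with Windows-style line endings and one genuinely empty line; it is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: three lines, the middle one empty, Windows-style endings.
var text = "first\r\n\r\nthird";

// Splitting on individual characters treats '\r' and '\n' as separate breaks,
// so every "\r\n" pair contributes an extra empty entry.
string[] byChars = text.Split(new[] { '\r', '\n' });

// The regex alternation tries "\r\n" first, so each pair counts as one break.
string[] byRegex = Regex.Split(text, "\r\n|\r|\n");

Console.WriteLine(byChars.Length); // 5: "first", "", "", "", "third"
Console.WriteLine(byRegex.Length); // 3: "first", "", "third"
```

With the character array there is no way to tell the one real empty line from the artifacts of the split, which is why the regex (or a string-array split that lists "\r\n" first) is the safer choice for mixed input.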

Solution 2

using (StringReader sr = new StringReader(text)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // do something
    }
}
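As a rough sketch, the loop above can be wrapped in an iterator so that callers get the lines as a sequence. ReadLines is a hypothetical helper name, not something from the answer, and the top-level statements assume C# 9 or later. Because the iterator is lazy, nothing is read until the result is enumerated.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: yields each line that StringReader.ReadLine returns.
static IEnumerable<string> ReadLines(string text)
{
    using (var reader = new StringReader(text))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

// ReadLine accepts "\r", "\n" and "\r\n" as line terminators.
var lines = ReadLines("1\r2\r\n\r\n3\n4").ToList();
Console.WriteLine(lines.Count); // 5: "1", "2", "", "3", "4"

// A terminator at the very end does not produce a final empty line.
int withTrailingBreak = ReadLines("example\r\n").Count();
int withTwoTrailingBreaks = ReadLines("example\r\n\r\n").Count();
Console.WriteLine(withTrailingBreak);     // 1
Console.WriteLine(withTwoTrailingBreaks); // 2
```

Any timing comparison needs to force enumeration (for example with ToList), otherwise the loop body never runs.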

Solution 3

Update: See here for an alternative/async solution.


This works great and is faster than Regex:

input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)

It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:

Regex.Split(input, "\r\n|\r|\n")

Regex.Split(input, "\r?\n|\r")

Except that Regex turns out to be roughly 8 times slower (about 32 s versus 3.9 s in the run below). Here's my test:

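// Requires: using System.Diagnostics;
Action<Action> measure = (Action func) => {
    // Stopwatch is better suited to timing code than DateTime.Now.
    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < 100000; i++) {
        func();
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
};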

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
);

measure(() =>
    Regex.Split(input, "\r\n|\r|\n")
);

measure(() =>
    Regex.Split(input, "\r?\n|\r")
);

Output:

00:00:03.8527616

00:00:31.8017726

00:00:32.5557128

And here's the extension method:

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        return str.Split(new[] { "\r\n", "\r", "\n" },
            removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines
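The remark above about listing "\r\n" first can be checked with a small sketch. The input string and variable names are illustrative only, and the top-level statements assume C# 9 or later. String.Split scans left to right and, at each position, uses the first separator in the array that matches.

```csharp
using System;

// Hypothetical input: two lines joined by one Windows-style line break.
var input = "a\r\nb";

// "\r\n" listed first: the pair is consumed as a single separator.
string[] pairFirst = input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

// "\r" listed first: it matches before "\r\n" is tried, and the leftover "\n"
// then counts as a second break, producing a spurious empty entry.
string[] pairLast = input.Split(new[] { "\r", "\n", "\r\n" }, StringSplitOptions.None);

Console.WriteLine(pairFirst.Length); // 2: "a", "b"
Console.WriteLine(pairLast.Length);  // 3: "a", "", "b"
```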

Solution 4

You could use Regex.Split:

string[] tokens = Regex.Split(input, @"\r?\n|\r");

Edit: added |\r to account for (older) Mac line terminators.

Solution 5

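If you want to keep empty lines, just remove the StringSplitOptions argument. Note that this splits on the individual characters of Environment.NewLine, so where that is "\r\n", every Windows-style line break in the input also produces an extra empty entry.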

var result = input.Split(System.Environment.NewLine.ToCharArray());

Author: Konstantin Spirin

Passionate software developer

Updated on May 06, 2021

Comments

  • Konstantin Spirin
    Konstantin Spirin about 3 years

    How do you split a multi-line string into lines?

    I know this way

    var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    

    looks a bit ugly and loses empty lines. Is there a better solution?

    • Caius Jard
      Caius Jard over 2 years
      Yes, you use the exact line delimiter present in the file, e.g. just "\r\n" or just "\n", rather than using either \r or \n and ending up with a load of blank lines on Windows-created files. What system uses LFCR line endings, btw?
  • Konrad Rudolph
    Konrad Rudolph over 14 years
    This won’t work on OS X style text files though, since these use only \r as line ending.
  • Bart Kiers
    Bart Kiers over 14 years
    @Konrad Rudolph: AFAIK, '\r' was used on very old MacOS systems and is almost never encountered anymore. But if the OP needs to account for it (or if I'm mistaken), then the regex can easily be extended to account for it of course: \r?\n|\r
  • Konstantin Spirin
    Konstantin Spirin over 14 years
    Removing ToCharArray will make code platform-specific (NewLine can be '\n')
  • Konstantin Spirin
    Konstantin Spirin over 14 years
    NewLine can be '\n' and input text can contain "\n\r".
  • Konrad Rudolph
    Konrad Rudolph over 14 years
    @Bart: I don’t think you’re mistaken but I have repeatedly encountered all possible line endings in my career as a programmer.
  • Bart Kiers
    Bart Kiers over 14 years
    @Konrad, you're probably right. Better safe than sorry, I guess.
  • Admin
    Admin over 13 years
    @Kon you should use Environment.NewLine if that is your concern. Or do you mean the origin of the text, rather than the location of execution?
  • Konrad Rudolph
    Konrad Rudolph over 13 years
    @Will: on the off chance that you were referring to me instead of Konstantin: I believe (strongly) that parsing code should strive to work on all platforms (i.e. it should also read text files that were encoded on different platforms than the executing platform). So for parsing, Environment.NewLine is a no-go as far as I’m concerned. In fact, of all the possible solutions I prefer the one using regular expressions since only that handles all source platforms correctly.
  • Admin
    Admin over 13 years
    lol didn't notice the name similarity. I agree completely in this case.
  • Konrad Rudolph
    Konrad Rudolph over 12 years
    @Hamish Well just look at the documentation of the enum, or look in the original question! It’s StringSplitOptions.RemoveEmptyEntries.
  • Hamish Grubijan
    Hamish Grubijan over 12 years
    Ah I see, my bad, I was looking within RegexOptions; have not had my coffee yet.
  • username
    username about 12 years
    How about the text that contains '\r\n\r\n'. string.Split will return 4 empty lines, however with '\r\n' it should give 2. It gets worse if '\r\n' and '\r' are mixed in one file.
  • Konrad Rudolph
    Konrad Rudolph about 12 years
    @SurikovPavel Use the regular expression. That is definitely the preferred variant, as it works correctly with any combination of line endings.
  • primo
    primo over 10 years
    This is the cleanest approach, in my subjective opinion.
  • Mohit Jain
    Mohit Jain almost 10 years
    Please add some more details to make your answer more useful for readers.
  • orad
    orad over 9 years
    Done. Also added a test to compare its performance with Regex solution.
  • ΩmegaMan
    ΩmegaMan about 9 years
    Less backtracking and same functionality with [\r\n]{1,2}
  • ΩmegaMan
    ΩmegaMan about 9 years
    Somewhat faster pattern due to less backtracking with the same functionality if one uses [\r\n]{1,2}
  • orad
    orad about 9 years
    @OmegaMan That has some different behavior. It will match \n\r or \n\n as single line-break which is not correct.
  • ΩmegaMan
    ΩmegaMan about 9 years
    @orad I won't argue with you, but if the data has line feeds in multiple numbers...there most likely is something wrong with the data; let us call it an edge case.
  • Brandin
    Brandin over 8 years
    @OmegaMan How is Hello\n\nworld\n\n an edge case? It is clearly one line with text, followed by an empty line, followed by another line with text, followed by an empty line.
  • James Holwell
    James Holwell over 6 years
    I do wonder if this is because you aren't actually inspecting the results of the enumerator, and therefore it isn't getting executed. Unfortunately, I'm too lazy to check.
  • JCH2k
    JCH2k over 6 years
    Yes, it actually is!! When you add .ToList() to both the calls, the StringReader solution is actually slower! On my machine it is 6.74s vs. 5.10s
  • orad
    orad over 6 years
    That makes sense. I still prefer this method because it lets me get lines asynchronously.
  • JCH2k
    JCH2k over 6 years
    Maybe you should remove the "better solution" header on your other answer and edit this one...
  • Ken Clement
    Ken Clement over 6 years
    A minor point - I usually go with the verbatim string literal in the second argument to Regex.Split, i.e. - var result = Regex.Split(text, @"\r\n|\r|\n"); In this case it works either way because the C# compiler interprets \n and \r in the same way that the regular expression parser does. In the general case though it might cause problems.
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    Just adding my 2c worth. Since the OP wants to keep blank lines, you can't write a parser that works for any type of environment and/or handles mixed cases (i.e. the RegEx), because if you have '\n\r' how do you know it's one 'break' instead of two that are just encoded wrong? If it's the latter, it would be two blank lines, but if it's the former, it would only be one. You have to ask what is the source of the encodings. If the source is on the same platform as the parser (regardless of what platform it is) then you can use Environment.NewLine as the source is known.
  • Konrad Rudolph
    Konrad Rudolph over 5 years
    @MarqueIV There are different possible answers to this, all valid. One is to expect and require consistent text files. Another one is to not accept "\r" on its own as a line separator (because, let’s face it, no system has used this convention in well over a decade): the only actually used conventions are "\r\n" and "\n". In fact, your example ("\n\r") has never been a valid line break anywhere. Either read it as two line breaks or throw an error, but certainly don’t treat it as a single line break.
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    First things first, my text was a typo. Use '\r\n' and my point is still the same: you can't write a universal parser on a system if you're required to keep blank lines. Note that by adding the restriction that you're not accepting '\r' by itself, and that you only want to use '\n' to detect new lines, you no longer have a universal parser, essentially proving my point that without such limitations it can't (easily*) be done, and chances are it doesn't need to be in the first place. (*It can, by playing with RegEx ordering and such, but that just makes it much slower.)
  • Konrad Rudolph
    Konrad Rudolph over 5 years
    @MarqueIV I think you misread my comment: since "\r" is never used as a delimiter, you can easily write a universal parser that accepts all actually used delimiters. It’s done by simply splitting on "\r\n|\n". There’s no need for anything more fancy than that. But, honestly, in practice there’s nothing wrong with the regex code shown in my answer, and it will work just fine with a file that mixes different styles of line breaks, including the obsolete "\r".
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    If you have input that has mixed styles like you said, there's no way to differentiate between '\n\r' and '\n' and '\r' without making the assumption that there will never be an '\r', and when you make that assumption, then you've removed the condition that I just mentioned that causes the ambiguity. Plus, you can't make that assumption anyway as there are plenty of embedded hardware systems that use '\r'. That's why terminals give you three choices for line breaks. You need to know your input up front. I guess we'll just have to disagree and each use what works for us.
  • Konrad Rudolph
    Konrad Rudolph over 5 years
    @MarqueIV That’s why my previous comment says “in practice” it works. You’re arguing from a pretty unlikely case. Yes, obviously such cases are ambiguous but I contend that they are not relevant enough to care, and these ambiguities are fundamentally unresolvable, anyway: no parsing strategy will work since the ambiguity is then in the data itself, not in the parsing process.
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    But I believe you just made my point for me. That's exactly why I just use Environment.NewLine by default, and only use something like the RegEx solution if you venture outside the realm of the more-likely scenarios. It happens, but as they say, a giant time-killer is implementing solutions for things that might happen, rather than things that do. Sure, plan for the future of course (i.e. don't design yourself into a corner where you can't make the change later), but don't actually implement a future until you actually need to. In other words, I don't think our points are that far off.
  • Konrad Rudolph
    Konrad Rudolph over 5 years
    @MarqueIV “That's exactly why I just use Environment.NewLine” — but that’s the worst thing you can do because now you start breaking lots of actual files, whereas my solution breaks approximately zero actually existing files. Check out how many modern text editors use only the system’s newline for line breaks (hint: none do).
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    Nothing is broken if you're never planning on getting anything that doesn't match your platform's encoding. If you know that (just like you know there may never be a '\r') then you're optimizing your results, not wasting time running things through a RegEx engine that don't need to be, which can kill a time-critical application. If you will have multiple encodings, then use the RegEx. You just can't do universal. Again, I don't think we're arguing the same point. You've made yours and I've made a different one. Tangential, but not in contradiction.
  • Konrad Rudolph
    Konrad Rudolph over 5 years
    @MarqueIV I honestly have trouble understanding your use-case: You don’t need to go beyond your current platform to encounter text files that use different line ending conventions. I know for a fact that my current system contains files with different conventions (I edited one just yesterday, and I only know about the diverging line endings because diff flagged them). This isn’t “planning for the future”, it’s making code robust for the here and now.
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    Plus, taking a step back, one could argue that if you do need blank lines but don't enforce a standard for line encodings, then you're just asking for trouble anyway. After all, if you skip blank lines, you can write a universal parser, rendering this entire convo thread obsolete! :)
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    And in your case, I'd argue the 'platform' is you using editing tools that may have differing line endings, hence you getting your diff. But if you're using a known format, for instance from another system, and not something manually edited, then there's no need to plan for that case and you can increase throughput of processing by not doing so. Again, we're not arguing the same point! Time and place. If you're taking in user-editable files, then I 100% agree with you. But if you're taking in system-generated files from a known system on the same platform, then I stand by my original statement. :)
  • Konrad Rudolph
    Konrad Rudolph over 5 years
    @MarqueIV No, nothing was mangled. The files have different (but internally consistent) line endings because they were created by different people, on different platforms. Yet they end up on my machine. And I want to emphasise that we are very much arguing the same point, because I’m fundamentally not understanding where your potential use-case exists. I simply don’t see when it would be more useful, and produce fewer problems, to split on a platform hard-coded newline rather than using my heuristic, which I (and clearly many others) have found to work in 100% of real files.
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    "Created by different people, on different platforms". That is a different use-case than something say from a web service where the line endings are predictable and consistent. And if that system is on the same platform, then you can use Environment.NewLine and crush the performance of RegEx. Again, time and place. I plan for, but don't implement solutions for things until they happen. Just like the code, developer productivity is also increased.
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    To hopefully appease you, if you're saying you need a system that has to detect blank lines, and you are taking files created on platforms with differing line endings, and you're guaranteeing you will never get '\r' by itself and/or your line endings will be consistent in the same file (which you can't if it's edited on machines with two different line endings and all line endings aren't updated), then I agree... the regex works. But I'm saying if you can't make those guarantees, it won't because you then won't be able to differentiate between '\n\r' and '\n' and '\r'. Make sense?
  • Mark A. Donohoe
    Mark A. Donohoe over 5 years
    In fairness, nothing will work in that case, not just RegEx because there is no standard for the line endings on the parser, which brings me back to one of my earlier points, if you are saying blank lines are important to you, then you must define what represents a blank line or you can't answer the above question (without those other guarantees that is.)
  • Mic
    Mic over 5 years
    More precision might help: it is not possible to write a parser to handle a combination of all cases; the RE here will handle combinations of any two cases in one file.
  • Uwe Keim
    Uwe Keim over 5 years
    Any idea in terms of performance (compared to string.Split or Regex.Split)?
  • Mike Rosoft
    Mike Rosoft about 5 years
    @ΩmegaMan: That will lose empty lines, e.g. \n\n.
  • Konstantin Spirin
    Konstantin Spirin over 3 years
    Interesting! Should it implement IEnumerable<>?
  • Alielson Piffer
    Alielson Piffer over 2 years
    I like this solution a lot, but I found a minor problem: when the last line is empty, it's ignored (only the last one). So, "example" and "example\r\n" will both produce only one line while "example\r\n\r\n" will produce two lines. This behavior is discussed here: github.com/dotnet/runtime/issues/27715