Does any one know of a faster method to do String.Split()?

37,592

Solution 1

It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas in the file eg:

1,"Something, with a comma",2,3

The other thing I'll point out without knowing how you profiled is be careful about profiling this kind of low level detail. The granularity of the Windows/PC timer might come into play and you may have a significant overhead in just looping so use some sort of control value.

That being said, split() is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.

So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.

The algorithm for that is relatively simple:

  • Stop at every comma;
  • When you hit quotes continue until you hit the next set of quotes;
  • Handle escaped quotes (ie \") and arguably escaped commas (\,).

Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.

So if you really want performance, it's time to hand parse.

Or, better yet, use someone else's optimized solution like this fast CSV reader.

By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.

Solution 2

The BCL implementation of string.Split is actually quite fast, I've done some testing here trying to out preform it and it's not easy.

But there's one thing you can do and that's to implement this as a generator:

public static IEnumerable<string> GetSplit( this string s, char c )
{
    int l = s.Length;
    int i = 0, j = s.IndexOf( c, 0, l );
    if ( j == -1 ) // No such substring
    {
        yield return s; // Return original and break
        yield break;
    }

    while ( j != -1 )
    {
        if ( j - i > 0 ) // Non empty? 
        {
            yield return s.Substring( i, j - i ); // Return non-empty match
        }
        i = j + 1;
        j = s.IndexOf( c, i, l - i );
    }

    if ( i < l ) // Has remainder?
    {
        yield return s.Substring( i, l - i ); // Return remaining trail
    }
}

The above method is not necessarily faster than string.Split for small strings but it returns results as it finds them, this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.

The above method is bounded by the performance of IndexOf and Substring which does too much index of out range checking and to be faster you need to optimize away these and implement your own helper methods. You can beat the string.Split performance but it's gonna take cleaver int-hacking. You can read my post about that here.

Solution 3

Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.

string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();

Trade result = new Trade();

index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

index = span.IndexOf(';');
result.Quantity = float.Parse(span.Slice(0, index));

return result;

Note that a ReadOnlySpan.Split() will soon be part of the framework. See https://github.com/dotnet/runtime/pull/295

Solution 4

Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do since it sounds like you are parsing lines from a file) String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.

Hope this helps

Solution 5

I found this implementation which is 30% faster from Dejan Pelzel's blog. I qoute from there:

The Solution

With this in mind, I set to create a string splitter that would use an internal buffer similarly to a StringBuilder. It uses very simple logic of going through the string and saving the value parts into the buffer as it goes along.

public int Split(string value, char separator)
{
    int resultIndex = 0;
    int startIndex = 0;

    // Find the mid-parts
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] == separator)
        {
            this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
            resultIndex++;
            startIndex = i + 1;
        }
    }

    // Find the last part
    this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
    resultIndex++;

    return resultIndex;

How To Use

The StringSplitter class is incredibly simple to use as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for a single time use. In this case it would be better to juse use the built in String.Split.

var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
    Console.WriteLine("It works!");
}

The Split methods returns the number of items found, so you can easily iterate through the results like this:

var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
    Console.WriteLine(splitter.Results[i]);
}

This approach has advantages and disadvantages.

Share:
37,592

Related videos on Youtube

Admin
Author by

Admin

Updated on July 08, 2020

Comments

  • Admin
    Admin almost 4 years

    I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:

    values = line.Split(delimiter);
    

    where line is the a string that holds the values that are seperated by the delimiter.

    Measuring the performance of my ReadNextRow method I noticed that it spends 66% on String.Split, so I was wondering if someone knows of a faster method to do this.

    Thanks!

    • Admin
      Admin about 15 years
      Thanks for all the replies. Sorry my comment did not out right since this comment field seems to ignore new lines.
  • Dave Van den Eynde
    Dave Van den Eynde about 15 years
    Apparently, there is no need to save memory, but there is a need to save CPU.
  • Dave Van den Eynde
    Dave Van den Eynde about 15 years
    It will cause string data copies once you need to pass a StringShim to something that accepts a string. Unless your whole app works with shims instead.
  • Lordn__n
    Lordn__n about 15 years
    You can't assume that at all. I'll dig up the example using regex vs hand-coding where the regex solution was an order of magnitude slower.
  • John Leidegren
    John Leidegren about 15 years
    I've linked an answer on a similar topic about string replace methods, you'll find the link at the end of my own answer to this question.
  • John Leidegren
    John Leidegren about 15 years
    @Dave Van den Eynde - I think it's important to do both! But yeah, memory optimization are greatly overlooked by most programmers.
  • MSalters
    MSalters about 15 years
    My point is that it's hard to be faster with the same interface. My StringShim solution is quite explicit changing the split() interface to make things faster.
  • torial
    torial about 15 years
    I did an approach similar to this, and it was slower than the existing algorithm that used Split, but because we were processing such large strings (multiple Megabytes), it saved about 30% on the RAM consumption.
  • John Leidegren
    John Leidegren about 15 years
    You know, that code isn't optimized, and the reason why string.Split is faster is because it uses unsafe code. If you include that here, the running time is the same. Except this is a lot more memory efficent.
  • kemiller2002
    kemiller2002 about 13 years
    I just wanted to say thanks. You reaffirmed what I thought, and forced me to go through my code again and look through where I was being inefficient. Turns out I had a conditional statement in the wrong order, and I think I would have just called it a day without seeing your post.
  • Gary Brunton
    Gary Brunton about 10 years
    I know this is old but I thought I would point out that this solution appears to be removing empty items from the returned collection. Calling "1,,3".GetSplit(',') returns a collection containing only 2 items. A 1 and a 3. This is different behavior then .net's split method.
  • John Leidegren
    John Leidegren about 10 years
    Yeah, that's by design. If you don't want that behavior by default you'd have to adjust the code.
  • Bhargav Rao
    Bhargav Rao about 8 years
    Whilst this may theoretically answer the question, it would be preferable to include the essential parts of the answer here, and provide the link for reference.
  • Peter Hlavatík
    Peter Hlavatík almost 8 years
    In excel generated csv, escaped quotes are "", not \"
  • Krythic
    Krythic over 7 years
    Almost every .NET function is designed for multiple-case scenarios, thus if you can be certain of the data, you can build a tailored function that will always perform better than the default .NET implementation. I downvoted your answer because reinventing the wheel is not always a bad thing, despite what the internet would like to see you regurgitate.
  • Stanislav Prusac
    Stanislav Prusac over 7 years
    Article is deleted, this is a archived version of article on dotnetperls.com : web.archive.org/web/20090316210342/http://dotnetperls.com/…
  • OzBob
    OzBob about 7 years
    It's back on dotnetperls: dotnetperls.com/split My findings: 10000000 Regex.split's are 10% slower than 10000000 string.Split's (.net framework 4)
  • Akmal Salikhov
    Akmal Salikhov almost 5 years
    What about nowadays and Span<T>?
  • Joe Phillips
    Joe Phillips over 3 years
    Very clever! Exactly the type of situation this method was made for, I imagine