Java String split performance


Solution 1

String.split(String) won't create a regex if your pattern is only one character long. When splitting by a single character, it uses specialized code which is pretty efficient. StringTokenizer is not much faster in this particular case.

This was introduced in OpenJDK7/OracleJDK7. Here's a bug report and a commit. I've made a simple benchmark here.


$ java -version
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

$ java Split
split_banthar: 1231
split_tskuzzy: 1464
split_tskuzzy2: 1742
string.split: 1291
StringTokenizer: 1517
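
One caveat, as the comments below point out: the fast path only applies when the single-character delimiter is not a regex metacharacter. A minimal illustration (assuming JDK 7 or later; the variable names are only for the example):

// Fast path: one-char delimiter that is not a regex metacharacter.
String[] a = "1/2/3".split("/");

// Also the fast path: two chars, a backslash followed by a non-alphanumeric character.
String[] b = "1.2.3".split("\\.");

// No fast path: '|' is a metacharacter, so a Pattern is compiled on every call.
// Precompiling it once (java.util.regex.Pattern) avoids the repeated cost.
Pattern pipe = Pattern.compile("\\|");
String[] c = pipe.split("a|b|c");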

Solution 2

If you can use third-party libraries, Guava's Splitter doesn't incur the overhead of regular expressions when you don't ask for it, and is very fast as a general rule. (Disclosure: I contribute to Guava.)

Iterable<String> split = Splitter.on('/').split(string);

(Also, Splitter is as a rule much more predictable than String.split.)
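
For example (assuming Guava 15 or later, which added splitToList), a Splitter can be configured once and reused, since it is immutable and thread-safe:

// Build the Splitter once; reuse it across calls.
Splitter slashSplitter = Splitter.on('/').omitEmptyStrings();
List<String> ids = slashSplitter.splitToList("1/2/3"); // ["1", "2", "3"]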

Solution 3

StringTokenizer is much faster for simple parsing like this (I did some benchmarking with it a while back and you get huge speedups).

StringTokenizer st = new StringTokenizer("1/2/3", "/");
String[] arr = new String[st.countTokens()];
for (int i = 0; st.hasMoreTokens(); i++) {
    arr[i] = st.nextToken();
}

If you want to eke out a little more performance, you can do it manually as well:

String s = "1/2/3"
char[] c = s.toCharArray();
LinkedList<String> ll = new LinkedList<String>();
int index = 0;

for(int i=0;i<c.length;i++) {
    if(c[i] == '/') {
        ll.add(s.substring(index,i));
        index = i+1;
    }
}

String[] arr = ll.size();
Iterator<String> iter = ll.iterator();
index = 0;

for(index = 0; iter.hasNext(); index++)
    arr[index++] = iter.next();
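
A variant of the same idea (not from the original answer) walks the string with indexOf instead of copying it to a char array; splitOnSlash is a hypothetical helper name, using java.util.ArrayList:

// Manual split on '/' using indexOf: no regex machinery, no toCharArray() copy.
public static List<String> splitOnSlash(String s) {
    List<String> parts = new ArrayList<String>();
    int start = 0;
    int slash;
    while ((slash = s.indexOf('/', start)) != -1) {
        parts.add(s.substring(start, slash));
        start = slash + 1;
    }
    parts.add(s.substring(start)); // trailing segment after the last '/'
    return parts;
}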

Solution 4

Seeing as I am working at large scale, I thought it would help to provide some more benchmarking, including a few of my own implementations (I split on spaces, but this should illustrate how long it takes in general):

I'm working with a 426 MB file with 2,622,761 lines. The only whitespace characters are normal spaces (" ") and newlines ("\n").

First I replace all newlines with spaces and benchmark parsing one huge line:

.split(" ")
Cumulative time: 31.431366952 seconds

.split("\s")
Cumulative time: 52.948729489 seconds

splitStringChArray()
Cumulative time: 38.721338004 seconds

splitStringChList()
Cumulative time: 12.716065893 seconds

splitStringCodes()
Cumulative time: 1 minute, 21.349029036 seconds

splitStringCharCodes()
Cumulative time: 23.459840685 seconds

StringTokenizer
Cumulative time: 1 minute, 11.501686095 seconds

Then I benchmark splitting line by line (meaning that the functions and loops are done many times, instead of all at once):

.split(" ")
Cumulative time: 3.809014174 seconds

.split("\s")
Cumulative time: 7.906730124 seconds

splitStringChArray()
Cumulative time: 4.06576739 seconds

splitStringChList()
Cumulative time: 2.857809996 seconds

Bonus: splitStringChList(), but creating a new StringBuilder every time (the average difference is actually more like .42 seconds):
Cumulative time: 3.82026621 seconds

splitStringCodes()
Cumulative time: 11.730249921 seconds

splitStringCharCodes()
Cumulative time: 6.995555826 seconds

StringTokenizer
Cumulative time: 4.500008172 seconds

Here is the code:

// Use a char array, and count the number of instances first.
public static String[] splitStringChArray(String str, StringBuilder sb) {
    char[] strArray = str.toCharArray();
    int count = 0;
    for (char c : strArray) {
        if (c == ' ') {
            count++;
        }
    }
    String[] splitArray = new String[count + 1];
    int i = 0;
    for (char c : strArray) {
        if (c == ' ') {
            splitArray[i++] = sb.toString();
            sb.delete(0, sb.length());
        } else {
            sb.append(c);
        }
    }
    splitArray[i] = sb.toString(); // the word after the last space
    sb.delete(0, sb.length());
    return splitArray;
}

// Use a char array but create an ArrayList, and don't count beforehand.
public static ArrayList<String> splitStringChList(String str, StringBuilder sb) {
    ArrayList<String> words = new ArrayList<String>();
    words.ensureCapacity(str.length() / 5);
    char[] strArray = str.toCharArray();
    for (char c : strArray) {
        if (c == ' ') {
            words.add(sb.toString());
            sb.delete(0, sb.length());
        } else {
            sb.append(c);
        }
    }
    if (sb.length() > 0) { // don't drop the word after the last space
        words.add(sb.toString());
        sb.delete(0, sb.length());
    }
    return words;
}

// Using an iterator through code points and returning an ArrayList.
public static ArrayList<String> splitStringCodes(String str) {
    ArrayList<String> words = new ArrayList<String>();
    words.ensureCapacity(str.length() / 5);
    IntStream is = str.codePoints();
    PrimitiveIterator.OfInt it = is.iterator();
    int cp;
    StringBuilder sb = new StringBuilder();
    while (it.hasNext()) {
        cp = it.nextInt();
        if (cp == 32) { // space
            words.add(sb.toString());
            sb.delete(0, sb.length());
        } else {
            sb.appendCodePoint(cp); // append the character itself, not its numeric value
        }
    }
    if (sb.length() > 0) {
        words.add(sb.toString());
    }

    return words;
}

// This one is for compatibility with supplementary (surrogate-pair) characters, by using Character.codePointAt().
public static ArrayList<String> splitStringCharCodes(String str, StringBuilder sb) {
    char[] strArray = str.toCharArray();
    ArrayList<String> words = new ArrayList<String>();
    words.ensureCapacity(str.length() / 5);
    int cp;
    int len = strArray.length;
    for (int i = 0; i < len; ) {
        cp = Character.codePointAt(strArray, i);
        if (cp == ' ') {
            words.add(sb.toString());
            sb.delete(0, sb.length());
        } else {
            sb.appendCodePoint(cp); // append the code point as a character
        }
        i += Character.charCount(cp); // step over both chars of a surrogate pair
    }
    if (sb.length() > 0) {
        words.add(sb.toString());
        sb.delete(0, sb.length());
    }

    return words;
}

This is how I used StringTokenizer:

    StringTokenizer tokenizer = new StringTokenizer(file.getCurrentString());
    words = new String[tokenizer.countTokens()];
    int i = 0;
    while (tokenizer.hasMoreTokens()) {
        words[i] = tokenizer.nextToken();
        i++;
    }
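
The answer doesn't show the timing harness itself; here is a minimal sketch of how such cumulative timings can be gathered (the variable names are illustrative, not the author's code):

// Accumulate nanoTime deltas around the splitting call only, then report seconds.
long cumulativeNanos = 0;
for (String line : lines) { // 'lines' is assumed to hold the file's 2,622,761 input lines
    long start = System.nanoTime();
    String[] words = line.split(" "); // or any of the split variants above
    cumulativeNanos += System.nanoTime() - start;
}
System.out.println("Cumulative time: " + (cumulativeNanos / 1e9) + " seconds");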

Solution 5

java.util.StringTokenizer(String str, String delim) is about twice as fast according to this post.

However, unless your application is of a gigantic scale, split should be fine for you (cf. the same post, which cites splitting thousands of strings in a few milliseconds).
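
If the delimiter genuinely has to be a regular expression (so no fast path applies), the compilation can at least be hoisted out and the Pattern reused; a small sketch, not from the cited post:

// Compile once and reuse, instead of letting String.split recompile on every call.
Pattern whitespace = Pattern.compile("\\s+");
String[] words = whitespace.split("a b\tc"); // ["a", "b", "c"]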

Updated on July 09, 2022

Comments

  • Matthieu Napoli
    Matthieu Napoli almost 2 years

    Here is the current code in my application:

    String[] ids = str.split("/");
    

    When profiling the application, I see that a non-negligible amount of time is spent splitting strings. Also, the split method takes a regular expression, which is superfluous here.

    What alternative can I use in order to optimize the string splitting? Is StringUtils.split faster?

    (I would've tried and tested myself but profiling my application takes a lot of time.)

  • Nandkumar Tekale
    Nandkumar Tekale almost 12 years
    StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
  • tskuzzy
    tskuzzy almost 12 years
    Just because it's legacy doesn't mean it's not useful. And in fact, this particular class is actually very useful for that extra performance boost so I am actually against this "legacy" label.
  • Louis Wasserman
    Louis Wasserman almost 12 years
    The split method of String and the java.util.regex package incur the significant overhead of using regexes. StringTokenizer does not.
  • Voo
    Voo almost 12 years
    @unknown Considering that the whole post is about the simple fact that split is horribly inefficient because of its additional complexity (it is; I had a simple parser that used split heavily, and apart from IO, split dominated the whole thing), what exactly do you propose?
  • Nandkumar Tekale
    Nandkumar Tekale almost 12 years
    @tskuzzy it doesn't matter whether you are against the "legacy" label or not; as the Javadoc says, its use is discouraged.
  • Matthieu Napoli
    Matthieu Napoli almost 12 years
    Oh OK I'm using Java 5 (unfortunately yeah, can't change that)
  • Atul Darne
    Atul Darne over 10 years
    How about using a regex?
  • WestCoastProjects
    WestCoastProjects over 10 years
    @NandkumarTekale Why the dogmatic "legacy" discussion? If ST is much faster, it is much faster. Until the "recommended" code is equal or better in speed, the label is potentially harmful.
  • Nandkumar Tekale
    Nandkumar Tekale over 10 years
    @javadba: I would like you to go through the StringTokenizer Javadoc; search for "legacy" using Ctrl+F.
  • WestCoastProjects
    WestCoastProjects over 10 years
    @NandkumarTekale You apparently did not understand my point. But if you want to avoid using "legacy" classes in favor of "slow" ones, that is your choice.
  • rupps
    rupps over 9 years
    It doesn't take a gigantic-scale application; a split in a tight loop, such as in a document parser, is enough, and frequent. Think of typical routines for parsing Twitter links, emails, hashtags... They are fed MB of text to parse. The routine itself may be only a few dozen lines but will be called hundreds of times per second.
  • John Humphreys
    John Humphreys over 9 years
    This made a very significant difference for me while using it on the lines from a large file.
  • sirvon
    sirvon over 9 years
    This post recommends against using Iterable; even Guava's team lead says so: alexruiz.developerblogs.com/?p=2519
  • Yossi Farjoun
    Yossi Farjoun over 7 years
    Thanks for this benchmark. Your code is "unfair" though, since the StringTokenizer part avoids creating a List and converting it to an array... great starting point though!
  • Systemsplanet
    Systemsplanet almost 6 years
    splitStringChList discards the last string. Add before the return: if (sb.length() > 0) words.add(sb.toString()); Also, replace sb.delete(0, sb.length()); with sb.setLength(0); and remove the unused int i=0;
  • andrii
    andrii almost 6 years
    To avoid regex creation inside the split method, having a pattern one char long isn't enough. That char also must not be one of the regex metacharacters ".$|()[{^?*+\\"; e.g. split(".") will create/compile a regex pattern. (Verified on JDK 8 at least.)
  • madhairsilence
    madhairsilence over 4 years
    @NandkumarTekale The reason it's called legacy is that you cannot use StringTokenizer for complex splitting. ST just takes a delimiter and splits; even if you repeat the delimiter n times, it treats them as just one. The only reason it's not deprecated is that there is no serious security or memory flaw in it. On paper, even deprecated code is discouraged but not blocked.
  • Luke
    Luke over 4 years
    Also, you should just make a string from a range in the char array rather than using a StringBuilder. I don't find your implementation to be faster than split in Java 11.
  • peq
    peq over 3 years
    The blog entry has vanished but there is a snapshot available in the internet archive.
  • David Bradley
    David Bradley about 3 years
    In my version of Java 8 it does. From the split implementation comment: fastpath if the regex is a (1) one-char String and this character is not one of the RegEx's meta characters ".$|()[{^?*+\\", or (2)two-char String and the first char is the backslash and the second is not the ascii digit or ascii letter.
  • David Bradley
    David Bradley about 3 years
    Adding a qualification: if you just put in, say, "|", that's going to be treated as a regular expression, but "\\|" is not treated as a regular expression. That confused me a bit at first.
  • marcolopes
    marcolopes over 2 years
    At least split_banthar (tested with copy/paste code) does NOT have the same behaviour as the Java split...