Most efficient way of splitting String in Java


Solution 1

StringTokenizer is generally faster than String.split(), because it does not go through the regex engine.

import java.util.StringTokenizer;

public class StringTokenizerDemo {
    public static void main(String[] args) {

        String str = "This is String , split by StringTokenizer, created by me";

        // With no delimiter argument, tokens are separated by whitespace.
        StringTokenizer st = new StringTokenizer(str);

        System.out.println("---- Split by space ------");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }

        // Each character of the second argument is treated as a delimiter.
        System.out.println("---- Split by comma ',' ------");
        StringTokenizer st2 = new StringTokenizer(str, ",");

        while (st2.hasMoreTokens()) {
            System.out.println(st2.nextToken());
        }
    }
}

Solution 2

This is the method I use for splitting large (1GB+) tab-separated files. It is limited to a char delimiter to avoid any overhead of additional method invocations (which may be optimized out by the runtime), but it can be easily converted to String-delimited. I'd be interested if anyone can come up with a faster method or improvements on this method.

public static String[] split(final String line, final char delimiter)
{
    // Worst case: every character is a delimiter, which yields
    // line.length() + 1 (possibly empty) substrings.
    String[] temp = new String[line.length() + 1];
    int wordCount = 0;
    int i = 0;
    int j = line.indexOf(delimiter, 0); // first substring

    while (j >= 0)
    {
        temp[wordCount++] = line.substring(i, j);
        i = j + 1; // resume the search just past the delimiter
        j = line.indexOf(delimiter, i); // rest of substrings
    }

    temp[wordCount++] = line.substring(i); // last substring

    // Trim the oversized working array down to the actual number of substrings.
    String[] result = new String[wordCount];
    System.arraycopy(temp, 0, result, 0, wordCount);

    return result;
}
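
The answer mentions that the method can be converted to a String delimiter. Here is a minimal sketch of one way that might look, assuming a non-empty delimiter. The class name StringDelimiterSplit and the use of an ArrayList instead of a preallocated array are my own choices for illustration, not the answer author's code, and I have not benchmarked it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is illustrative only.
final class StringDelimiterSplit {

    // Splits line at every occurrence of a non-empty, possibly multi-character delimiter.
    static String[] split(final String line, final String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        final List<String> parts = new ArrayList<>();
        int start = 0;
        int end = line.indexOf(delimiter, start);
        while (end >= 0) {
            parts.add(line.substring(start, end));
            start = end + delimiter.length(); // skip the whole delimiter, not just one char
            end = line.indexOf(delimiter, start);
        }
        parts.add(line.substring(start)); // text after the last delimiter
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Input taken from the question; prints Two, Three and Four on separate lines.
        for (String part : split("Two;.Three;.Four", ";.")) {
            System.out.println(part);
        }
    }
}
```

The only real change from the char version is advancing by delimiter.length() instead of 1 after each match.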

Solution 3

If you want the ultimate in efficiency, I wouldn't use Strings at all, let alone split them. I would do what compilers do: process the file a character at a time. Use a BufferedReader with a large buffer size, say 128 KB, and read a char at a time, accumulating the chars into, say, a StringBuilder until you reach a ; or a line terminator.
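
A rough sketch of that idea, under some assumptions of mine: the class and method names are hypothetical, the delimiter is a single char, empty tokens are skipped, and each token is handed to a callback instead of being collected, so the whole file never has to sit in memory. It does not handle the two-character ;. delimiter from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

// Hypothetical helper class; the name is illustrative only.
final class CharAtATimeSplitter {

    // Reads source one char at a time and passes each non-empty token to sink.
    // A token ends at the delimiter, at '\n', at '\r', or at end of input.
    static void forEachToken(Reader source, char delimiter, Consumer<String> sink)
            throws IOException {
        StringBuilder current = new StringBuilder();
        // BufferedReader sizes its buffer in chars; 128 * 1024 follows the answer's suggestion.
        try (BufferedReader reader = new BufferedReader(source, 128 * 1024)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == delimiter || c == '\n' || c == '\r') {
                    if (current.length() > 0) {
                        sink.accept(current.toString());
                        current.setLength(0); // reuse the same builder for the next token
                    }
                } else {
                    current.append((char) c);
                }
            }
        }
        if (current.length() > 0) {
            sink.accept(current.toString()); // last token without a trailing delimiter
        }
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a FileReader over the real input file.
        forEachToken(new StringReader("Two;Three\nFour"), ';', System.out::println);
    }
}
```

For a real file you would pass a FileReader (or an InputStreamReader with an explicit charset) instead of the StringReader.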


Author: user92038111111

Updated on June 04, 2022

Comments

  • user92038111111
    user92038111111 almost 2 years

    For the sake of this question, let's assume I have a String containing the values Two;.Three;.Four (and so on), where the two-character sequence ;. separates the elements.

    Now I know there are multiple ways of splitting a string, such as split() and StringTokenizer (the faster of the two, and it works well), but my input file is around 1GB and I am looking for something slightly more efficient than StringTokenizer.

    After some research, I found that indexOf and substring are quite efficient, but the examples I found either use a single-character delimiter or return only a single word/element.

    Sample code using indexOf and substring:

    String s = "quick,brown,fox,jumps,over,the,lazy,dog";
    int from = s.indexOf(',');
    int to = s.indexOf(',', from+1);
    String brown = s.substring(from+1, to);
    

    The above works for printing brown, but how can I use indexOf and substring to split a line with multiple delimiters and display all the items as below?

    Expected output

    Two
    Three
    Four
    ....and so on
    
    • Buhake Sindi
      Buhake Sindi about 9 years
      What are you trying to achieve? Have you run tests on various test cases to see which is "efficient"?
    • Prashant
      Prashant about 9 years
      Just loop; indexOf() takes a start parameter, which should be the position just past the last found index.
  • user92038111111
    user92038111111 about 9 years
    Okay will give this a try and report back. Thanks
  • user207421
    user207421 about 7 years
    @AvinashRaj Your comment has nothing to do with my answer. Don't post irrelevant comments here.
  • user207421
    user207421 about 7 years
    @AvinashRaj That doesn't have anything more to do with my answer than your previous comment.
  • Sport
    Sport about 3 years
    You can further improve this by obtaining all the indexes at once, as indexOf loops through the String
  • Parker
    Parker about 3 years
    @Sport Inside the loop, I start each search after the index of the previous occurrence (line.indexOf(delimiter, i)), so each character is only checked once. I could probably write an inline version of indexOf(char, int) to avoid the overhead of repeated method invocation.
  • Yonathan W'Gebriel
    Yonathan W'Gebriel almost 3 years
    According to the JDK docs, StringTokenizer has been considered a legacy class for a while now. The recommendation is to use String.split or the java.util.regex package instead.
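
For the ;. delimiter in the question, the split/regex route from the last comment might look like the sketch below. The class name is hypothetical. Pattern.quote is used because an unquoted . matches any character in a regex, and the pattern is compiled once so it can be reused for every line. I make no claim about how its speed compares with the indexOf approach.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
final class RegexSplitExample {

    // Quote the delimiter so ';' and '.' are matched literally; compile once and reuse.
    private static final Pattern DELIMITER = Pattern.compile(Pattern.quote(";."));

    static String[] split(String line) {
        // Note: Pattern.split drops trailing empty strings from the result.
        return DELIMITER.split(line);
    }

    public static void main(String[] args) {
        // Input taken from the question; prints Two, Three and Four on separate lines.
        for (String part : split("Two;.Three;.Four")) {
            System.out.println(part);
        }
    }
}
```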