Java's Scanner vs String.split() vs StringTokenizer; which should I use?

Solution 1

I measured these approaches in a single-threaded setup, and here are the results I got.

Time metrics (ms)

 StringTokenizer | String.split() | while + substring | Scanner | Scanner with compiled Pattern
-----------------+----------------+-------------------+---------+-------------------------------
       4.0       |      5.1       |        1.2        |   0.5   |              0.1
       4.4       |      4.8       |        1.1        |   0.1   |              0.1
       3.5       |      4.7       |        1.2        |   0.1   |              0.1
       3.5       |      4.7       |        1.1        |   0.1   |              0.1
       3.5       |      4.7       |        1.1        |   0.1   |              0.1

The outcome is that Scanner gave the best performance in this test. The same comparison still needs to be run in a multithreaded mode. One of my seniors says that StringTokenizer causes a CPU spike and String.split() does not.
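The benchmark code itself is not shown above. As a rough sketch of how the five columns might be compared, the class below implements each approach and times it with System.nanoTime(). The class name, method names, input line, and repetition count are all hypothetical, and a plain loop like this is sensitive to JIT warm-up, so the numbers it prints should be treated as indicative at best (a harness such as JMH would be more reliable). Note also that the approaches differ on empty fields: StringTokenizer skips them and split() drops trailing ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical comparison harness; not the code behind the table above.
class TokenizeComparison {
    private static final Pattern TILDE = Pattern.compile("~");

    static List<String> withTokenizer(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "~");
        while (st.hasMoreTokens()) out.add(st.nextToken());
        return out;
    }

    static List<String> withSplit(String line) {
        return new ArrayList<>(Arrays.asList(line.split("~")));
    }

    // "while + substring": walk the string with indexOf and cut out each token.
    static List<String> withSubstring(String line) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = line.indexOf('~', start)) >= 0) {
            out.add(line.substring(start, idx));
            start = idx + 1;
        }
        out.add(line.substring(start));
        return out;
    }

    // Scanner with a pre-compiled delimiter pattern.
    static List<String> withScanner(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(line).useDelimiter(TILDE)) {
            while (sc.hasNext()) out.add(sc.next());
        }
        return out;
    }

    // Total elapsed nanoseconds for reps calls of f on the same line.
    static long timeNanos(Function<String, List<String>> f, String line, int reps) {
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) f.apply(line);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        String line = "alpha~beta~gamma~delta"; // illustrative input
        int reps = 100_000;                      // illustrative repetition count
        System.out.println("StringTokenizer ns: " + timeNanos(TokenizeComparison::withTokenizer, line, reps));
        System.out.println("String.split ns:    " + timeNanos(TokenizeComparison::withSplit, line, reps));
        System.out.println("substring ns:       " + timeNanos(TokenizeComparison::withSubstring, line, reps));
        System.out.println("Scanner ns:         " + timeNanos(TokenizeComparison::withScanner, line, reps));
    }
}
```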

Solution 2

To process the file line by line you can use Scanner, and to get the tokens from each line you can use split().

import java.io.File;
import java.util.Scanner;

// loc is the path of the input file; try-with-resources closes the Scanner
try (Scanner scanner = new Scanner(new File(loc))) {
    while (scanner.hasNextLine()) {
        String[] tokens = scanner.nextLine().split("~");
        // do the processing for tokens here
    }
}

Solution 3

You can use the useDelimiter("~") method to let you iterate through the tokens on each line with hasNext()/next(), while still using hasNextLine()/nextLine() to iterate through the lines themselves.
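One caveat: with the delimiter set to "~" alone, next() does not stop at line breaks, so on a single Scanner the last token of one line and the first token of the next would come back joined. A minimal sketch of one way around that, assuming a second, short-lived Scanner per line (the class and method names are hypothetical, and the input is passed as a String only to keep the example self-contained):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: an outer Scanner walks the lines, an inner one walks the tokens.
class LineTokenScanner {
    static List<List<String>> tokenize(String text) {
        List<List<String>> rows = new ArrayList<>();
        try (Scanner lines = new Scanner(text)) {
            while (lines.hasNextLine()) {
                List<String> row = new ArrayList<>();
                // Inner Scanner sees only one line, so "~" is the only separator it needs.
                try (Scanner tokens = new Scanner(lines.nextLine()).useDelimiter("~")) {
                    while (tokens.hasNext()) row.add(tokens.next());
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a~b~c\nd~e"));
    }
}
```

Creating a Scanner per line has its own cost, so whether this beats split() on each line is something to measure rather than assume.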

EDIT: If you're going to do a performance comparison, you should pre-compile the regex when you do the split() test:

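import java.util.regex.Pattern;

Pattern splitRegex = Pattern.compile("~");  // compile once, outside the loop
String line;
// bufferedReader is an already-open java.io.BufferedReader
while ((line = bufferedReader.readLine()) != null)
{
  String[] tokens = splitRegex.split(line);
  // etc.
}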

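If you use String#split(String regex), the regex is in general recompiled on every call. (Scanner caches the patterns it compiles.) One exception: OpenJDK 7 and later give String.split() a fast path for a single-character delimiter that is not a regex metacharacter, such as "~", so no regex is compiled in that case. With a pre-compiled pattern, I wouldn't expect to see much difference in performance between the two.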

Solution 4

I would say split() is fastest, and probably good enough for what you're doing. It is less flexible than Scanner, though. StringTokenizer is a legacy class (not formally deprecated, but discouraged in new code) that is retained only for backwards compatibility, so don't use it.

EDIT: You could always test both implementations to see which one is faster. I'm curious myself whether Scanner could be faster than split(). split() might be faster than Scanner for a given input size, but I can't be certain of that.

Solution 5

You don't actually need a regex here, because you are splitting on a fixed string. StringUtils.split() from Apache Commons Lang splits on plain strings.

For high volume splits, where the splitting is the bottleneck, rather than say file IO, I've found this to be up to 10 times faster than String.split(). However, I did not test it against a compiled regex.

Guava also has a splitter, implemented in a more OO way, but I found it was significantly slower than StringUtils for high volume splits.
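To illustrate the general idea of splitting on a fixed string without any regex, here is a minimal plain-Java sketch built on indexOf() and substring(). It is not the StringUtils or Guava implementation, and its handling of empty fields (it keeps them) is a choice made for this example; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: literal (non-regex) split that keeps empty fields.
class PlainSplit {
    static List<String> split(String s, String delim) {
        if (delim.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> out = new ArrayList<>();
        int start = 0;
        int idx;
        // Find each occurrence of the literal delimiter and cut out the text before it.
        while ((idx = s.indexOf(delim, start)) >= 0) {
            out.add(s.substring(start, idx));
            start = idx + delim.length();
        }
        out.add(s.substring(start)); // remainder after the last delimiter
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a~b~c", "~"));
        System.out.println(split("a|b|c", "|"));
    }
}
```

Because the delimiter is treated literally, characters such as "|" or "." need no escaping here, whereas String.split() would require Pattern.quote() or a backslash escape for them.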

Author by

Admin

Updated on June 15, 2020

Comments

  • Admin
    Admin almost 4 years

    I am currently using split() to scan through a file where each line has a number of strings delimited by '~'. I read somewhere that Scanner could do a better job with a long file, performance-wise, so I thought about checking it out.

    My question is: Would I have to create two instances of Scanner? That is, one to read a line and another one based on the line to get tokens for a delimiter? If I have to do so, I doubt if I would get any advantage from using it. Maybe I am missing something here?

  • Leo
    Leo about 15 years
    I agree that StringTokenizer is possibly deprecated, but I did not find it in the list of deprecated classes for j2se5 and java6. Why?
  • CookieOfFortune
    CookieOfFortune about 15 years
    You're right, it isn't. But from the API: StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.