Number of lines in a file in Java

Solution 1

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
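
A minimal way to call it (a hypothetical driver, not part of the original answer; the file to count is taken from the command line):

// Hypothetical driver: prints the line count of the file named on the command line.
public static void main(String[] args) throws IOException {
    System.out.println(countLinesOld(args[0]) + " lines in " + args[0]);
}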

EDIT, 9 1/2 years later: I have practically no Java experience, but anyway I have tried to benchmark this code against the LineNumberReader solution below, since it bothered me that nobody had done it. It seems that, especially for large files, my solution is faster, though it takes a few runs until the optimizer does a decent job. I've played a bit with the code and produced a new version that is consistently fastest:

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i=0; i<1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i=0; i<readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

Benchmark results for a 1.3GB text file, y-axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers while countLinesNew has none, and while it's only a bit faster, the difference is statistically significant. LineNumberReader is clearly slower.

Benchmark Plot
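
For reference, a rough sketch of the kind of timing loop described above; the file name and number of runs are placeholders, not the exact benchmark setup:

// Rough timing sketch: repeatedly runs countLinesNew on the same file and
// measures each run with System.nanoTime(). Early runs are effectively warm-up.
public static void benchmark(String filename, int runs) throws IOException {
    for (int run = 0; run < runs; run++) {
        long start = System.nanoTime();
        int lines = countLinesNew(filename);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("run " + run + ": " + lines + " lines in " + elapsedMs + " ms");
    }
}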

Solution 2

I have implemented another solution to this problem; I found it more efficient at counting rows:

int result = 0;

try
(
   FileReader       input = new FileReader("input.txt");
   LineNumberReader count = new LineNumberReader(input);
)
{
   while (count.skip(Long.MAX_VALUE) > 0)
   {
      // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
   }

   result = count.getLineNumber() + 1;                                    // +1 because line index starts at 0
}
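
Wrapped up as a stand-alone method, the same idea might look like this (a sketch; the method name and the filename parameter are mine, not part of the original snippet):

// Sketch: the skip()/getLineNumber() approach as a self-contained method.
public static int countLinesBySkipping(String filename) throws IOException {
    try (LineNumberReader count = new LineNumberReader(new FileReader(filename))) {
        while (count.skip(Long.MAX_VALUE) > 0) {
            // loop in case the file is larger than Long.MAX_VALUE characters
            // or skip() decides not to read the entire file in one call
        }
        return count.getLineNumber() + 1; // +1 because line numbering starts at 0
    }
}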

Solution 3

The accepted answer has an off-by-one error for multi-line files which don't end in a newline. A one-line file ending without a newline would return 1, but a two-line file ending without a newline would also return 1. Here's an implementation of the accepted solution which fixes this. The endsWithoutNewLine checks are wasteful for everything but the final read, but should be trivial time-wise compared to the overall function.

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if(endsWithoutNewLine) {
            ++count;
        } 
        return count;
    } finally {
        is.close();
    }
}
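
A quick way to check the fix (a hypothetical test, not part of the original answer; LineCounter stands in for whichever class holds count()):

// Hypothetical check: a two-line file without a trailing newline should be
// counted as 2 lines rather than 1.
public static void main(String[] args) throws IOException {
    java.nio.file.Path tmp = java.nio.file.Files.createTempFile("count-test", ".txt");
    java.nio.file.Files.write(tmp, "first line\nsecond line".getBytes());
    System.out.println(new LineCounter().count(tmp.toString())); // expected: 2
    java.nio.file.Files.delete(tmp);
}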

Solution 4

With Java 8, you can use streams:

try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
  long numOfLines = lines.count();
  ...
}
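
A self-contained version of the same idea, with the imports spelled out (the file path is a placeholder):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LineCount {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("input.txt"); // placeholder path
        // Files.lines streams the file lazily, so the whole file is never held in memory.
        try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
            System.out.println(lines.count() + " lines");
        }
    }
}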

Solution 5

The answer with the method count() above gave me line miscounts if a file didn't have a newline at the end of the file - it failed to count the last line in the file.

This method works better for me:

public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {
        // just read to the end; LineNumberReader tracks the line number
    }
    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}
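
The same approach can also be written with try-with-resources so the reader is closed even if readLine() throws (a sketch, not part of the original answer; the method name is mine):

// Variant sketch: LineNumberReader with try-with-resources.
public static int countLinesWithResources(String filename) throws IOException {
    try (LineNumberReader reader = new LineNumberReader(new FileReader(filename))) {
        while (reader.readLine() != null) {
            // just read through the file; the reader tracks the current line number
        }
        return reader.getLineNumber();
    }
}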

Author: Mark (Flex, C#, Python, JAVA and hate it)

Updated on July 31, 2020

Comments

  • Mark
    Mark almost 4 years

    I use huge data files; sometimes I only need to know the number of lines in these files. Usually I open them up and read them line by line until I reach the end of the file.

    I was wondering if there is a smarter way to do that.

  • Esko
    Esko over 15 years
    Interesting downvote; no matter what command-line tool you're using, they all DO THE SAME THING anyway, only internally. There's no magic way to figure out the number of lines; they have to be counted by hand. Sure, it can be saved as metadata, but that's a whole other story...
  • martinus
    martinus over 15 years
    You were right, David; I thought the JVM would be good enough for this... I have updated the code, and this one is faster.
  • Alexander Aleksandrovič Klimov
    Alexander Aleksandrovič Klimov over 15 years
    @IainmH, your second suggestion just counts the number of entries in the current directory. Not what was intended? (or asked for by the OP)
  • Rémi Vennereau
    Rémi Vennereau over 15 years
    Sounds like a neat idea. Anyone tried it and has a regexp for it?
  • PhiLho
    PhiLho over 15 years
    @IainMH: that's what wc does anyway (reading the file, counting line-ending).
  • PhiLho
    PhiLho over 15 years
    I doubt it is such a good idea: it will need to read the whole file at once (martinus avoids this) and regexes are overkill (and slower) for such usage (simple search of fixed char(s)).
  • Iain Holder
    Iain Holder over 15 years
    @PhiLho You'd have to use the -l switch to count the lines. (Don't you? - it's been a while)
  • Iain Holder
    Iain Holder over 15 years
    @Paul - you are of course 100% right. My only defence is that I posted that before my coffee. I'm as sharp as a button now. :D
  • wds
    wds over 15 years
    BufferedInputStream should be doing the buffering for you, so I don't see how using an intermediate byte[] array will make it any faster. You're unlikely to do much better than using readLine() repeatedly anyway (since that will be optimized towards by the API).
  • martinus
    martinus over 15 years
    I've benchmarked it with and without the BufferedInputStream, and it is faster when using it.
  • bendin
    bendin almost 15 years
    You're going to close that InputStream when you're done with it, aren't you?
  • Vishy
    Vishy almost 15 years
    If buffering helped, it would be because BufferedInputStream buffers 8K by default. Increase your byte[] to this size or larger and you can drop the BufferedInputStream, e.g. try 1024*1024 bytes.
  • newguy
    newguy about 13 years
    Works well until I use it on some Mac-format files or on files in which the last line doesn't have a '\n' character. The number will be incorrect in those situations. Although it is fast, I think I will stick to the "fit-all" readLine() method.
  • David Schmitt
    David Schmitt about 13 years
    @will: what about /\n/ ? @PhiLo: Regex Executors are highly-tuned performance machines. Except the read-everything-into-memory caveat, I don't think that a manual implementation can be faster.
  • Adam Norberg
    Adam Norberg almost 13 years
    An int can hold values of up to, approximately, 2 billion. If you are loading a file with more than 2 billion lines, you have an overflow problem. That said, if you are loading an unindexed text file with more than two billion lines, you probably have other problems.
  • Nathan Ryan
    Nathan Ryan over 11 years
    Two things: (1) The definition of a line terminator in Java source is a carriage return, a line feed, or a carriage return followed by a line feed. Your solution won't work for CR used as a line terminator. Granted, the only OS of which I can think that uses CR as the default line terminator is Mac OS prior to Mac OS X. (2) Your solution assumes a character encoding such as US-ASCII or UTF-8. The line count may be inaccurate for encodings such as UTF-16.
  • serg.nechaev
    serg.nechaev over 10 years
    @Nathan_Ryan: I just got logs from java app outputting some mainframe TCP service responses and there were a number of CRs inside. The program using the snippet above gracefully failed.
  • Ryan
    Ryan over 10 years
    Good catch. Not sure why you didn't just edit the accepted answer and make a note in a comment though. Most people won't read down this far.
  • DMulligan
    DMulligan over 10 years
    @Ryan , it just didn't feel right to edit a 4 year old accepted answer with 90+ upvotes.
  • Sebastian
    Sebastian over 10 years
    @AFinkelstein, I feel that is what makes this site so great, that you can edit the top voted answer.
  • Syed Aqeel Ashiq
    Syed Aqeel Ashiq over 10 years
    In this case, there is no need to use LineNumberReader; simply use BufferedReader, in which case you'll have the flexibility to use a long datatype for cnt.
  • Simon Brandhof
    Simon Brandhof over 10 years
    This solution does not handle carriage return (\r) and carriage return followed by a linefeed (\r\n)
  • doc
    doc about 10 years
    Nice. I would make this method static and rename it countLines. Cheers
  • nckbrz
    nckbrz about 10 years
    @Simon Brandhof, I'm confused about why a carriage return would be counted as another line? A "\n" is a Carriage return line feed, so whoever writes "\r\n" is not understanding something... Plus he is searching char by char, so I'm pretty sure if someone were to use "\r\n" it would still catch the "\n" and count the line. Either way I think he made the point just fine. However, there are many scenarios where this is not a sufficient way to get a line count.
  • Peter
    Peter about 9 years
    For what it's worth, I already had the byte[] and used the following: ` private int countLines(byte[] file) throws IOException { InputStream is = new ByteArrayInputStream(file);`
  • Ernestas Gruodis
    Ernestas Gruodis about 9 years
    This method shows one line less... Try to look at my answer below.
  • Ernestas Gruodis
    Ernestas Gruodis about 9 years
    Code has errors. Simple, but very slow... Try to look at my answer below (above).
  • aw-think
    aw-think about 9 years
    This isn't correct. I made some experiments with your code and the method is always slower. Stream<String>: time consumed 122796351, num lines 109808; method: time consumed 12838000, num lines 1. And the number of lines is even wrong too.
  • Ernestas Gruodis
    Ernestas Gruodis about 9 years
    I tested on a 32-bit machine; maybe on 64-bit the results would be different... And the difference was 10 times or more, as I remember. Could you post the text being counted somewhere? You can use Notepad2 to see the line breaks, for convenience.
  • aw-think
    aw-think about 9 years
    That could be the difference.
  • Christian Hujer
    Christian Hujer about 9 years
    It will fail on files whose line terminator does not include \n. The count is off by one (one less) for files with no newline at the end. What actually needs to be counted is not the number of \n characters but the number of character sequences separated by line terminators.
  • epb
    epb about 9 years
    LineNumberReader's lineNumber field is an integer... Won't it just wrap for files longer than Integer.MAX_VALUE? Why bother skipping by a long here?
  • Alexander Torstling
    Alexander Torstling about 8 years
    Adding one to the count is actually incorrect. wc -l counts the number of newline chars in the file. This works since every line is terminated with a newline, including the final line in a file. Every line has a newline character, including the empty lines, hence the number of newline chars == number of lines in a file. Now, the lineNumber variable in LineNumberReader also represents the number of newline chars seen. It starts at zero, before any newline has been found, and is increased with every newline char seen. So don't add one to the line number, please.
  • Alexander Torstling
    Alexander Torstling about 8 years
    @PB_MLT: Although you are right that a file with a single line without newline would be reported as 0 lines, this is how wc -l also reports this kind of file. Also see stackoverflow.com/questions/729692/…
  • Alexander Torstling
    Alexander Torstling about 8 years
    @PB_MLT: You get the opposite problem if the file consists solely of a newline. Your suggested algo would return 0 and wc -l would return 1. I concluded that all methods have flaws, and implemented one based on how I would like it to behave, see my other answer here.
  • user4321
    user4321 over 7 years
    A try-with-resources is a better way to do this: try (InputStream is = new BufferedInputStream(new FileInputStream(filename))) { /* rest of the code as above, without the finally block */ }
  • Holger
    Holger over 7 years
    If you care about performance, you should not use a BufferedInputStream when you are going to read into your own buffer anyway. Besides, even if your method might have a slight performance advantage, it loses flexibility, as it doesn't support sole \r line terminators (old MacOS) anymore and doesn't support every encoding.
  • amstegraf
    amstegraf over 7 years
    I've downvoted this response because it seems none of you have benchmarked it.
  • user3181500
    user3181500 over 6 years
    Awesome code... for a 400MB text file it took just a second. Thanks a lot @martinus
  • Chhorn Elit
    Chhorn Elit over 4 years
    [INFO] PMD Failure:xx:19 Rule:EmptyWhileStmt Priority:3 Avoid empty while statements.