Efficiently counting the number of lines of a text file. (200mb+)
Solution 1
This will use less memory, since it doesn't load the whole file into memory:
$file = "largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while (!feof($handle)) {
    $line = fgets($handle);
    $linecount++;
}
fclose($handle);
echo $linecount;
fgets loads a single line into memory (if the second argument $length is omitted it will keep reading from the stream until it reaches the end of the line, which is what we want). This is still unlikely to be as quick as using something other than PHP, if you care about wall time as well as memory usage.
The only danger with this is if any lines are particularly long (what if you encounter a 2GB file without line breaks?). In that case you're better off slurping it in chunks and counting end-of-line characters:
$file = "largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while (!feof($handle)) {
    $line = fgets($handle, 4096);
    $linecount = $linecount + substr_count($line, PHP_EOL);
}
fclose($handle);
echo $linecount;
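As the comments below point out, PHP_EOL is platform-dependent ("\n" on Unix, "\r\n" on Windows), so counting it can miss lines in files written with the other convention. A minimal sketch that counts bare "\n" instead, which both conventions contain (the function name countLinesByChunks is illustrative, not part of the original answer):

```php
<?php

// Sketch: count "\n" rather than PHP_EOL, since both Unix ("\n") and
// Windows ("\r\n") line endings contain exactly one "\n" per line.
// countLinesByChunks is an illustrative name, not from the answer above.
function countLinesByChunks(string $file): int
{
    $handle = fopen($file, "rb");
    $linecount = 0;
    while (!feof($handle)) {
        $chunk = fread($handle, 4096);
        $linecount += substr_count($chunk, "\n");
    }
    fclose($handle);
    return $linecount;
}
```

Note this still shares the off-by-one caveat discussed further down: a final line without a trailing newline is not counted.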
Solution 2
Using a loop of fgets() calls is a fine solution and the most straightforward to write; however:
- even though internally the file is read using a buffer of 8192 bytes, your code still has to call that function for each line.
- it's technically possible that a single line may be bigger than the available memory if you're reading a binary file.
This code reads a file in chunks of 8kB each and then counts the number of newlines within that chunk.
function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0;
    while (!feof($f)) {
        $lines += substr_count(fread($f, 8192), "\n");
    }
    fclose($f);
    return $lines;
}
If the average length of each line is at most 4kB, you will already start saving on function calls, and those can add up when you process big files.
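A comment further down suggests that the tallying itself can be delegated to iterator_count(). A sketch of that idea using a generator (getLineIterator is a hypothetical helper name); note that it reads line by line with fgets(), so it trades the chunked read's speed for convenience:

```php
<?php

// Sketch: yield lines one at a time and let iterator_count() do the
// counting. getLineIterator is a hypothetical helper, not from the answer.
function getLineIterator(string $file): Generator
{
    $f = fopen($file, 'rb');
    while (($line = fgets($f)) !== false) {
        yield $line;
    }
    fclose($f);
}

// Usage (assuming largefile.txt exists):
// echo iterator_count(getLineIterator('largefile.txt'));
```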
Benchmark
I ran a test with a 1GB file; here are the results:
+---------+-------------+------------------+---------+
|         | This answer | Dominic's answer | wc -l   |
+---------+-------------+------------------+---------+
| Lines   | 3550388     | 3550389          | 3550388 |
+---------+-------------+------------------+---------+
| Runtime | 1.055       | 4.297            | 0.587   |
+---------+-------------+------------------+---------+
Time is measured in seconds of real ("wall clock") time.
True line count
While the above works well and returns the same results as wc -l, if the file ends without a newline the line count will be off by one; if you care about this particular scenario, you can make it more accurate using this logic:
function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0;
    $buffer = '';
    while (!feof($f)) {
        $buffer = fread($f, 8192);
        $lines += substr_count($buffer, "\n");
    }
    fclose($f);
    if (strlen($buffer) > 0 && $buffer[-1] != "\n") {
        ++$lines;
    }
    return $lines;
}
Solution 3
Simple object-oriented solution
$file = new \SplFileObject('file.extension');
while($file->valid()) $file->fgets();
var_dump($file->key());
Update
Another way to do this is to pass PHP_INT_MAX to the SplFileObject::seek method.
$file = new \SplFileObject('file.extension', 'r');
$file->seek(PHP_INT_MAX);
echo $file->key();
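As the comments below discuss, whether key() needs a + 1 after seeking depends on the PHP version and on whether the file ends with a trailing newline, so treat the result as approximate. A sketch (countLinesSpl is an illustrative name, not from the answer):

```php
<?php

// Sketch: seek past the last line, then read the zero-based line index.
// Depending on the PHP version and on a trailing newline, the result can
// be off by one (see the comments below). countLinesSpl is illustrative.
function countLinesSpl(string $file): int
{
    $f = new \SplFileObject($file, 'r');
    $f->seek(PHP_INT_MAX);
    return $f->key();
}
```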
Solution 4
If you're running this on a Linux/Unix host, the easiest solution would be to use exec()
or similar to run the command wc -l $path
. Just make sure you've sanitized $path
first to be sure that it isn't something like "/path/to/file ; rm -rf /".
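A sketch of that sanitization using escapeshellarg(), which later comments also suggest (countLinesWc is an illustrative name; this only works on a *nix host since it shells out to wc):

```php
<?php

// Sketch: let escapeshellarg() handle quoting instead of hand-rolled
// sanitization; 2>/dev/null suppresses "No such file or directory".
// countLinesWc is an illustrative name; requires a *nix host with wc.
function countLinesWc(string $file): int
{
    return intval(exec('wc -l ' . escapeshellarg($file) . ' 2>/dev/null'));
}
```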
Solution 5
There is a faster way I found that does not require looping through the entire file. It only works on *nix systems, though there might be a similar way on Windows...
$file = '/path/to/your.file';
//Get number of lines
$totalLines = intval(exec("wc -l '$file'"));
Abs
Updated on February 27, 2022

Comments
-
Abs about 2 years
I have just found out that my script gives me a fatal error:
Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 440 bytes) in C:\process_txt.php on line 109
That line is this:
$lines = count(file($path)) - 1;
So I think it is having difficulty loading the file into memory and counting the number of lines. Is there a more efficient way I can do this without having memory issues?
The text files that I need to count the number of lines for range from 2MB to 500MB. Maybe a Gig sometimes.
Thanks all for any help.
-
Abs over 14 years: Thanks for the explanation Dominic - that looks good. I had a feeling it had to be done line by line and not letting count of file load the whole thing into memory!
-
Abs over 14 years: I am on a windows machine! If I was, I think that would be the best solution!
-
David Schmitt over 14 years: The only danger of this snippet are huge files without linebreaks as fgets will then try to suck up the whole file. It'd be safer to read 4kB chunks at a time and count line termination characters.
-
Dominic Rodger over 14 years: @David - how does my edit look? I'm not 100% confident about PHP_EOL - does that look right?
-
nickf over 14 years: not perfect: you could have a unix-style file (\n) being parsed on a windows machine (PHP_EOL == '\r\n')
-
Dominic Rodger over 14 years: @nickf - good point. How would you address it? How does fgets work?
-
Lawrence Hutton over 14 years: @ghostdog74: Why, yes, you're right. It is non-portable. That's why I explicitly acknowledged my suggestion's non-portability by prefacing it with the clause "If you're running this on a Linux/Unix host...".
-
Manz over 11 years: Non-portable (though useful in some situations), but exec (or shell_exec or system) is a system call, which is considerably slower compared to PHP built-in functions.
-
Lawrence Hutton over 11 years: @Manz: Why, yes, you're right. It is non-portable. That's why I explicitly acknowledged my suggestion's non-portability by prefacing it with the clause "If you're running this on a Linux/Unix host...".
-
Manz over 11 years: @DaveSherohman Yes, you're right, sorry. IMHO, I think the most important issue is the time consumed in a system call (especially if you need to use it frequently)
-
Tegan Snyder about 11 years: add 2>/dev/null to suppress the "No such file or directory"
-
NikiC over 10 years: The try/finally is not strictly necessary, PHP will automatically close the file for you. You should probably also mention that the actual counting can be done using iterator_count(getFiles($file)) :)
-
pgee70 over 10 years: $total_lines = intval(exec("wc -l '$file'")); will handle file names with spaces.
-
Andy Braham over 10 years: Thanks pgee70, didn't come across that yet but makes sense; I updated my answer
-
Dejan Marjanović over 10 years: @Manz it is still 8 times faster (or more) on big files (see Jack's answer).
-
zerkms over 10 years: Curious how much faster (?) it will be if you extend the buffer size to something like 64k. PS: if only php had some easy way to make IO asynchronous in this case
-
Ja͢ck over 10 years: @zerkms To answer your question, with 64kB buffers it becomes 0.2 seconds faster on 1GB :)
-
Parris Varney about 10 years: This does not work with CSVs created with Excel on MacBooks. They only have carriage returns, and no newline, for line terminators.
-
psobko about 10 years: Interesting. What about skipping empty lines?
-
Zheng Kai almost 10 years: exec('wc -l '.escapeshellarg($file).' 2>/dev/null')
-
Oliver Charlesworth over 9 years: Be careful with this benchmark: which did you run first? The second one will have the benefit of the file already being in disk cache, massively skewing the result.
-
Ja͢ck over 9 years: @OliCharlesworth they're averages over five runs, skipping the first run :)
-
Cyril N. over 9 years: Why not improve it a bit by limiting the line reading to 1? Since we only want to count the number of lines, why not do a fgets($handle, 1); ?
-
mgutt about 9 years: @CyrilN. This depends on your setup. If you're having mostly files that contain only some chars per line it could be faster because you don't need to use substr_count(), but if you are having very long lines you need to call while() and fgets() much more, causing a disadvantage. Do not forget: fgets() does not read line by line. It reads only the amount of chars you defined through $length, and if it contains a linebreak it stops whatever $length has been set to.
-
mgutt about 9 years: @DominicRodger instead of using substr_count() you should use strpos() as $line will never include more than one linebreak. Or better use $last = strlen($line) - 1; if ($line[$last] == "\n" || $line[$last] == "\r") { $linecount++; }. This should be the fastest option.
-
Barmar about 9 years: Won't this return 1 more than the number of lines? while(!feof()) will cause you to read an extra line, because the EOF indicator isn't set until after you try to read at the end of file.
-
β.εηοιτ.βε almost 9 years: Please consider adding at least some words explaining to the OP and to further readers of your answer why and how it replies to the original question.
-
Daniele Orlando over 8 years: The second solution is great and uses Spl! Thanks.
-
Drasill over 8 years: Thank you! This is, indeed, great. And faster than calling wc -l (because of the forking, I suppose), especially on small files.
-
Wallace Vizerra about 8 years: I didn't think the solution would be so helpful!
-
Dalibor Karlović almost 8 years: Excellent solution!
-
Pocketsand over 7 years: @DominicRodger in the first example I believe $line = fgets($handle); could just be fgets($handle); because $line is never used.
-
ab3000 about 7 years: For the first solution: It counts an extra line because the loop runs once more than is necessary. To fix that, you need to move the fgets call to the end of the loop and clone it once above the loop as well.
-
Valdrinium over 6 years: This is the best solution by far
-
Tuim over 6 years: The original solution was this. But since file() loads the entire file in memory this was also the original issue (memory exhaustion), so no, this isn't a solution for the question.
-
caligari over 6 years: This answer is great! However, IMO, it must test whether there is some character in the last line to add 1 to the line count: pastebin.com/yLwZqPR2
-
user9645 about 4 years: Is the "key() + 1" right? I tried it and it seems wrong. For a given file with line endings on every line including the last, this code gives me 3998. But if I do "wc" on it, I get 3997. If I use "vim", it says 3997L (and does not indicate missing EOL). So I think the "Update" answer is wrong.
-
Déjà vu about 4 years: Looks like the answer by @DaveSherohman above was posted 3 years before this one
-
Wallace Vizerra about 3 years: @user9645 the key starts at zero. Considering a file that contains one line, key will return 0, but the correct count is 1
-
Eaten by a Grue almost 3 years: @WallaceMaxters - for whatever reason, this is wrong. I've tested on a zero-length and a 1-line file, and removing the + 1 gets the correct line count regardless of file length. Great answer though - thanks!
-
eyedmax over 2 years: Second function will return a wrong count if the last line contains some text, but no EOL.
-
eyedmax over 2 years: Function will return a wrong count if the last line contains some text, but no EOL.
-
Ja͢ck about 2 years: @eyedmax Surprisingly (or maybe not so) wc -l outputs the same number of lines in that condition (I tested with echo -n "hello world" > file.txt and both return 0)