What is a quick way to count lines in a 4TB file?

Solution 1

If this information is not already present as metadata in a separate file (or embedded in the data, or available through a query to the system that you exported the data from), and if there is no index file of some description available, then the quickest way to count the number of lines is by running wc -l on the file.

You cannot really do it any quicker.
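
For example, a minimal sketch (the path below is just a placeholder for your export file):

wc -l /data/export.txt        # prints the line count followed by the file name
wc -l < /data/export.txt      # prints only the number, without the file name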

To count the number of records in the file, you will have to know which record separator is in use and use something like awk to count them. Again, that is assuming this information is not already stored elsewhere as metadata, that it is not available through a query to the originating system, and that the records themselves are not already enumerated and sorted within the file.
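
As an illustration only, a minimal awk sketch that assumes (hypothetically) the records are delimited by the ASCII record-separator character (0x1E, octal 036) instead of newlines; the path is again a placeholder:

awk 'BEGIN { RS = "\036" } END { print NR }' /data/export.txt    # NR holds the record count at END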

Solution 2

Looping over the lines of a file is a job for AWK ... nothing can beat this speed:

LINECOUNT=`awk '{next}; END { print FNR }' "$FILE"`

[root@vmd28527 bin]# time LINECOUNT=`awk '{next}; END { print FNR }' $FILE`; echo $LINECOUNT

real    0m0.005s
user    0m0.001s
sys     0m0.004s
7168

5 msec for 7168 lines ... not bad ...
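
If you want to check this against wc -l on your own data, a minimal timing sketch (the file path is a placeholder; as noted in the comments below, wc -l may well come out ahead for plain line counting):

FILE=/data/export.txt    # placeholder path

time awk '{next}; END { print FNR }' "$FILE"    # the awk approach from above
time wc -l "$FILE"                              # plain wc for comparison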

Solution 3

You should not use line-based utilities such as awk and sed. These utilities will issue a read() system call for every line in the input file (see that answer for why this is so). If you have lots of lines, this will be a huge performance loss.

Since your file is 4 TB in size, I guess that there are a lot of lines. So even wc -l will produce a lot of read() system calls, since it reads only 16384 bytes per call (on my system). Still, this would be an improvement over awk and sed. The best method, unless you write your own program, might be just

cat file | wc -l

This is not a useless use of cat, because cat reads chunks of 131072 bytes per read() system call (on my system). wc -l will still issue more read() calls, but against the pipe rather than against the file directly. Either way, cat tries to read as much as possible per system call.
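
If you want to verify the read() sizes and call counts on your own system, a minimal sketch with strace (the path is a placeholder):

strace -c -e trace=read wc -l /data/export.txt                      # reads made by wc on the file directly
strace -f -c -e trace=read sh -c 'cat /data/export.txt | wc -l'     # -f follows both cat and wc in the pipeline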


Comments

  • Santosh Garole
    Santosh Garole almost 2 years

    I have a 4 TB text file exported from Teradata, and I want to know how many records (= lines, in my case) there are in that file.

    How may I do this quickly and efficiently?

    • Panki
      Panki over 5 years
      Is each line a record? If yes, you can just use wc -l
    • Stephen Kitt
      Stephen Kitt over 5 years
      This doesn’t answer the stated question, but the fastest way would be to ask your Teradata system.
    • Jeff Schaller
      Jeff Schaller over 5 years
      If the export happened to put a comment at the top, that'd make it pretty fast to find.
    • Santosh Garole
      Santosh Garole over 5 years
      I tried using vim -R filename; it took around 1.5 hours.
  • pLumo
    pLumo over 5 years
    Won't an I/O redirect be faster than cat and a pipe?
  • chaos
    chaos over 5 years
    @RoVo Could be, have you tried it?
  • pLumo
    pLumo over 5 years
    Short test with 10 iterations of wc -l with a 701MB file: wc -l file 1.7s ;; wc -l < file 1.7s ;; cat file | wc -l 2.6s.
  • ilkkachu
    ilkkachu over 4 years
    "These utilities will issue a read() system call for every line in the input file" -- That can't be true. read() only reads a bunch of bytes, it doesn't know how to read a line. The utilities might differ in the size of a buffer they use for read(), but that's not the same. It's likely that most utilities will read at least a couple of kB in one go, and that's usually enough for a few lines at minimum.
  • ilkkachu
    ilkkachu over 4 years
    @StephenKitt, I'm not so sure about that. tail might well be smart enough to start reading from the end of the file. Of course that doesn't make it any more useful to do all that unnecessary work with the grep, or help with the fact that the text in the last line might also appear elsewhere in the file.
  • Stephen Kitt
    Stephen Kitt over 4 years
    @ilkkachu ah yes, tail can indeed work backwards (and the GNU version does). If the text appears multiple times, fgrep will show multiple matches, but will still show the last line. The results might not be accurate if the last line’s contents aren’t obvious from tail’s output (e.g. an empty line or a line containing only whitespace).
  • Santosh Garole
    Santosh Garole about 4 years
    Thanks @Kusalananda for such a great explanation.
  • roaima
    roaima almost 3 years
    Personally, I find awk one of the slower tools. You may find that wc -l "$FILE" is significantly faster (almost double the speed in my tests).
  • Heinz-Peter Heidinger
    Heinz-Peter Heidinger almost 3 years
    You are right. For the simple purpose of just counting lines in a line-oriented file, 'wc -l' is unbeatable in speed. But when it comes to traversing files and doing complex things, AWK is the tool of choice. AWK can outperform a shell solution (even Bash with a lot of built-in functions) by a factor of 100 or even far more, depending on the complexity of the task ...
  • Stéphane Chazelas
    Stéphane Chazelas over 2 years
    But with cat file | wc -l, wc will still do its 16k read()s, this time on a pipe, and cat will do extra writes to that pipe, and the kernel will have to do extra work to shove bytes through that pipe, I can't see how that can improve matters.
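
Picking up the timing comparisons in the comments above, a minimal sketch to compare the three variants on your own file (the path is a placeholder; results depend on your system and on whether the file is already in the page cache):

FILE=/data/export.txt    # placeholder path

time wc -l "$FILE"           # wc opens and reads the file itself
time wc -l < "$FILE"         # the shell opens the file, wc reads from stdin
time cat "$FILE" | wc -l     # extra copy through a pipe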