What is a quick way to count lines in a 4TB file?
Solution 1
If this information is not already present as metadata in a separate file (or embedded in the data, or available through a query to the system you exported the data from), and if there is no index file of some description available, then the quickest way to count the number of lines is to run wc -l on the file.
You cannot really do it any quicker.
To count the number of records in the file, you will have to know what record separator is used, and count the records with something like awk. Again, that is assuming this information is not already stored elsewhere as metadata, is not available through a query to the originating system, and the records themselves are not already enumerated and sorted within the file.
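As a sketch of both approaches (the file paths and the ';' record separator below are made up for illustration — substitute your real export file and its actual separator):

```shell
# Hypothetical small sample standing in for the 4 TB export.
printf 'rec1\nrec2\nrec3\n' > /tmp/export.dat

# Count newline-terminated lines (the fast, general-purpose option).
wc -l < /tmp/export.dat          # prints 3

# If records use a different separator, e.g. ';' (hypothetical format),
# set awk's record separator so it counts records instead of lines.
printf 'rec1;rec2;rec3' > /tmp/export.rs
awk 'BEGIN { RS = ";" } END { print NR }' /tmp/export.rs   # prints 3
```

Note that with a multi-terabyte file both commands are I/O-bound, so the awk variant will be noticeably slower than wc -l; use it only when the record separator really is not a newline.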
Solution 2
Looping over files is a job for AWK ... nothing beats this speed:

LINECOUNT=`awk '{next}; END { print FNR }' $FILE`

[root@vmd28527 bin]# time LINECOUNT=`awk '{next}; END { print FNR }' $FILE`; echo $LINECOUNT

real 0m0.005s
user 0m0.001s
sys 0m0.004s
7168

5 ms for 7168 lines ... not bad ...
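A 7168-line file is far too small to say much about a 4 TB workload, and the comments below report wc -l beating awk. You can check the claim on your own machine by timing both against the same generated file (the size here is arbitrary; use something large enough to dominate startup cost):

```shell
# Build a throwaway test file as a stand-in for the real data.
f=$(mktemp)
seq 1000000 > "$f"

# Both print the same count; compare the elapsed times.
time awk '{next}; END { print FNR }' "$f"
time wc -l < "$f"

rm -f "$f"
```

Both commands print 1000000; on most systems wc -l wins because it only scans for newline bytes, while awk splits every record.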
Solution 3
You should not use line-based utilities such as awk and sed. These utilities issue a read() system call for every line in the input file (see that answer on why this is so). If you have lots of lines, this will be a huge performance loss.

Since your file is 4 TB in size, I guess that there are a lot of lines. So even wc -l will produce a lot of read() system calls, since it reads only 16384 bytes per call (on my system). Still, that would be an improvement over awk and sed. The best method - unless you write your own program - might be just

cat file | wc -l

This is no useless use of cat, because cat reads chunks of 131072 bytes per read() system call (on my system). wc -l will still issue more read() calls, but against the pipe rather than the file directly, and cat tries to read as much as possible per system call.
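Another way to force large reads, without cat's extra copy through a pipe, is dd with a big block size. Whether this actually beats plain wc -l depends on your kernel, filesystem, and storage, so treat it as something to benchmark rather than a guaranteed win (the generated file below is a stand-in for the real one):

```shell
# Stand-in for the actual 4 TB export file.
f=$(mktemp)
seq 100000 > "$f"

# dd issues 1 MiB read()s on the file; wc counts lines from the pipe.
dd if="$f" bs=1M 2>/dev/null | wc -l   # prints 100000

rm -f "$f"
```

With 4 TB of data the bottleneck is almost always the disk, not the syscall count, so in practice all of these variants tend to converge on the storage's sequential read speed.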
Updated on September 18, 2022

Comments
- Santosh Garole, almost 2 years: I have a 4 TB text file exported from Teradata records, and I want to know how many records (= lines, in my case) there are in that file. How may I do this quickly and efficiently?
- Panki, over 5 years: Is each line a record? If yes, you can just use wc -l.
- Stephen Kitt, over 5 years: This doesn't answer the stated question, but the fastest way would be to ask your Teradata system.
- Jeff Schaller, over 5 years: If the export happened to put a comment at the top, that'd make it pretty fast to find.
- Santosh Garole, over 5 years: I tried using vim -R filename; it took around 1.5 hours.
- pLumo, over 5 years: Won't an I/O redirect be faster than cat and a pipe?
- chaos, over 5 years: @RoVo Could be, have you tried it?
- pLumo, over 5 years: Short test with 10 iterations of wc -l with a 701 MB file: wc -l file 1.7 s; wc -l < file 1.7 s; cat file | wc -l 2.6 s.
- ilkkachu, over 4 years: "These utilities will issue a read() system call for every line in the input file" - that can't be true. read() only reads a bunch of bytes; it doesn't know how to read a line. The utilities might differ in the size of the buffer they use for read(), but that's not the same thing. It's likely that most utilities read at least a couple of kB in one go, and that's usually enough for at least a few lines.
- ilkkachu, over 4 years: @StephenKitt, I'm not so sure about that. tail might well be smart enough to start reading from the end of the file. Of course that doesn't make it any more useful to do all that unnecessary work with the grep, or help with the fact that the text in the last line might also appear elsewhere in the file.
- Stephen Kitt, over 4 years: @ilkkachu ah yes, tail can indeed work backwards (and the GNU version does). If the text appears multiple times, fgrep will show multiple matches, but will still show the last line. The results might not be accurate if the last line's contents aren't obvious from tail's output (e.g. an empty line or a line containing only whitespace).
- Santosh Garole, about 4 years: Thanks @Kusalananda for such a great explanation.
- roaima, almost 3 years: Personally, I find awk one of the slower tools. You may find that wc -l "$FILE" is significantly faster (almost double the speed in my tests).
- Heinz-Peter Heidinger, almost 3 years: You are right. For the simple purpose of just counting lines of a line-oriented file, wc -l is unbeatable in speed. But when it comes to traversing files and doing complex things, AWK is the tool of choice. AWK can outperform a shell solution (even Bash with a lot of built-in functions) by a factor of 100 or far more, depending on the complexity of the task.
- Stéphane Chazelas, over 2 years: But with cat file | wc -l, wc will still do its 16k read()s, this time on a pipe, and cat will do extra writes to that pipe, and the kernel will have to do extra work to shove bytes through that pipe. I can't see how that can improve matters.
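The three variants compared in the comments are easy to re-run yourself; the file size below is arbitrary, and results will vary with kernel, filesystem, and page-cache state (run each a few times so the file is cached consistently):

```shell
# Generate a test file once, then time each variant.
f=$(mktemp)
seq 5000000 > "$f"

time wc -l "$f"          # wc opens and reads the file itself
time wc -l < "$f"        # the shell opens the file; wc reads fd 0
time cat "$f" | wc -l    # extra copy through a pipe

rm -f "$f"
```

All three print the same count (the first also prints the file name); the pipe version does strictly more work, which matches both pLumo's measurements and Stéphane Chazelas's point above.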