Fastest way to sum Nth column in text file
Solution 1
$ datamash -t, count 3 sum 3 < file
3,604720
Some testing
$ time gawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
604720000000 3000000
real 0m2.851s
user 0m2.784s
sys 0m0.068s
$ time mawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
6.0472e+11 3000000
real 0m0.967s
user 0m0.920s
sys 0m0.048s
$ time perl -F, -nle '$sum += $F[2] }{ print "$.,$sum"' longfile
3000000,604720000000
real 0m3.394s
user 0m3.364s
sys 0m0.036s
$ time { cut -d, -f3 <longfile |paste -s -d+ - |bc ; }
604720000000
real 0m1.679s
user 0m1.416s
sys 0m0.248s
$ time datamash -t, count 3 sum 3 < longfile
3000000,604720000000
real 0m0.815s
user 0m0.716s
sys 0m0.036s
So mawk and datamash appear to be the pick of the bunch.
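To reproduce these timings you need a comparable input. A minimal sketch for generating a synthetic 8-column CSV like the one in the question (the row count here is a placeholder; the original benchmark used about 3 million rows):

```shell
# Generate a synthetic 8-column CSV named 'longfile' (100000 rows here,
# purely for illustration; scale up to match the original benchmark).
awk 'BEGIN {
    srand(42)
    for (i = 1; i <= 100000; i++)
        printf "%d,%d,%d,%d,%d,%d,%d,%d\n",
               i, i * 2, int(rand() * 1000), i % 100, 0, 12, 1, i % 9999
}' > longfile
wc -l < longfile
```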
Solution 2
Awk is a fast, efficient tool for processing text files.
awk -F',' '{ sum += $3 }
END{ printf "Sum of 3rd field: %d. Total number of lines: %d\n", sum, NR }' file
Sample output:
Sum of 3rd field: 604720. Total number of lines: 3
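One caveat worth noting: with large sums, %d may wrap in some awk implementations, and mawk's default print falls back to exponential notation (6.0472e+11 in the timings above). Using %.0f is a portable way to force fixed-point output; GNU awk also offers a -M option for arbitrary-precision arithmetic when built with MPFR. A sketch, using a made-up file sample.csv:

```shell
# Create a tiny sample whose column-3 sum exceeds 32-bit integer range.
printf '1,2,300000000000\n4,5,300000000000\n' > sample.csv

# %.0f prints the sum in fixed-point notation even when it is large.
awk -F',' '{ sum += $3 } END { printf "Sum: %.0f Lines: %d\n", sum, NR }' sample.csv
```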
Conceptual note:
The non-awk alternatives can only run faster on such "ideal", purely numeric columns. As soon as the format becomes slightly more complex, for example with extra markup that must be stripped before the calculation, as in
<1064458324:a,<38009543:b,<201507:c,<9:d,<0:e,<1:f,<1:g,1298
all those speed advantages go away (and some of those tools cannot perform the needed processing at all).
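To illustrate the point, here is a sketch that sums the third field of a messy format like the one above, where each value is wrapped as &lt;value:tag&gt;. awk strips the markup and sums in the same single pass, whereas the cut/paste/bc-style pipelines would need an extra cleanup stage:

```shell
# Small sample in the <value:tag> style mentioned above (made-up values).
printf '<10:a,<20:b,<30:c\n<1:a,<2:b,<3:c\n' > messy.csv

# Strip the leading '<' and the ':tag' suffix from field 3, then sum.
awk -F',' '{ f = $3; sub(/^</, "", f); sub(/:.*$/, "", f); sum += f }
           END { print sum, NR }' messy.csv
```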
Solution 3
You could use cut to extract the field, paste to insert + signs between the numbers, and bc to sum them. You could use wc to count the lines.
I'm not certain how this would perform over millions of lines, though. The golden rule of performance is: don't guess, measure. Profile any candidate solution to see whether it meets your needs, and to determine whether a change improves or degrades performance, and by how much.
Here is a solution that sums a given field and prints the number of lines:
echo -n "Sum: "
cut -d, -f3 <file |paste -s -d+ |bc
echo -n "Lines: "
wc -l <file
Output:
Sum: 604720
Lines: 3
The field is selected with the -f option to cut; here, cut -f3.
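The pipeline can be wrapped in a small helper so the column number is a parameter. The function name sumcol is mine, not from the answer; this is just a sketch:

```shell
# Hypothetical helper: sum column $1 of CSV file $2 and count its lines,
# using the cut | paste | bc pipeline from the answer above.
sumcol() {
    printf 'Sum: '
    cut -d, -f"$1" < "$2" | paste -s -d+ - | bc
    printf 'Lines: '
    wc -l < "$2"
}

printf '1,2,3\n4,5,6\n' > data.csv
sumcol 3 data.csv
```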
Elifarley
Updated on September 18, 2022
Comments
-
Elifarley over 1 year
I have a CSV file (in which the field separator is indeed comma) with 8 columns and a few million rows. Here's a sample:
1000024447,38111220,201705,181359,0,12,1,3090
1064458324,38009543,201507,9,0,1,1,1298
1064458324,38009543,201508,9,0,2,1,90017
What's the fastest way to print the sum of all numbers in a given column, as well as the number of lines read? Can you explain what makes it faster?
-
Jeff Schaller almost 6 years: Don't forget that sed can add, too! unix.stackexchange.com/questions/36949/addition-with-sed
-
Wildcard almost 6 years: Do you mean literally fastest as in performance? Or fastest to type on the fly, i.e. simplest? (Fastest for the programmer or fastest for the computer?) How long is your file?
-
Elifarley almost 6 years: Good question. I was thinking about how fast it can run. It's about a few million rows.
-
David Conrad almost 6 years: What is the numeric range of integers in Awk? If the OP has "a few million rows," will it overflow?
-
RomanPerekhrest almost 6 years: To be fair, I would suggest skipping the printf call in the awk timings above, since the other tools don't produce any formatted text output (they can't). Just: awk -F',' '{ sum += $3 } END{ print sum, NR }' file
-
smw almost 6 years: @RomanPerekhrest updated
-
Sundeep almost 6 years: @DavidConrad gawk has gnu.org/software/gawk/manual/html_node/…, dunno about other versions
-
David Conrad almost 6 years: You might also want to try gawk with the -M option, based on Sundeep's comment on Roman's answer. This is an excellent answer. Thank you for introducing me to datamash.
-
Stéphane Chazelas almost 6 years: With LC_ALL=C, gawk gives comparable timings to mawk. mawk doesn't support multi-byte characters.