Fastest way to sum Nth column in text file

shell-script text-processing awk sed

7,540

Solution 1

$ datamash -t, count 3 sum 3 < file
3,604720

Some testing

$ time gawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
604720000000 3000000

real    0m2.851s
user    0m2.784s
sys     0m0.068s

$ time mawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
6.0472e+11 3000000

real    0m0.967s
user    0m0.920s
sys     0m0.048s

$ time perl -F, -nle '$sum += $F[2] }{ print "$.,$sum"' longfile
3000000,604720000000

real    0m3.394s
user    0m3.364s
sys     0m0.036s

$ time { cut -d, -f3 <longfile |paste -s -d+ - |bc ; }
604720000000

real    0m1.679s
user    0m1.416s
sys     0m0.248s

$ time datamash -t, count 3 sum 3 < longfile
3000000,604720000000

real    0m0.815s
user    0m0.716s
sys     0m0.036s

So mawk and datamash appear to be the pick of the bunch.

Solution 2

Awk is a fast and performant tool for processing text files.

awk -F',' '{ sum += $3 }
           END{ printf "Sum of 3rd field: %d. Total number of lines: %d\n", sum, NR }' file

Sample output:

Sum of 3rd field: 604720. Total number of lines: 3

Conceptual note:
I must note that all those non-awk alternatives are able to run faster only for such "ideal" numeric columns. It only costs for you to have a slightly more complex format (for ex. with some additional information to be stripped before calculation <1064458324:a,<38009543:b,<201507:c,<9:d,<0:e,<1:f,<1:g,1298) and all those speed advantages will gone away (not to mention that some of them won't to able for perform the needed processing).

Solution 3

You could use cut to extract the field, paste to insert + signs between the numbers, and bc to sum them. You could use wc to count the lines.

I'm not certain what the performance would be over millions of lines, though. But the golden rule of performance is, don't guess, measure. You need to profile any solution to see if it provides the performance you need, and to determine whether any changes increase or decrease the performance, and by how much.

Here is a solution that sums a given field and prints the number of lines:

echo -n "Sum: "
cut -d, -f3 <file |paste -s -d+ |bc
echo -n "Lines: "
wc -l <file

Output:

Sum: 604720
Lines: 3

The field is specified by the -f# parameter to cut, here cut -f3.

7,540

Elifarley

Updated on September 18, 2022

Comments

Elifarley over 1 year
I have a CSV file (in which the field separator is indeed comma) with 8 columns and a few million rows. Here's a sample:
```
1000024447,38111220,201705,181359,0,12,1,3090
1064458324,38009543,201507,9,0,1,1,1298
1064458324,38009543,201508,9,0,2,1,90017
```
What's the fastest way to print the sum of all numbers in a given column, as well as the number of lines read? Can you explain what makes it faster?
- Jeff Schaller almost 6 years
  
  Don't forget that sed can add, too! unix.stackexchange.com/questions/36949/addition-with-sed
- Wildcard almost 6 years
  
  Do you mean literally fastest as in performance? Or fastest to type on the fly, i.e. simplest? (Fastest for the programmer or fastest for the computer?) How long is your file?
- Elifarley almost 6 years
  
  Good question. I was thinking about how fast it can run. It's about a few million rows.
David Conrad almost 6 years

What is the numeric range of integers in Awk? If OP has "a few million rows," will it overflow?
RomanPerekhrest almost 6 years

To be fair, I would suggest to skip printf function call in the above awk timings as the other tools don't give any text formatted outputting (they can't). Just awk -F',' '{ sum += $3 } END{ print sum, NR }' file
smw almost 6 years

@RomanPerekhrest updated
Sundeep almost 6 years

@DavidConrad gawk has gnu.org/software/gawk/manual/html_node/…, dunno about other versions
David Conrad almost 6 years

You might also want to try gawk with the -M option, based on Sundeep's comment on Roman's answer. This is an excellent answer. Thank you for introducing me to datamash.
Stéphane Chazelas almost 6 years

With LC_ALL=C gawk gives comparable timings to mawk. mawk doesn't support multi-byte characters.