Fastest way to sum Nth column in text file

7,540

Solution 1

GNU datamash

$ datamash -t, count 3 sum 3 < file
3,604720

Some testing

$ time gawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
604720000000 3000000

real    0m2.851s
user    0m2.784s
sys     0m0.068s

$ time mawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
6.0472e+11 3000000

real    0m0.967s
user    0m0.920s
sys     0m0.048s

$ time perl -F, -nle '$sum += $F[2] }{ print "$.,$sum"' longfile
3000000,604720000000

real    0m3.394s
user    0m3.364s
sys     0m0.036s

$ time { cut -d, -f3 <longfile |paste -s -d+ - |bc ; }
604720000000

real    0m1.679s
user    0m1.416s
sys     0m0.248s

$ time datamash -t, count 3 sum 3 < longfile
3000000,604720000000

real    0m0.815s
user    0m0.716s
sys     0m0.036s

So mawk and datamash appear to be the pick of the bunch.

Solution 2

Awk is a fast and performant tool for processing text files.

awk -F',' '{ sum += $3 }
           END{ printf "Sum of 3rd field: %d. Total number of lines: %d\n", sum, NR }' file

Sample output:

Sum of 3rd field: 604720. Total number of lines: 3

Conceptual note:
I must note that all those non-awk alternatives are able to run faster only for such "ideal" numeric columns. It only costs for you to have a slightly more complex format (for ex. with some additional information to be stripped before calculation <1064458324:a,<38009543:b,<201507:c,<9:d,<0:e,<1:f,<1:g,1298) and all those speed advantages will gone away (not to mention that some of them won't to able for perform the needed processing).

Solution 3

You could use cut to extract the field, paste to insert + signs between the numbers, and bc to sum them. You could use wc to count the lines.

I'm not certain what the performance would be over millions of lines, though. But the golden rule of performance is, don't guess, measure. You need to profile any solution to see if it provides the performance you need, and to determine whether any changes increase or decrease the performance, and by how much.

Here is a solution that sums a given field and prints the number of lines:

echo -n "Sum: "
cut -d, -f3 <file |paste -s -d+ |bc
echo -n "Lines: "
wc -l <file

Output:

Sum: 604720
Lines: 3

The field is specified by the -f# parameter to cut, here cut -f3.

Share:
7,540

Related videos on Youtube

Elifarley
Author by

Elifarley

Updated on September 18, 2022

Comments

  • Elifarley
    Elifarley over 1 year

    I have a CSV file (in which the field separator is indeed comma) with 8 columns and a few million rows. Here's a sample:

    1000024447,38111220,201705,181359,0,12,1,3090
    1064458324,38009543,201507,9,0,1,1,1298
    1064458324,38009543,201508,9,0,2,1,90017
    

    What's the fastest way to print the sum of all numbers in a given column, as well as the number of lines read? Can you explain what makes it faster?

    • Jeff Schaller
      Jeff Schaller almost 6 years
      Don't forget that sed can add, too! unix.stackexchange.com/questions/36949/addition-with-sed
    • Wildcard
      Wildcard almost 6 years
      Do you mean literally fastest as in performance? Or fastest to type on the fly, i.e. simplest? (Fastest for the programmer or fastest for the computer?) How long is your file?
    • Elifarley
      Elifarley almost 6 years
      Good question. I was thinking about how fast it can run. It's about a few million rows.
  • David Conrad
    David Conrad almost 6 years
    What is the numeric range of integers in Awk? If OP has "a few million rows," will it overflow?
  • RomanPerekhrest
    RomanPerekhrest almost 6 years
    To be fair, I would suggest to skip printf function call in the above awk timings as the other tools don't give any text formatted outputting (they can't). Just awk -F',' '{ sum += $3 } END{ print sum, NR }' file
  • smw
    smw almost 6 years
    @RomanPerekhrest updated
  • Sundeep
    Sundeep almost 6 years
    @DavidConrad gawk has gnu.org/software/gawk/manual/html_node/…, dunno about other versions
  • David Conrad
    David Conrad almost 6 years
    You might also want to try gawk with the -M option, based on Sundeep's comment on Roman's answer. This is an excellent answer. Thank you for introducing me to datamash.
  • Stéphane Chazelas
    Stéphane Chazelas almost 6 years
    With LC_ALL=C gawk gives comparable timings to mawk. mawk doesn't support multi-byte characters.