Calculate and divide by total with AWK

43,703

Solution 1

To create the table with a single call to awk:

$ awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' data data
foo     10      10%
bar     20      20%
oof     50      50%
rab     20      20%

How it works

The file data is provided as an argument to awk twice. Consequently, it will be read twice, the first time to get the total, which is stored in the variable s, and the second to print the output. Looking at the commands in more detail:

  • FNR==NR{s+=$2;next;}

    NR is the total number of records (lines) that awk has read and FNR is the number of records read so far from the current file. Consequently, when FNR==NR, we are reading the first file. When this happens, the variable s is incremented by the value in the second column. Then, next tells awk to skip the rest of the commands and start over with the next record.

    Note that it is not necessary to initialize s to zero. In awk, all numeric variables are, by default, initialized to zero.

  • printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s

    If we reach this command, then we are processing the second file. This means that s now holds the total of column 2. So, we print column 1, column 2, and the percentage, 100*$2/s.

Output format options

With printf, detailed control of the output format is possible. The command above uses the %s format specifier which works for strings, integers, and floats. Three other option that might be useful here are:

  • %d formats numbers as integers. If the number is actually floating point, it will be truncated to an integer

  • %f formats numbers as floating point. It is also possible to specify widths and decimals places as, for example, %5.2f.

  • %e provides exponential notation. This would be useful if some numbers were exceptionally large or small.

Make a shell function

If you are going to use this more than once, it is an inconvenience to type a long command. Instead create either a function or a script to hole the command.

To create a function called totals, run the command:

$ totals() { awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' "$1" "$1"; }

With this function defined, the percentages for a data file called data can be found by running:

$ totals data

To make the definition of totals permanent, place it in your ~/.bashrc file.

Make a shell script

If you prefer a script, create a file called totals.sh with the contents:

#!/bin/sh
awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' "$1" "$1"

To get the percentages for a data file called data, run:

sh totals.sh data

Solution 2

The "simple" way of doing this would be to call awk twice: once to get the total, another time to calculate the ratios.

$ total=$(awk 'BEGIN{ total=0 } { total=total+$2 } END{ printf total }' data)
$ awk -v total=$total '{ print $1 "\t" $2 "\t" 100*$2/total "%" }' data

Now I'm sure someone will come up with a one-liner somehow...

Solution 3

Awk way with one file open(for completeness)

awk '{a[NR]=$0;x+=(b[NR]=$2)}END{while(++i<=NR)print a[i]"\t"100*b[i]/x"%"}' file

foo     10      10%
bar     20      20%
oof     50      50%
rab     20      20%

This will use more memory than the others but should be faster

This reads the line into array a and field two into array b.
Then increments x by the value in field 2.

At the end it iterates from 1 to the number of records and outputs the correct line and calculates the percentage.

Share:
43,703

Related videos on Youtube

Rip Leeb
Author by

Rip Leeb

Updated on September 18, 2022

Comments

  • Rip Leeb
    Rip Leeb almost 2 years

    Given the following data file...

    foo     10
    bar     20
    oof     50
    rab     20
    

    ... how would I print column two as a percent of the total of column two? In other words, I want...

    foo     10    10%
    bar     20    20%
    oof     50    50%
    rab     20    20%
    

    ... with less obvious numbers of course. I can create a running total easily enough, but I'm not sure how I can calculate the total before printing the lines. I am doing this in an awk file totals.awk...

    #!/usr/bin/awk -f
    BEGIN{
            runningtotal=0
    }
    {
            runningtotal=runningtotal+$2
            print $1 "\t" $2 "\t" runningtotal "\t" $2/runningtotal
    }
    

    So, running ./totals.awk data yields...

    foo     10      10      1
    bar     20      30      0.666667
    oof     50      80      0.625
    rab     20      100     0.2
    

    Is there a way to loop twice, once to calculate the total, and once to print the lines? Is this possible in AWK, or must I use other utilities?

    • don_crissti
      don_crissti over 9 years
    • syntaxerror
      syntaxerror over 9 years
      BTW you need not run it as cat data | ./totals.awk. Consider that awk WANTS a file, so just do a ./totals.awk data (or if you insist, totals.awk < data, in this case it should not matter) instead. Anyways, the gist is: in any case, omit that superfluous cat.
  • syntaxerror
    syntaxerror over 9 years
    Just a +1 for not using cat. It always hurts my eyes to see cat "misused" like that. Poor cat! (meow) ;)
  • yorkshiredev
    yorkshiredev over 9 years
    Another alternative would be to use input redirection: this would even save awk the trouble of opening the file.
  • syntaxerror
    syntaxerror over 9 years
    Yes, but chances are the latter will not be entirely POSIX-compliant then. (From the top of my head though, would have to try it out to be sure, perhaps I'm again confusing it with command substitution ;))
  • αғsнιη
    αғsнιη over 9 years
    Isn't good to use %d for second and third columns instead of %s for them?!
  • John1024
    John1024 over 9 years
    @KasiyA Based on your suggestion, I added a section on printf formats. %s will show integer or floating point as the number requires. %d would, of course, truncate floating point numbers to integer, potentially losing accuracy. Whether that format is better than more precise output is, I think, a matter for the OP to decide.
  • syntaxerror
    syntaxerror over 9 years
    Thanks, very interesting stuff. But do not even think of using this in C! As awk's printf syntax is pretty alike to the C syntax, this looks sooo darn tempting. But C, in contrary, will show itself obdurate here once more (just had to test it! :)) warning: format ‘%s’ expects argument of type ‘char *’, but argument 2 has type ‘int’. Trying to run that program even resulted in a segfault...!---So what this post is supposed to say is: don't assume that what you can do in awk would be possible in C as well. awk is WAY more liberal in that respect.
  • Admin
    Admin over 9 years
    @John1024 you can also use the form file{,} for two files or add more commas for more, it saves space if filenames are large.
  • Rip Leeb
    Rip Leeb over 9 years
    Is there a way to parse the output stored in the associative array a inside of END? Say I only wanted to print the first column and the percentage.
  • Rip Leeb
    Rip Leeb over 9 years
    I don't need a one-liner - in fact I am hoping for a .awk file which I can call like ./totals.awk data. Is there a way to make this work without needing to input the file twice like ./totals data data? I appreciate the explanations btw.
  • John1024
    John1024 over 9 years
    @Nate I have added sections for making the code into (a) shell function or (b) a shell script. Both work as well. Both eliminate the need for write the file name twice. Use whichever approach you are more comfortable with.
  • Admin
    Admin over 9 years
    @Nate There is but i don't think it would be necessary, you could just save the first column instead of the line, a[NR]=$1 instead of a[NR]=$0