GROUP BY/SUM from shell


Solution 1

Edit: The modern (GNU/Linux) solution, as mentioned in the comments years ago ;-).

awk '{
    arr[$1]+=$2
   }
   END {
     for (key in arr) printf("%s\t%s\n", key, arr[key])
   }' file \
   | sort -k1,1
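For example, with the sample input from the question (assuming it is saved in file), the pipeline sums the second field per key and prints one sorted line per key:

```shell
printf 'a 23\nb 8\na 22\nb 1\n' > file

# sum column 2 grouped by column 1, then sort by key
awk '{ arr[$1] += $2 }
     END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' file \
  | sort -k1,1
# a	45
# b	9
```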

The originally posted solution, based on old Unix sort options:

awk '{
    arr[$1]+=$2
   }
   END {
     for (key in arr) printf("%s\t%s\n", key, arr[key])
   }' file \
   | sort +0n -1

I hope this helps.

Solution 2

No need for awk here, or even sort -- if you have Bash 4.0, you can use associative arrays:

#!/bin/bash
declare -A values
while read key value; do
  values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
  printf "%s %s\n" "$key" "${values[$key]}"
done

...or, if you sort the file first (which is more memory-efficient: GNU sort can sort files larger than available memory, which a naive script -- whether in awk, Python or shell -- typically can't), you can do this in a way that works in older versions (I expect the following to work back through bash 2.0):

#!/bin/bash
read cur_key cur_value
while read key value; do
  if [[ $key = "$cur_key" ]] ; then
    cur_value=$(( cur_value + value ))
  else
    printf "%s %s\n" "$cur_key" "$cur_value"
    cur_key="$key"
    cur_value="$value"
  fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
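Putting the two pieces together (assuming the unsorted data lives in file), the loop can read the sorted stream directly as a group command, so at most one key's running total is held in memory at a time:

```shell
printf 'a 23\nb 8\na 22\nb 1\n' > file

# sort groups equal keys together, so the loop only needs one running total
sort -k1,1 file | {
  read cur_key cur_value
  while read key value; do
    if [[ $key = "$cur_key" ]]; then
      cur_value=$(( cur_value + value ))
    else
      printf "%s %s\n" "$cur_key" "$cur_value"
      cur_key=$key
      cur_value=$value
    fi
  done
  printf "%s %s\n" "$cur_key" "$cur_value"
}
# a 45
# b 9
```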

Solution 3

This Perl one-liner seems to do the job:

perl -nle '($k, $v) = split; $s{$k} += $v; END {$, = " "; foreach $k (sort keys %s) {print $k, $s{$k}}}' inputfile

Solution 4

This can be achieved with the following one-liner:

cat /path/to/file | termsql "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"

Or:

termsql -i /path/to/file "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"

Here a Python package, termsql, is used, which is a wrapper around SQLite. Note that currently it's not uploaded to PyPI, and it can only be installed system-wide (its setup.py is a little broken), like:

pip install --user https://github.com/tobimensch/termsql/archive/master.zip

Update

In 2020, version 1.0 was finally uploaded to PyPI, so pip install --user termsql can now be used.

Solution 5

With GNU awk (versions before 4):

WHINY_USERS= awk 'END {
  for (E in a)
    print E, a[E]
    }
{ a[$1] += $2 }' infile

With GNU awk >= 4:

awk 'END {
  PROCINFO["sorted_in"] = "@ind_str_asc"
  for (E in a)
    print E, a[E]
    }
{ a[$1] += $2 }' infile
Author: Legend

Updated on December 25, 2020

Comments

  • Legend
    Legend over 3 years

    I have a large file containing data like this:

    a 23
    b 8
    a 22
    b 1
    

    I want to be able to get this:

    a 45
    b 9
    

    I can first sort this file and then do it in Python by scanning the file once. What is a good direct command-line way of doing this?

  • EricR
    EricR about 12 years
    what exactly do those arguments do to sort? I don't see them in the man page and the invocation page has left me confused.
  • Mark Reed
    Mark Reed about 12 years
    Modern versions of sort prefer the -k syntax for specifying sort keys: sort -nk1,1 instead of sort +0n -1. But since the keys are letters, why are you specifying -n anyway?
  • Mark Reed
    Mark Reed about 12 years
    @EricR: +0n -1 is old-fashioned for -n -k1,1: sort numerically by the first (whitespace-separated) field.
  • Mark Reed
    Mark Reed about 12 years
    You could also let awk do the sorting: n = asorti(arr, keys); for (i = 1; i <= n; i++) printf "%s\t%s\n", keys[i], arr[keys[i]]
  • Charles Duffy
    Charles Duffy about 12 years
    @MarkReed you could, but GNU sort would be the more efficient approach if one wants to scale to really big inputs (particularly if those inputs will be larger than available memory; GNU sort will partition into temporary files in that case)
  • Mark Reed
    Mark Reed about 12 years
    @CharlesDuffy Point, but the associative array would be pretty huge at that point, too; I think awk would have trouble even without the sorting in that case.
  • Charles Duffy
    Charles Duffy about 12 years
    @MarkReed ...hence the second version given in my pure-bash answer, which doesn't store more than two lines at a time and thus scales to very large inputs nicely. :)
  • Mark Reed
    Mark Reed about 12 years
    Heck, with some minimal munging the above would work in vanilla Bourne, no bash required: while read key value; do if [ "$key" = "$cur_key" ]; then cur_value=`expr $cur_value + $value`; else echo "$cur_key $cur_value"; cur_key="$key"; cur_value="$value"; fi; done; echo "$cur_key $cur_value"
  • shellter
    shellter about 12 years
    Thanks folks, I was going to cover some of those points in a detailed answer. But 2 lines at a time, can't beat that! How is the performance? Good luck to all.
  • Mr. Llama
    Mr. Llama about 9 years
    I seem to have stumbled on this almost three years late. What exactly does the WHINY_USERS variable do?
  • Dimitre Radoulov
    Dimitre Radoulov about 9 years
    Sorts the array keys in asciibetical order.
  • Charles Duffy
    Charles Duffy about 8 years
    @MarkReed, definitely so, though the performance impact of the subshell running expr is sufficient that the POSIX sh $(( )) expansion would be better; while (( )) is a bash extension, $(( )) is standard-compliant; it's only pre-1991-POSIX-standard Bourne sh where expr is needed.
  • fedorqui
    fedorqui over 7 years
    You can make it a real one-liner saying awk '{sum[$1]+=$2} END {for (val in sum) print val, sum[val]}' <(sort file).
  • OrangeDog
    OrangeDog almost 7 years
    Link is dead, this one is more likely to live: stackoverflow.com/q/11697556/476716
  • SourceSeeker
    SourceSeeker over 5 years
    Another (Bash-specific, more concise) way to express the value addition in your first example: (( values["$key"] += value )) Note: the default-0 isn't necessary.
  • Petr
    Petr about 5 years
    I like the termsql tool! It can be handy!