GROUP BY/SUM from shell
Solution 1
Edit: the modern (GNU/Linux) solution, as mentioned in the comments years ago ;-).
awk '{
    arr[$1] += $2
}
END {
    for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort -k1,1
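As a quick sanity check, here is the same pipeline run against made-up sample data (the file name and contents below are assumptions for illustration, not from the original question):

```shell
# Build a small sample input, run the awk/sort pipeline from above, clean up.
printf '%s\n' 'a 23' 'b 8' 'a 22' 'b 1' > /tmp/groupby_sample.txt
awk '{
    arr[$1] += $2
}
END {
    for (key in arr) printf("%s\t%s\n", key, arr[key])
}' /tmp/groupby_sample.txt | sort -k1,1
# Output (tab-separated): "a<TAB>45" then "b<TAB>9"
rm -f /tmp/groupby_sample.txt
```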
The originally posted solution, based on old Unix sort options:
awk '{
    arr[$1] += $2
}
END {
    for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort +0n -1
I hope this helps.
Solution 2
No need for awk here, or even sort -- if you have Bash 4.0, you can use associative arrays:
#!/bin/bash
declare -A values
while read -r key value; do
    values["$key"]=$(( value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
    printf "%s %s\n" "$key" "${values[$key]}"
done
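For example, the same loop can be fed from a pipe; the sample data here is an assumption for illustration (requires bash 4+ for declare -A):

```shell
# Sample data piped into the associative-array loop; the braces keep the
# while and for loops in the same subshell so the array is shared.
printf '%s\n' 'a 23' 'b 8' 'a 22' 'b 1' |
{
    declare -A values
    while read -r key value; do
        values["$key"]=$(( value + ${values[$key]:-0} ))
    done
    for key in "${!values[@]}"; do
        printf '%s %s\n' "$key" "${values[$key]}"
    done
}
# prints "a 45" and "b 9" (key order is unspecified)
```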
...or, if you sort the file first (which is more memory-efficient: GNU sort can do tricks to sort files larger than memory, which a naive script -- whether in awk, Python or shell -- typically can't), you can do this in a way that works in older versions (I expect the following to work back through bash 2.0):
#!/bin/bash
read -r cur_key cur_value
while read -r key value; do
    if [[ $key = "$cur_key" ]]; then
        cur_value=$(( cur_value + value ))
    else
        printf "%s %s\n" "$cur_key" "$cur_value"
        cur_key=$key
        cur_value=$value
    fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
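Since this version only compares adjacent lines, the input must arrive sorted by key; a minimal end-to-end sketch with assumed sample data:

```shell
# Sort sample data by key, then fold adjacent runs (only two lines held at a time).
printf '%s\n' 'a 23' 'b 8' 'a 22' 'b 1' | sort -k1,1 |
{
    read -r cur_key cur_value
    while read -r key value; do
        if [[ $key = "$cur_key" ]]; then
            cur_value=$(( cur_value + value ))
        else
            printf '%s %s\n' "$cur_key" "$cur_value"
            cur_key=$key
            cur_value=$value
        fi
    done
    printf '%s %s\n' "$cur_key" "$cur_value"
}
# a 45
# b 9
```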
Solution 3
This Perl one-liner seems to do the job:
perl -nle '($k, $v) = split; $s{$k} += $v; END {$, = " "; foreach $k (sort keys %s) {print $k, $s{$k}}}' inputfile
Solution 4
This can easily be achieved with the following one-liner:
cat /path/to/file | termsql "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
Or:
termsql -i /path/to/file "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
Here a Python package, termsql, is used, which is a wrapper around SQLite. Note that currently it's not uploaded to PyPI, and it can only be installed system-wide (setup.py is a little broken), like:
pip install --user https://github.com/tobimensch/termsql/archive/master.zip
Update
In 2020, version 1.0 was finally uploaded to PyPI, so pip install --user termsql can be used.
Solution 5
With GNU awk (versions less than 4):
WHINY_USERS= awk 'END {
    for (E in a)
        print E, a[E]
}
{ a[$1] += $2 }' infile
With GNU awk >= 4:
awk 'END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (E in a)
        print E, a[E]
}
{ a[$1] += $2 }' infile
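With a non-GNU awk, where neither trick above applies, the same sorted output can be had portably by piping the summary through sort, as discussed in the comments; sample data assumed:

```shell
# Any POSIX awk: aggregate unsorted, then sort the per-key summary.
printf '%s\n' 'a 23' 'b 8' 'a 22' 'b 1' |
  awk '{ a[$1] += $2 } END { for (E in a) print E, a[E] }' |
  sort -k1,1
# a 45
# b 9
```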
Comments
-
Legend over 3 years
I have a large file containing data like this:
a 23
b 8
a 22
b 1
I want to be able to get this:
a 45
b 9
I can first sort this file and then do it in Python by scanning the file once. What is a good direct command-line way of doing this?
-
EricR about 12 years
What exactly do those arguments do to sort? I don't see them in the man page and the invocation page has left me confused.
-
Mark Reed about 12 years
Modern versions of sort prefer the -k syntax for specifying sort keys: sort -nk1,1 instead of sort +0n -1. But since the keys are letters, why are you specifying -n anyway?
-
Mark Reed about 12 years
@EricR: +0n -1 is old-fashioned for -n -k1,1: sort numerically by the first (whitespace-separated) field.
-
Mark Reed about 12 years
You could also let awk do the sorting: asorti(arr, keys); for (i in keys) { printf "%s\t%s\n", keys[i], arr[keys[i]] }
-
Charles Duffy about 12 years
@MarkReed you could, but GNU sort would be the more efficient approach if one wants to scale to really big inputs (particularly if those inputs will be larger than available memory; GNU sort will partition into temporary files in that case).
-
Mark Reed about 12 years
@CharlesDuffy Point, but the associative array would be pretty huge at that point, too; I think awk would have trouble even without the sorting in that case.
-
Charles Duffy about 12 years
@MarkReed ...hence the second version given in my pure-bash answer, which doesn't store more than two lines at a time and thus scales to very large inputs nicely. :)
-
Mark Reed about 12 years
Heck, with some minimal munging the above would work in vanilla Bourne, no bash required: while read key value; do if [ "$key" = "$cur_key" ]; then cur_value=`expr $cur_value + $value`; else echo "$cur_key $cur_value"; cur_key="$key"; cur_value="$value"; fi; done; echo "$cur_key $cur_value"
-
shellter about 12 years
Thanks folks, I was going to cover some of those points in a detailed answer. But two lines at a time, can't beat that! How is the performance? Good luck to all.
-
Mr. Llama about 9 years
I seem to have stumbled on this almost three years late. What exactly does the WHINY_USERS variable do?
-
Dimitre Radoulov about 9 years
Sorts the array keys in asciibetical order.
-
Charles Duffy about 8 years
@MarkReed, definitely so, though the performance impact of the subshell running expr is sufficient that the POSIX sh $(( )) expansion would be better; while (( )) is a bash extension, $(( )) is standard-compliant; it's only pre-1991-POSIX-standard Bourne sh where expr is needed.
-
fedorqui over 7 years
You can make it a real one-liner saying awk '{sum[$1]+=$2} END {for (val in sum) print val, sum[val]}' <(sort file).
-
OrangeDog almost 7 years
Link is dead; this one is more likely to live: stackoverflow.com/q/11697556/476716
-
SourceSeeker over 5 years
Another (Bash-specific, more concise) way to express the value addition in your first example: (( values["$key"] += value )). Note: the default-0 isn't necessary.
-
Petr about 5 years
I like the termsql tool! It can be handy!