How can I sort within an awk script on Linux?
Solution 1
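GNU awk (version 4.0 or later) gives you a neat way to control the order in which for (... in ...) traverses an array, by setting PROCINFO["sorted_in"]: see Controlling Array Traversal and Controlling Scanning in the gawk manual.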
gawk -F', ' '
{fruit[$1] = $2}
END {
OFS = FS
printf "\nordered by fruit name\n"
PROCINFO["sorted_in"] = "@ind_str_asc"
for (f in fruit) print f, fruit[f]
printf "\nordered by number\n"
PROCINFO["sorted_in"] = "@val_num_desc"
for (f in fruit) print f, fruit[f]
}
' fruit
outputs
ordered by fruit name
Apples, 12
Cheries, 7
Oranges, 2
Pears, 50
Strawberries, 36
ordered by number
Pears, 50
Strawberries, 36
Apples, 12
Cheries, 7
Oranges, 2
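PROCINFO["sorted_in"] is specific to gawk, and the comments further down note it only arrived in version 4.0. On other awks, much the same report can be produced by piping each loop through sort. The following is a minimal sketch, not part of the original answer; it assumes a POSIX awk and sort, and it recreates the question's fruit file in a scratch directory (the names tmp, report, by_name and by_num are made up for illustration). The close() calls make each sorted batch appear before the next heading is printed.

```shell
# Scratch copy of the sample file from the question (the temp dir is an assumption).
tmp=$(mktemp -d)
printf '%s\n' 'Apples, 12' 'Pears, 50' 'Cheries, 7' 'Strawberries, 36' 'Oranges, 2' > "$tmp/fruit"

report=$(awk -F', ' '
{ fruit[$1] = $2 }
END {
    OFS = FS
    by_name = "sort"               # string sort on the whole line
    by_num  = "sort -t, -k2,2nr"   # numeric, descending, on the field after the comma

    print "ordered by fruit name"
    fflush()                       # push the heading out before sort writes anything
    for (f in fruit) print f, fruit[f] | by_name
    close(by_name)                 # wait for sort to finish and emit its output

    print "ordered by number"
    fflush()
    for (f in fruit) print f, fruit[f] | by_num
    close(by_num)
}' "$tmp/fruit")
printf '%s\n' "$report"
rm -rf "$tmp"
```

This should print the same two listings as the gawk version, without the blank lines before the headings.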
Solution 2
You can actually pass awk's print through "sort" (note the quotes):
$ awk '{print "Fruit",NR, $0 | "sort -k 2 -t, -rn"}' fruit
Fruit 2 Pears, 50
Fruit 4 Strawberries, 36
Fruit 1 Apples, 12
Fruit 3 Cheries, 7
Fruit 5 Oranges, 2
So, to write to numbers, you can do:
awk '{print "Fruit",NR, $0 | "sort -k 2 -t, -rn > numbers"}' fruit
Note that I simplified your awk a bit. There's no need to use printf here or to explicitly print OFS since you aren't changing it anywhere. I also don't see what your for(i=1;i<=NF;i++)j+=$i is doing. You already have the number with NR and your printf wasn't using j anyway.
Solution 3
I must have had a serious problem with SunOS nawk in 2002. I found my test script, which contained three sort implementations that run within non-GNU awk:
(a) eSort: uses a workfile and reads back through a pipe running sort command. Not good in my case, because I was doing stuff through ssh for agentless monitoring, and external work files were too invasive for our live servers.
(b) qSort: a recursive partition sort. Performance bad for large data, and breaks the stack in mawk for > 2000 elements. Fun to write though.
(c) hSort: a sort-in-situ algorithm in 15 lines. This heap uses an indexing algorithm to support a binary tree (see Wikipedia).
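The indexing in (c) is the usual array layout of a binary heap: with slots numbered from 1, the children of slot j sit at 2j and 2j+1, and its parent at int(j/2). That is why hUp in the script below steps with k = j + j, and why hSort starts building the heap at int(n / 2), the last slot that has a child. A small sketch of that arithmetic for a hypothetical heap of 7 slots (the variable name tree is illustrative only):

```shell
tree=$(awk 'BEGIN {
    n = 7                                   # hypothetical heap size
    for (j = 1; j <= n; ++j) {
        line = "slot " j ":"
        if (j > 1)          line = line " parent " int(j / 2)
        if (2 * j <= n)     line = line " left " (2 * j)
        if (2 * j + 1 <= n) line = line " right " (2 * j + 1)
        print line
    }
}')
printf '%s\n' "$tree"
```

Slots 4 to 7 come out with a parent but no children, so they are the leaves and need no sifting during heap construction.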
This bash script contains awk functions hSort and hUp which implement the actual sort. One action line puts all the input into an array, and the END block calls hSort and reports the results.
The input data is the contents of "man bash", once as lines, and again as words. We use wc to prove nothing got lost, and sort -c to prove the output is sorted. The timings include the read and print overhead.
This is the test shot:
Paul--) ./hSort
Sorted 5251 elements.
real 0m0.120s
user 0m0.116s
sys 0m0.004s
5251 44463 273728 hSort.raw
sort: hSort.raw:2: disorder:
5251 44463 273728 hSort.srt
Sorted 44463 elements.
real 0m1.336s
user 0m1.316s
sys 0m0.008s
44463 44463 265333 hSort.raw
sort: hSort.raw:3: disorder: Commands
44463 44463 265333 hSort.srt
This is the script. Enjoy!
#! /bin/bash
export LC_ALL="C"
#### Heapsort algorithm.
function hSort { #:: (void) < text
local AWK='''
#.. Construct the heap, then unfold it.
function hSort (A, Local, n, j, e) {
for (j in A) ++n;
for (j = int (n / 2); j > 0; --j) hUp( j, A[j], n, A);
for (j = n; j > 1; --j) { e = A[j]; A[j] = A[1]; hUp( 1, e, j - 1, A); }
return (0 + n);
}
#.. Given an empty slot and its contents, pull any bigger elements up the tree.
function hUp (j, e, n, V, Local, k) {
while ((k = j + j) <= n) {
if (k + 1 <= n && STX V[k] < STX V[k + 1]) ++k;
if (STX e >= STX V[k]) break;
V[j] = V[k]; j = k;
}
V[j] = e;
}
{ U[++nU] = $0; }
END {
sz = hSort( U);
printf ("\nSorted %s elements.\n", sz) | "cat 1>&2";
for (k = 1; k in U; ++k) print U[k];
}
'''
mawk -f <( printf '%s\n' "${AWK}" )
}
#### Test Package Starts Here.
function Test {
time hSort < hSort.raw > hSort.srt
for fn in hSort.{raw,srt}; do wc "${fn}"; LC_ALL="C" sort -c "${fn}"; done
}
AWK_LINE='{ sub (/^[ \011]+/, ""); print; }'
AWK_WORD='{ for (f = 1; f <= NF; ++f) print $(f); }'
#xxx : > hSort.raw; Test #.. Edge cases.
#xxx echo "Hello" > hSort.raw; Test
#xxx { echo "World"; echo "Hello"; } > hSort.raw; Test
man bash | col -b | mawk "${AWK_LINE}" > hSort.raw; Test
man bash | col -b | mawk "${AWK_WORD}" > hSort.raw; Test
user25
Updated on September 18, 2022
Comments
-
user25 over 1 year
I have a file fruit that has the following content:
Apples, 12
Pears, 50
Cheries, 7
Strawberries, 36
Oranges, 2
I would like to sort the numerical data of the file:
for(i=1;i<=NF;i++)j+=$i;printf "Fruit %d%s, %d\n",NR,OFS,$1,j | sort -k 2 > "numbers"; j=0"
In order to run the awk script I run the command:
awk -f numbers fruit
The numbers file has the same content as fruit but its 1st and 2nd field are copied to the numbers file.
-
Ed Morton over 4 years
Instead of calling sort inside awk it's simpler and more efficient to simply print in awk and pipe the awk output to sort:
awk '{print ...}' fruit | sort ...
-
terdon over 4 years
@EdMorton oh, absolutely! I would never use this approach myself, what's the point? But this is what the OP asked for.
-
Paul_Pedant over 4 years
I often find a requirement to sort within gawk, when I don't want to sort the whole output. For example, collecting and reporting stats separately for each input file. I can use a decorate/sort/clip method to tailor-make simple keys from complex data (e.g. rank electrical equipment overloads using a side array of max ratings). Also, external sort uses disk workfiles and a split/merge strategy. Internal sort can use better methods.
-
Paul_Pedant over 4 years
Oh, I didn't post it. I asserted its existence, and left it as an exercise for the reader.
-
Paul_Pedant over 4 years
Posted the code and test on Jan 9th, 2020
-
Joe Skora over 3 years
@EdMorton @terdon I use this awk sort to sort body lines while leaving the header alone.
echo -e "HEADER\nline3\nline1\nline2" | awk 'NR<=1 {print} NR > 1 {print | "sort" }'
-
terdon over 3 years
@JoeSkora isn't it easier to use a subshell?
(printf 'HEADER'; printf '\nline3\nline1\nline2\n' | sort ) > file
Or, when sorting a file:
( head -n1 file; tail -n+2 file | sort) > newfile
-
Ed Morton over 3 years
@JoeSkora you don't need to spawn a subshell from awk and then hope that the buffering from all concerned leads to the output from the subshell getting to stdout after the rest of the output from the awk command instead of before it or, if applicable, in the middle of it. Just do
awk '{print (NR>1), $0}' | sort -k1,1n -k2 | cut -d' ' -f2-
-
Joe Skora over 3 years
@terdon I often use this across multiple SSH hops, so it has to be a single command stream.
-
Joe Skora over 3 years
@EdMorton I like printing the conditional, great idea. The last part can be simplified further, leaving this:
awk '{print (NR>1),$0}' | sort ... | cut -c3-
-
Ed Morton over 3 years
@JoeSkora that's true, there's just usually other fields to operate on too, e.g. also printing NR so you can retain original order when there are duplicate key values and you don't have GNU sort for -s, so it's usually more like
awk '{print (NR>1), NR, $0}' | sort -k1,1n -k3 -k2,2n | cut -d' ' -f3-
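To make the decorate/sort/undecorate idea concrete, here is that pipeline applied to the four-line sample used earlier in this thread. This is only a sketch, assuming a POSIX sort and cut; the variable name sorted is just for illustration.

```shell
sorted=$(printf '%s\n' HEADER line3 line1 line2 |
    awk '{ print (NR > 1), NR, $0 }' |   # decorate: 0 for the header, 1 for body lines, then the line number
    sort -k1,1n -k3 -k2,2n |             # header first, then by content, ties kept in input order
    cut -d' ' -f3-)                      # undecorate: drop the two helper fields
printf '%s\n' "$sorted"
```

The header stays on top because its first helper field is 0, and the body lines follow in sorted order.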
-
Nabheet about 2 years
I know that the documentation says this but this doesn't seem to be working on my CentOS6 (because reasons) server. And also on my MacBook (which I assume doesn't have GNU awk).
-
Angel Todorov about 2 years
macOS does not ship with GNU awk, but you can install it easily with Homebrew. For CentOS, what version is installed (gawk --version)?
-
Nabheet about 2 years
# gawk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
I assume these cool features are in gawk version 4?
-
Angel Todorov about 2 years
Yeah, looking at the NOTES in git.savannah.gnu.org/cgit/gawk.git/tree it looks like this feature was introduced in v4.0. I guess you have to call out to sort