How can I sort within an awk script on Linux?

shell text-processing awk sort

10,157

Solution 1

GNU awk (version 4.0 or later) lets you control the order in which a for-in loop traverses an array: see "Controlling Array Traversal" and "Controlling Scanning" in the gawk manual.

gawk -F', ' '
    {fruit[$1] = $2}
    END {
        OFS = FS

        printf "\nordered by fruit name\n"
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (f in fruit) print f, fruit[f]

        printf "\nordered by number\n"
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (f in fruit) print f, fruit[f]
    }
' fruit

outputs

ordered by fruit name
Apples, 12
Cheries, 7
Oranges, 2
Pears, 50
Strawberries, 36

ordered by number
Pears, 50
Strawberries, 36
Apples, 12
Cheries, 7
Oranges, 2
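
If PROCINFO["sorted_in"] is not available (gawk older than 4.0, mawk, or the awk shipped with macOS), one fallback is to sort the keys by hand. The following is only a minimal sketch, not part of the original answer: the function name sort_by_number is made up for illustration, it assumes each fruit name appears once, and the insertion sort is only sensible for small inputs.

```shell
# Hypothetical helper: print "name, number" lines ordered by number, largest first.
# Uses only POSIX awk features, so it does not need gawk.
sort_by_number() {
    awk -F', ' '
        { fruit[$1] = $2; key[++n] = $1 }   # assumes each name appears only once
        END {
            OFS = FS
            # Insertion sort of the key list, comparing the stored numbers.
            for (i = 2; i <= n; i++) {
                k = key[i]
                for (j = i - 1; j >= 1 && fruit[key[j]] + 0 < fruit[k] + 0; j--)
                    key[j + 1] = key[j]
                key[j + 1] = k
            }
            for (i = 1; i <= n; i++) print key[i], fruit[key[i]]
        }
    ' "$@"
}

# Usage with the sample data from the question, fed on stdin:
printf '%s\n' 'Apples, 12' 'Pears, 50' 'Cheries, 7' 'Strawberries, 36' 'Oranges, 2' |
    sort_by_number
```

With a file on disk, sort_by_number fruit should behave the same way, since the file name is passed straight through to awk.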

Solution 2

You can actually pipe awk's print output to sort from inside the script. Note the quotes: the command has to be a string.

$ awk '{print "Fruit",NR, $0 | "sort -k 2 -t, -rn"}' fruit 
Fruit 2 Pears, 50
Fruit 4 Strawberries, 36
Fruit 1 Apples, 12
Fruit 3 Cheries, 7
Fruit 5 Oranges, 2

So, to write the sorted result to a file named numbers, put the redirection inside the quoted command:

awk '{print "Fruit",NR, $0 | "sort -k 2 -t, -rn > numbers"}' fruit

Note that I simplified your awk a bit. There's no need to use printf here or to explicitly print OFS since you aren't changing it anywhere. I also don't see what your for(i=1;i<=NF;i++)j+=$i is doing. You already have the number with NR and your printf wasn't using j anyway.
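
One thing to keep in mind with this approach: sort only produces output once its input pipe is closed, which by default happens when awk exits. If the script also prints directly to stdout, the relative order of the two streams depends on buffering. A hedged sketch of one way to make the order predictable is to flush awk's own output before the first piped line and to close the pipe explicitly. The function name sort_body is hypothetical, and fflush() with no argument is assumed to be supported (it is in gawk, mawk and BWK awk as far as I know).

```shell
# Hypothetical helper: keep the first line as a header, sort the remaining lines.
sort_body() {
    awk '
        NR == 1 { print; fflush(); next }   # push the header out before sort is started
        { print | "sort" }                  # body lines go to the external sort
        END { close("sort") }               # close the pipe so sort finishes and prints
    ' "$@"
}

printf 'HEADER\nline3\nline1\nline2\n' | sort_body
```

The string passed to close() has to match the command string used after the pipe symbol exactly.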

Solution 3

I must have had a serious problem with SunOS nawk in 2002. I found my test script, which contains three sort implementations that run in non-GNU awk:

(a) eSort: uses a workfile and reads it back through a pipe running the sort command. Not good in my case, because I was doing stuff through ssh for agentless monitoring, and external work files were too invasive for our live servers.

(b) qSort: a recursive partition sort. Performance is bad for large data, and it breaks the stack in mawk for > 2000 elements. Fun to write though.

(c) hSort: an in-place heapsort in 15 lines. The heap is a binary tree stored implicitly in the array, with the children of slot j at slots 2j and 2j+1 (see Wikipedia).

This bash script contains awk functions hSort and hUp which implement the actual sort. One action line puts all the input into an array, and the END block calls hSort and reports the results.

The input data is the contents of "man bash", once as lines, and again as words. We use wc to prove nothing got lost, and sort -c to prove the output is sorted. The timings include the read and print overhead.

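This is the test run. The "disorder" messages come from sort -c on the unsorted .raw input; sort -c prints nothing for the sorted .srt output: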

Paul--) ./hSort

Sorted 5251 elements.

real    0m0.120s
user    0m0.116s
sys     0m0.004s
  5251  44463 273728 hSort.raw
sort: hSort.raw:2: disorder: 
  5251  44463 273728 hSort.srt

Sorted 44463 elements.

real    0m1.336s
user    0m1.316s
sys     0m0.008s
 44463  44463 265333 hSort.raw
sort: hSort.raw:3: disorder: Commands
 44463  44463 265333 hSort.srt

This is the script. Enjoy!

#! /bin/bash

export LC_ALL="C"

#### Heapsort algorithm.

function hSort {    #:: (void) < text

    local AWK='''
#.. Construct the heap, then unfold it.
function hSort (A, Local, n, j, e) {
    for (j in A) ++n;
    for (j = int (n / 2); j > 0; --j) hUp( j, A[j], n, A);
    for (j = n; j > 1; --j) { e = A[j]; A[j] = A[1]; hUp( 1, e, j - 1, A); }
    return (0 + n);
}
#.. Given an empty slot and its contents, pull any bigger elements up the tree.
#.. STX is never assigned, so it is the empty string; prefixing it forces a string comparison.
function hUp (j, e, n, V, Local, k) {
    while ((k = j + j) <= n) {
        if (k + 1 <= n  &&  STX V[k] < STX V[k + 1]) ++k;
        if (STX e >= STX V[k]) break;
        V[j] = V[k]; j = k;
    }
    V[j] = e;
}
{ U[++nU] = $0; }
END {
    sz = hSort( U);
    printf ("\nSorted %s elements.\n", sz) | "cat 1>&2";
    for (k = 1; k in U; ++k) print U[k];
}
'''
    mawk -f <( printf '%s\n' "${AWK}" )
}

#### Test Package Starts Here.

function Test {
    time hSort < hSort.raw > hSort.srt
    for fn in hSort.{raw,srt}; do wc "${fn}"; LC_ALL="C" sort -c "${fn}"; done
}
    AWK_LINE='{ sub (/^[ \011]+/, ""); print; }'
    AWK_WORD='{ for (f = 1; f <= NF; ++f) print $(f); }'

    #xxx : > hSort.raw; Test        #.. Edge cases.
    #xxx echo "Hello" > hSort.raw; Test
    #xxx { echo "World"; echo "Hello"; } > hSort.raw; Test

    man bash | col -b | mawk "${AWK_LINE}" > hSort.raw; Test
    man bash | col -b | mawk "${AWK_WORD}" > hSort.raw; Test
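
For comparison, the recursive partition sort described in (b) could look roughly like the sketch below. This is not the 2002 code, which is not shown in this answer; it is the textbook quicksort with a middle-element pivot, and the names qsort_lines and qSort are chosen here for illustration. It compares as strings under LC_ALL=C, like the script above, and the recursion depth grows with unlucky input, which is presumably what exhausted the stack in mawk.

```shell
# Hypothetical helper: sort stdin (or the named files) line by line with a recursive quicksort.
qsort_lines() {
    LC_ALL=C awk '
        # Sort A[lo..hi] in place; i, last and t are local variables.
        function qSort(A, lo, hi,    i, last, t) {
            if (lo >= hi) return
            # Move the middle element to the front and use it as the pivot.
            i = int((lo + hi) / 2); t = A[lo]; A[lo] = A[i]; A[i] = t
            last = lo
            for (i = lo + 1; i <= hi; i++)
                if ((A[i] "") < (A[lo] "")) {   # appending "" forces a string comparison
                    ++last; t = A[last]; A[last] = A[i]; A[i] = t
                }
            # Put the pivot between the two partitions, then sort each side.
            t = A[lo]; A[lo] = A[last]; A[last] = t
            qSort(A, lo, last - 1)
            qSort(A, last + 1, hi)
        }
        { U[++n] = $0 }
        END {
            qSort(U, 1, n + 0)
            for (k = 1; k <= n; k++) print U[k]
        }
    ' "$@"
}

printf 'pear\napple\nfig\n' | qsort_lines
```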

user25

Updated on September 18, 2022

Comments

user25 over 1 year
I have file fruit that has the following content:
```
Apples, 12
Pears, 50
Cheries, 7
Strawberries, 36
Oranges, 2
```
I would like to sort the numerical data of the file:
```
for(i=1;i<=NF;i++)j+=$i;printf "Fruit %d%s, %d\n",NR,OFS,$1,j | sort -k 2 > "numbers"; j=0"
```
In order to run the awk script I run the command:
```
awk -f numbers fruit
```
The numbers file has the same content as fruit but its 1st and 2nd field are copied to the numbers file.
Ed Morton over 4 years

Instead of calling sort inside awk it's simpler and more efficient to simply print in awk and pipe the awk output to sort: awk '{print ...}' fruit | sort ....
terdon over 4 years

@EdMorton oh, absolutely! I would never use this approach myself, what's the point? But this is what the OP asked for.
Paul_Pedant over 4 years

I often find a requirement to sort within gawk, when I don't want to sort the whole output. For example, collecting and reporting stats separately for each input file. I can use a decorate/sort/clip method to tailormake simple keys from complex data (e.g. rank electrical equipment overloads using a side array of max ratings). Also, external sort uses disk workfiles, and a split/merge strategy. Internal sort can use better methods.
Paul_Pedant over 4 years

Oh, I didn't post it. I asserted its existence, and left it as an exercise for the reader.
Paul_Pedant over 4 years

Posted the code and test on Jan 9th, 2020
Joe Skora over 3 years

@EdMorton @terdon I use this awk sort to sort body lines while leaving the header alone. `echo -e "HEADER\nline3\nline1\nline2" | awk 'NR<=1 {print} NR > 1 {print | "sort" }'`
terdon over 3 years

@JoeSkora isn't it easier to use a subshell? (printf 'HEADER'; printf '\nline3\nline1\nline2\n' | sort ) > file. Or, when sorting a file: ( head -n1 file; tail -n+2 file | sort) > newfile.
Ed Morton over 3 years

@JoeSkora you don't need to spawn a subshell from awk and then hope that the buffering from all concerned leads to the output from the subshell getting to stdout after the rest of the output from the awk command instead of before it or, if applicable, in the middle of it. Just do awk '{print (NR>1), $0}' | sort -k1,1n -k2 | cut -d' ' -f2-
Joe Skora over 3 years

@terdon I often use this across multiple SSH hops, so it has to be a single command stream.
Joe Skora over 3 years

@EdMorton I like printing the conditional, great idea. The last part can be simplified further leaving this. awk '{print (NR>1),$0}' | sort ... | cut -c3-.
Ed Morton over 3 years

@JoeSkora that's true, there's just usually other fields to operate on too, e.g. also printing NR so you can retain original order when there's duplicate key values and you don't have GNU sort for -s so it's usually more like awk '{print (NR>1), NR, $0}' | sort -k1,1n -k3 -k2,2n | cut -d' ' -f3-
Nabheet about 2 years

I know that the documentation says this but this doesn't seem to be working on my CentOS6 (because reasons) server. And also on my MacBook (which I assume doesn't have GNU awk).
Angel Todorov about 2 years

MacOS does not ship with GNU awk, but you can install it easily with Homebrew. For CentOS, what version is installed (gawk --version)?
Nabheet about 2 years

# gawk --version GNU Awk 3.1.7 Copyright (C) 1989, 1991-2009 Free Software Foundation. I assume these cool features are in gawk version 4?
Angel Todorov about 2 years

Yeah, looking at the NOTES in git.savannah.gnu.org/cgit/gawk.git/tree it looks like this feature was introduced in v4.0. I guess you have to call out to sort.