Shell script to sort by date column using Linux command 'sort'
$ sort -k 3,3r session.log | awk '!seen[$2]++ { print }'
9 u1 2018-08-12 pass
6 u4 2018-07-11 pass
2 u2 2018-06-15 pass
3 u3 2018-05-18 pass
(The { print } can be removed completely; I'm only including it to show what happens. awk's default action is to print the entire input record when the condition is true.)
This sorts your file by decreasing dates (the most recent date first). The awk program reads the sorted data and prints the first entry found for each user. The seen variable is an associative array/hash that uses the usernames as keys. If its value is zero for a user, that user has not yet been seen, and so the line from the file is printed.
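The same bookkeeping idea also works without any pre-sorting: instead of relying on sort order, let awk remember the newest date seen so far for each user. This is a sketch of that single-pass alternative, assuming yyyy-mm-dd dates (which compare correctly as plain strings); the sample data is copied from the question so the snippet runs standalone:

```shell
# Sample data from the question, so the sketch is self-contained:
cat > session.log <<'EOF'
1 u1 2018-05-19 pass
2 u2 2018-06-15 pass
3 u3 2018-05-18 pass
4 u4 2018-05-17 pass
5 u2 2018-05-14 pass
6 u4 2018-07-11 pass
7 u1 2018-05-16 pass
8 u3 2018-05-13 pass
9 u1 2018-08-12 pass
10 u1 2018-05-10 pass
EOF

# One pass: remember the newest date, and its whole line, per user.
# String comparison works here because yyyy-mm-dd sorts like dates do.
awk '$3 > latest[$2] { latest[$2] = $3; line[$2] = $0 }
     END { for (u in line) print line[u] }' session.log |
sort -k2,2    # for-in order is unspecified in awk, so order by username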
Your code with my annotations:
# get list of unique usernames from log:
( awk {print} session.log | cut -f2 -d' ' | sort | uniq ) > store.txt
# shorter equivalent:
# cut -d ' ' -f2 <session.log | sort -u >store.txt
# loop over the usernames:
for line in $(cat "store.txt")
do
# get entries related to this user (note: substring match, not an exact username match):
grep "$line" session.log > "$line.txt"
# echo the last entry:
echo $(sort -k3 "$line.txt" | awk 'END{print}')
# shorter equivalent of both of the above commands:
# awk -v user="$line" '$2 == user { print }' session.log | sort -k3,3 | tail -n 1
done
rm -f store.txt
So an alternative approach based on your shell loop:
cut -d ' ' -f2 <session.log | sort -u |
while read username; do
awk -v user="$username" '$2 == user { print }' session.log | sort -k 3,3 | tail -n 1
done
Again, the { print } may be left out of the awk script above.
This still sorts a subset of the session log once for each user, which is wasteful.
Pre-sorting the log by decreasing dates:
sort -k3,3r -o sorted.log session.log
cut -d ' ' -f2 <session.log | sort -u |
while read username; do
awk -v user="$username" '$2 == user { print; exit }' sorted.log
done
rm sorted.log
This still requires us to scan the log from the top once per user. The natural improvement is to let awk keep track of which users have been seen and which have not, which gives you my answer at the top.
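As a final aside, if you have GNU sort, the deduplication step can be done by sort itself: GNU sort's -u option outputs only the first line of each run of equal keys and disables the last-resort whole-line comparison, so the newest entry per user survives. This is GNU-specific behavior (POSIX does not say which of the equal lines is kept), so treat this as a non-portable sketch; the sample data is the question's:

```shell
# Sample data from the question, so the sketch is self-contained:
cat > session.log <<'EOF'
1 u1 2018-05-19 pass
2 u2 2018-06-15 pass
3 u3 2018-05-18 pass
4 u4 2018-05-17 pass
5 u2 2018-05-14 pass
6 u4 2018-07-11 pass
7 u1 2018-05-16 pass
8 u3 2018-05-13 pass
9 u1 2018-08-12 pass
10 u1 2018-05-10 pass
EOF

# Newest-first by date, then keep only the first line seen for each
# username (GNU sort -u keeps the first of an equal run).
sort -k3,3r session.log | sort -u -k2,2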
Dipankar
Updated on September 18, 2022

Comments
Dipankar:
I have a file called session.log containing some text data. The 1st column contains a serial number, the 2nd column a username, the 3rd column the last logged-in date, and the 4th column a result.
A user may have logged in multiple times. I need to find the last logged-in date for each unique user, so I wrote a shell script. The output should display the line with the latest logged-in date (3rd column) for each unique user (2nd column), and should not contain the same username multiple times.
$ cat session.log
1 u1 2018-05-19 pass
2 u2 2018-06-15 pass
3 u3 2018-05-18 pass
4 u4 2018-05-17 pass
5 u2 2018-05-14 pass
6 u4 2018-07-11 pass
7 u1 2018-05-16 pass
8 u3 2018-05-13 pass
9 u1 2018-08-12 pass
10 u1 2018-05-10 pass
What I tried:
( awk {print} session.log | cut -f2 -d' ' | sort | uniq ) > store.txt
for line in $(cat "store.txt")
do
touch "$line.txt"
grep "$line" session.log > "$line.txt"
temp=$(sort -k3 "$line.txt" | awk 'END{print}')
echo $temp > "$line.txt"
cat "$line.txt"
done
Output
$ ./sort.sh
9 u1 2018-08-12 pass
2 u2 2018-06-15 pass
3 u3 2018-05-18 pass
6 u4 2018-07-11 pass
The shell script works for the date formats (yyyy-mm-dd) and (yyyy/mm/dd). Is there a better way to do the same work? How do we do this using awk?

EDIT:
$ cat sort.sh
( awk {print} session.log | cut -f2 -d' ' | sort | uniq ) > store.txt
for line in $(cat "store.txt")
do
#touch "$line.txt"
grep "$line" session.log > "$line.txt"
echo $(sort -k3 "$line.txt" | awk 'END{print}')
#temp=$(sort -k3 "$line.txt" | awk 'END{print}')
#echo $temp > "$line.txt"
#cat "$line.txt"
done
rm -f store.txt