awk determine unique rows based on subset of columns

9,328

Solution 1

To print a certain column, let's say column 1 and 4, simply run:

awk -F',' '{print $1, $4}' file

The command here is to ask awk to print the designated column separated by the ','

You can change the column number, add / remove the $NUM to get different columns.

Combined with sort, you will get the unique rows:

awk -F',' '{print $1, $4}' file | sort -u

The -u flag removes duplicate items from the output.

Solution 2

You can do this by combining the column-values in the hash key, e.g. assuming your input is sorted, this one-pass solution works for column 1-3:

awk '!h[$1,$2,$3]++ { NF--; print }' FS=, OFS=, data.csv

Output:

Col1,Col2,Col3
A,10,50
A,10,05
B,20,30
B,20,03
C,30,100
C,30,111
C,40,111
C,30,123

For columns 1 and 4, do something like this:

awk '!h[$1,$4]++ { print $1, $4 }' FS=, OFS=, data.csv

Output:

Col1,Col4
A,2017
B,2017
C,2017
C,2016
C,2015
Share:
9,328

Related videos on Youtube

conor
Author by

conor

Updated on September 18, 2022

Comments

  • conor
    conor over 1 year

    I'm looking to return rows in a csv that are unique for a specified subset of columns.

    Example data:

    Col1,Col2,Col3,Col4
    A,10,50,2017
    A,10,05,2017
    B,20,30,2017
    B,20,03,2017
    C,30,100,2017
    C,30,111,2017
    C,30,100,2016
    C,40,111,2016
    C,30,123,2015
    

    This command outputs the unique values of a single column (column 1 in this case):

    awk -F , '{ a[$1]++ } END { for (b in a) { print b } }' file
    

    returns

    Col1
    A
    B
    C
    

    I'd like a command to do this for 2 columns or n columns.

    Desired output for unique rows using 2 columns (Col1 and Col3):

    Col1,Col3
    A,50
    A,05
    B,30
    B,03
    C,100
    C,111
    C,123
    

    For Col1 and Col4

    Col1,Col4
    A,2017
    B,2017
    C,2017
    C,2016
    C,2015
    

    For 3 columns Col1, Col2, and Col3

    Col1,Col2,Col3
    A,10,50
    A,10,05
    B,20,30
    B,20,03
    C,30,100
    C,30,111
    C,40,111
    C,30,123
    
  • conor
    conor almost 7 years
    Works as requested and can be modified for n columns, e.g. awk -F',' '{print $1, $3, $4}' file | sort |uniq. Upvote cast but not displaying due to my low rep.
  • grooveplex
    grooveplex almost 7 years
    I believe you can also use sort -u instead of sort | uniq
  • P.-H. Lin
    P.-H. Lin almost 7 years
    @grooveplex oh right! Thanks for the reminder!