How to get unique lines based on value of a column


Solution 1

With awk:

awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' file file

(the filename is passed twice).
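
With the sample input from the question saved as file, the first pass counts how often each value of column one occurs, and the second pass prints only the lines whose count is 1:

$ awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' file file
B 17
D 344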

Edit: If the file comes from stdin you need a temporary copy. Something like this:

tmp="$( mktemp -t "${0##*/}"_"$$"_.XXXXXXXX )" && \
    trap 'rm -f "$tmp"' 0 HUP INT QUIT TERM || exit 1
... | tee "$tmp" | awk '...' - "$tmp"
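
For illustration, a sketch with cat file standing in for whatever command actually produces the data on stdin, and the awk from above:

tmp="$( mktemp -t "${0##*/}"_"$$"_.XXXXXXXX )" && \
    trap 'rm -f "$tmp"' 0 HUP INT QUIT TERM || exit 1
cat file | tee "$tmp" | awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' - "$tmp"

tee copies the stream into the temporary file while awk reads the stream itself as its first pass (-) and the temporary file as its second.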

Solution 2

If you don't mind scrambling the order, then

sort <file> | uniq -uw 1

See man uniq for more information, but here are the important parts.

   -u, --unique
          only print unique lines
   -w, --check-chars=N
          compare no more than N characters in lines
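
With the question's data the result comes out in sorted rather than original order (which here happens to coincide); note that -w 1 is enough only because the first column is a single character:

$ sort file | uniq -uw 1
B 17
D 344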

Solution 3

If you'd like awk

awk '
    $1 in ARR{
        ARR[$1] = RS;
        next;
    }
    {
        ARR[$1] = $0;
    }
    END{
        for(i in ARR)
            if(ARR[i] != RS)
                print ARR[i];
    }
    ' file

The script puts each line into the array ARR, using the 1st field as the index and the whole line as the value. If the array already has an entry for that index, the value is replaced with RS (a newline), which serves as a "seen more than once" marker. After the input ends, the END block prints only those array elements whose value does not equal RS.
Be aware that awk's RS variable equals a newline by default.
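
With the sample data from the question, ARR conceptually ends up as:

ARR["A"] = "\n"     (seen more than once)
ARR["B"] = "B 17"
ARR["C"] = "\n"     (seen more than once)
ARR["D"] = "D 344"

so the END block prints B 17 and D 344, although for (i in ARR) walks the indices in an unspecified order, so the two lines may appear either way round.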

Or you can do it by sed

sort file |
sed '
    :a;
    $!N;
    # when two consecutive lines share the same first field, collapse
    # them into that field followed by \a (BEL), used as a duplicate marker
    s/\(\S\+\s\).*\n\1.*/\1\a/;
    ta;
    # print the first line of the pattern space only if it carries no marker
    /\a/!P;
    D;
    '

Solution 4

$ cut -d' ' -f1 <file | sort | uniq -d | sed 's/^/^/' | grep -v -f /dev/stdin file
B 17
D 344

This first picks out the duplicated entries in the first column of the file file by cutting the column out, sorting it and feeding it to uniq -d (which will only report duplicates).
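
With the sample data, that first stage produces:

$ cut -d' ' -f1 <file | sort | uniq -d
A
C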

It then prefixes each resulting line with ^ to create regular expressions that are anchored to the beginning of the line. The output of the sed command with the given data is

^A
^C

The final grep reads these regular expressions and picks out all lines from the file that do not match any of them. We get grep to read the patterns from sed by using -f /dev/stdin.

The result will have the same order as in the original file.

Solution 5

perl -lane '
   exists $h{$F[0]} and undef $h{$F[0]},next;

   ( $h{$F[0]}, $h[@h] ) = ( $_, $F[0] );

   END{ print $h{$_} for grep { defined $h{$_} } @h }
' yourfile

The code checks whether the 1st field has been encountered before: if it has, a key by that name already exists in the hash, so we undef the value for that key, as there's no point in keeping a line that will be discarded at the end anyway. We carry the same information with a smaller memory footprint.

When the 1st field is seen for the very first time, we store the current line in the hash %h and simultaneously append that key to the array @h. This step retains the order in which the keys were encountered; if we don't care about the order, we can do away with it.

Finally, when all input has been digested, the END block loops over the elements of the array @h and fishes out only those for which the hash %h has a defined value. Remember, an undef value means the key was seen more than once.
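
Run as a one-liner against the question's sample data (saved as yourfile), it should print the surviving lines in their original order:

$ perl -lane 'exists $h{$F[0]} and undef $h{$F[0]}, next; ( $h{$F[0]}, $h[@h] ) = ( $_, $F[0] ); END{ print $h{$_} for grep { defined $h{$_} } @h }' yourfile
B 17
D 344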

Author: Michael

Updated on September 18, 2022

Comments

  • Michael
    Michael over 1 year

    Following input:

    A 13
    A 12
    B 17
    C 33
    D 344
    C 24
    A 5
    C 99
    

    I want to get only the lines where column one is unique:

    B 17
    D 344
    

    A solution with awk would be nice, but something else is acceptable as well.

  • Tigger
    Tigger about 7 years
    Forgot about -w for uniq. This would be the way to go.
  • Sparhawk
    Sparhawk about 7 years
    @Tigger One possible caveat is if the first field is not always the same number of characters long. Then it might need something more sophisticated. It's also odd that uniq has -s and -w for skipping and comparing characters, but only -f for skipping fields, and nothing analogous for comparing only the first N fields.
  • Sparhawk
    Sparhawk about 7 years
    Nice (+1)! An explanation would be great though.
  • Michael
    Michael about 7 years
    This seems to be the solution I'll go for. Follow-up question: how do I feed the stdout of the preceding command to awk twice (like passing the same file two times)?
  • Admin
    Admin about 7 years
    There's a slight problem if we had a line with the 2nd field as "r" and this line appeared only once in the file. Then that line would be discarded by the awk code, but it shouldn't be, per the spec.
  • Satō Katsura
    Satō Katsura about 7 years
    @Michael I updated my answer.
  • Kusalananda
    Kusalananda about 7 years
    GNU coreutils' uniq has -w. Others do not.
  • Satō Katsura
    Satō Katsura about 7 years
    For what it's worth: GNU grep accepts -f - instead of -f /dev/stdin.
  • Kusalananda
    Kusalananda about 7 years
    @SatoKatsura It may do, but my grep doesn't.
  • Satō Katsura
    Satō Katsura about 7 years
    Yup, BSD grep doesn't.
  • Costas
    Costas about 7 years
    @RakeshSharma Have edited to avoid the noted issue.