How to get unique lines based on value of a column


Solution 1

With awk:

awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' file file

(the filename is passed twice).
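
With the sample input from the question saved as file, the first pass counts how often each value of column one occurs, and the second pass prints only the lines whose count is 1:

$ awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' file file
B 17
D 344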

Edit: If the file comes from stdin you need a temporary copy. Something like this:

tmp="$( mktemp -t "${0##*/}"_"$$"_.XXXXXXXX )" && \
    trap 'rm -f "$tmp"' 0 HUP INT QUIT TERM || exit 1
... | tee "$tmp" | awk '...' - "$tmp"
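
For illustration, a sketch with cat file standing in for whatever command actually produces the data on stdin, and the awk from above:

tmp="$( mktemp -t "${0##*/}"_"$$"_.XXXXXXXX )" && \
    trap 'rm -f "$tmp"' 0 HUP INT QUIT TERM || exit 1
cat file | tee "$tmp" | awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' - "$tmp"

tee copies the stream into the temporary file while awk reads the stream itself as its first pass (-) and the temporary file as its second.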

Solution 2

If you don't mind scrambling the order, then

sort <file> | uniq -uw 1

See man uniq for more information, but here are the important parts.

   -u, --unique
          only print unique lines
   -w, --check-chars=N
          compare no more than N characters in lines
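
With the question's data the result comes out in sorted rather than original order (which here happens to coincide); note that -w 1 is enough only because the first column is a single character:

$ sort file | uniq -uw 1
B 17
D 344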

Solution 3

If you'd like awk

awk '
    $1 in ARR{
        ARR[$1] = RS;
        next;
    }
    {
        ARR[$1] = $0;
    }
    END{
        for(i in ARR)
            if(ARR[i] != RS)
                print ARR[i];
    }
    ' file

The script puts each line into the array ARR, using the 1st field as the index and the whole line as the value. If the array already has an entry for that index, the value is replaced with RS (a newline), which serves as a "seen more than once" marker. After the input ends, the END block prints only those array elements whose value does not equal RS.
Be aware that awk's RS variable equals a newline by default.
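
With the sample data from the question, ARR conceptually ends up as:

ARR["A"] = "\n"     (seen more than once)
ARR["B"] = "B 17"
ARR["C"] = "\n"     (seen more than once)
ARR["D"] = "D 344"

so the END block prints B 17 and D 344, although for (i in ARR) walks the indices in an unspecified order, so the two lines may appear either way round.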

Or you can do it by sed

sort file |
sed '
    :a;
    $!N;
    # when two consecutive lines share the same first field, collapse
    # them into that field followed by \a (BEL), used as a duplicate marker
    s/\(\S\+\s\).*\n\1.*/\1\a/;
    ta;
    # print the first line of the pattern space only if it carries no marker
    /\a/!P;
    D;
    '

Solution 4

$ cut -d' ' -f1 <file | sort | uniq -d | sed 's/^/^/' | grep -v -f /dev/stdin file
B 17
D 344

This first picks out the duplicated entries in the first column of the file file by cutting the column out, sorting it and feeding it to uniq -d (which will only report duplicates).
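
With the sample data, that first stage produces:

$ cut -d' ' -f1 <file | sort | uniq -d
A
C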

It then prefixes each resulting line with ^ to create regular expressions that are anchored to the beginning of the line. The output of the sed command with the given data is

^A
^C

The final grep reads these regular expressions and picks out all lines from the file that do not match any of them. We get grep to read the patterns from sed by using -f /dev/stdin.

The result will have the same order as in the original file.

Solution 5

perl -lane '
   exists $h{$F[0]} and undef $h{$F[0]},next;

   ( $h{$F[0]}, $h[@h] ) = ( $_, $F[0] );

   END{ print $h{$_} for grep { defined $h{$_} } @h }
' yourfile

The code checks whether the 1st field has been encountered before: if it has, a key by that name already exists in the hash, so we undef the value for that key, as there's no point in keeping a line that will be discarded at the end anyway. We carry the same information with a smaller memory footprint.

When the 1st field is seen for the very first time, we store the current line in the hash %h and simultaneously append that key to the array @h. This step retains the order in which the keys were encountered; if we don't care about the order, we can do away with it.

Finally, when all input has been digested, the END block loops over the elements of the array @h and fishes out only those for which the hash %h has a defined value. Remember, an undef value means the key was seen more than once.
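
Run as a one-liner against the question's sample data (saved as yourfile), it should print the surviving lines in their original order:

$ perl -lane 'exists $h{$F[0]} and undef $h{$F[0]}, next; ( $h{$F[0]}, $h[@h] ) = ( $_, $F[0] ); END{ print $h{$_} for grep { defined $h{$_} } @h }' yourfile
B 17
D 344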

Author: Michael

Updated on September 18, 2022

Comments

  • Michael
    Michael over 1 year

    Following input:

    A 13
    A 12
    B 17
    C 33
    D 344
    C 24
    A 5
    C 99
    

    I want to get only the lines where column one is unique:

    B 17
    D 344
    

    A solution with awk would be nice, but something else is acceptable as well.

  • Tigger
    Tigger about 7 years
    Forgot about -w for uniq. This would be the way to go.
  • Sparhawk
    Sparhawk about 7 years
    @Tigger One possible caveat is if the first field is not always the same number of characters long. Then it might need something more sophisticated. It's also odd that uniq has -s and -w for skipping and comparing characters, but only -f for skipping fields, and nothing analogous for comparing only the first N fields.
  • Sparhawk
    Sparhawk about 7 years
    Nice (+1)! An explanation would be great though.
  • Michael
    Michael about 7 years
    This seems to be the solution I'll go for. Follow-up question: how do I feed the stdout of the preceding command to awk twice (like passing the same file two times)?
  • Admin
    Admin about 7 years
    There's a slight problem if we had a line with the 2nd field as "r" and this line appeared only once in the file. Then that line would be discarded by the awk code, but it shouldn't be, per the spec.
  • Satō Katsura
    Satō Katsura about 7 years
    @Michael I updated my answer.
  • Kusalananda
    Kusalananda about 7 years
    GNU coreutils' uniq has -w. Others do not.
  • Satō Katsura
    Satō Katsura about 7 years
    For what it's worth: GNU grep accepts -f - instead of -f /dev/stdin.
  • Kusalananda
    Kusalananda about 7 years
    @SatoKatsura It may do, but my grep doesn't.
  • Satō Katsura
    Satō Katsura about 7 years
    Yup, BSD grep doesn't.
  • Costas
    Costas about 7 years
    @RakeshSharma Have edited to avoid the noted issue.