How to get unique lines based on value of a column
Solution 1
With awk:
awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' file file
(the filename is passed twice).
Edit: if the input comes from stdin, you need a temporary copy. Something like this:
tmp="$( mktemp -t "${0##*/}"_"$$"_.XXXXXXXX )" && \
trap 'rm -f "$tmp"' 0 HUP INT QUIT TERM || exit 1
... | tee "$tmp" | awk '...' - "$tmp"
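A complete run on the question's sample data can be sketched like this (the filename file is just for illustration):

```shell
# Sample input from the question (the first column is the key).
cat > file <<'EOF'
A 13
A 12
B 17
C 33
D 344
C 24
A 5
C 99
EOF

# Pass 1 (NR==FNR): count how often each first field occurs.
# Pass 2: print only lines whose first field occurred exactly once.
awk 'NR==FNR { a[$1]++ } NR!=FNR && a[$1]==1' file file
# Prints, preserving the original order:
# B 17
# D 344
```

Note that this preserves the input order, since the second pass emits qualifying lines as it reads them.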
Solution 2
If you don't mind scrambling the order, then
sort <file> | uniq -uw 1
See man uniq for more information, but here are the important parts.
-u, --unique
only print unique lines
-w, --check-chars=N
compare no more than N characters in lines
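Note that -w 1 compares only the first character, which happens to suffice for the single-letter keys in the question; -w is also a GNU extension (see the comments below). A minimal sketch of both the happy case and the collision caveat:

```shell
# Works here because every key is exactly one character long.
printf '%s\n' 'A 13' 'A 12' 'B 17' 'C 33' 'D 344' 'C 24' 'A 5' 'C 99' |
sort | uniq -u -w 1
# Prints (in sorted, not original, order):
# B 17
# D 344

# Caveat: distinct keys sharing their first character collide under -w 1,
# so two genuinely unique keys are both suppressed here:
printf '%s\n' 'AB 1' 'AC 2' | sort | uniq -u -w 1
# Prints nothing.
```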
Solution 3
If you'd like awk:
awk '
$1 in ARR{
ARR[$1] = RS;
next;
}
{
ARR[$1] = $0;
}
END{
for(i in ARR)
if(ARR[i] != RS)
print ARR[i];
}
' file
The script puts each line into the array ARR, using the first field as the index and the full line as the value. If the array already has an entry at that index, the value is replaced with RS (a newline). After the input ends, the END block prints those array elements whose value is not a newline.
Note that awk's RS variable is a newline by default.
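Since this variant reads its input only once, it also works directly on a pipe (unlike Solution 1, which needs the filename twice); the trade-off is that for (i in ARR) visits keys in an unspecified order:

```shell
printf '%s\n' 'A 13' 'A 12' 'B 17' 'C 33' 'D 344' 'C 24' 'A 5' 'C 99' |
awk '
  $1 in ARR { ARR[$1] = RS; next }  # repeated key: overwrite with the newline marker
  { ARR[$1] = $0 }                  # first sighting: remember the whole line
  END { for (i in ARR) if (ARR[i] != RS) print ARR[i] }
'
# Prints "B 17" and "D 344", in an unspecified order.
```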
Or you can do it with sed:
sort file |
sed '
:a;
$!N;
s/\(\S\+\s\).*\n\1.*/\1\a/;
ta;
/\a/!P;
D;
'
The substitution collapses each run of adjacent lines sharing a first field into a single line marked with \a (BEL); /\a/!P then prints only the unmarked, i.e. unique, lines, and D restarts the cycle with the remaining input.
Solution 4
$ cut -d' ' -f1 <file | sort | uniq -d | sed 's/^/^/' | grep -v -f /dev/stdin file
B 17
D 344
This first picks out the duplicated entries in the first column of the file file by cutting the column out, sorting it, and feeding it to uniq -d (which reports only duplicates). It then prefixes each resulting line with ^ to create regular expressions anchored to the beginning of the line. The output of the sed command with the given data is
^A
^C
The final grep reads these regular expressions and picks out all lines from the file that do not match any of them. We get grep to read the patterns from sed by using -f /dev/stdin.
The result will have the same order as in the original file.
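A complete run, assuming the sample data sits in file and the columns are separated by a single space (which cut -d' ' depends on):

```shell
# Sample input from the question.
cat > file <<'EOF'
A 13
A 12
B 17
C 33
D 344
C 24
A 5
C 99
EOF

# Duplicated keys -> anchored regexes -> filter them out of the original file.
cut -d' ' -f1 file | sort | uniq -d | sed 's/^/^/' | grep -v -f /dev/stdin file
# Prints, preserving the original order:
# B 17
# D 344
```

One caveat: the patterns are anchored only at the start of the line, so a duplicated key that is a prefix of another key (say A versus AB) would also knock out the longer key's lines; appending the separator to each pattern, e.g. sed 's/^/^/; s/$/ /', would tighten that.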
Solution 5
perl -lane '
exists $h{$F[0]} and undef $h{$F[0]},next;
( $h{$F[0]}, $h[@h] ) = ( $_, $F[0] );
END{ print $h{$_} for grep { defined $h{$_} } @h }
' yourfile
The code first checks whether the 1st field has been encountered before: if so, the key already exists in the hash, so we undef its value and move on. There's no point in keeping a line that will be discarded at the end anyway; storing undef carries the same information with a smaller memory footprint.
When the 1st field is seen for the very first time, we store the current line in the hash %h and simultaneously append the key to the array @h. This step preserves the order in which the keys were encountered; if order doesn't matter, it can be dropped.
Finally, once all input has been digested, the END block loops over the elements of @h and prints only those entries whose value in %h is still defined. Remember, an undef value means the key was seen more than once.
Michael
Updated on September 18, 2022

Comments
-
Michael over 1 year
Following input:
A 13
A 12
B 17
C 33
D 344
C 24
A 5
C 99
I want to get only the lines where column one is unique:
B 17
D 344
A solution with awk would be nice, but something else is acceptable as well.
-
Tigger about 7 years
Forgot about -w for uniq. This would be the way to go.
-
Sparhawk about 7 years
@Tigger One possible caveat is if the first field is not always the same number of characters long. Then it might need something more sophisticated. It's also odd that uniq has -s and -w, but only -f, and nothing analogous to only check the first N fields.
-
Sparhawk about 7 years
Nice (+1)! An explanation would be great though.
-
Michael about 7 years
This seems to be the solution I'll go for. Follow-up question: how do I feed stdout from the preceding command to awk twice (like passing the same file twice)?
-
Admin about 7 years
There's a slight problem if we had a line with the 2nd field as "r" and this line appeared only once in the file. Then it would be discarded by the awk code, but it shouldn't be, per spec.
-
Satō Katsura about 7 years
@Michael I updated my answer.
-
Kusalananda about 7 years
GNU coreutils' uniq has -w. Others do not.
-
Satō Katsura about 7 years
For what it's worth: GNU grep accepts -f - instead of -f /dev/stdin.
-
Kusalananda about 7 years
@SatoKatsura It may do, but my grep doesn't.
-
Satō Katsura about 7 years
Yup, BSD grep doesn't.
-
Costas about 7 years
@RakeshSharma Have edited to avoid noted