Is there a way to 'uniq' by column?


Solution 1

sort -u -t, -k1,1 file
  • -u for unique
  • -t, so comma is the delimiter
  • -k1,1 to use field 1 only as the key

Test result:

[email protected],2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
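
Running the same command on the sample data from the question (assuming it is saved as file) keeps the first occurrence of each address; the relative order of the two addresses in the output may vary with your locale's collation:

sort -u -t, -k1,1 file
[email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1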

Solution 2

awk -F"," '!_[$1]++' file
  • -F sets the field separator.
  • $1 is the first field.
  • _[val] looks up val in the hash _ (a regular variable).
  • ++ increments and returns the old value.
  • ! is logical not, so the expression is true only the first time a given $1 is seen.
  • there is an implicit print of the whole line at the end, as the example below shows.
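
For example, with the sample data from the question saved as file, this keeps the first line seen for each address and, unlike sort, preserves the original line order:

awk -F"," '!_[$1]++' file
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1
[email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0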

Solution 3

To consider multiple columns, sort and give a unique list based on column 1 and column 3:

sort -u -t : -k 1,1 -k 3,3 test.txt
  • -t : colon is separator
  • -k 1,1 -k 3,3 based on column 1 and column 3
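
As a quick illustration, given a hypothetical colon-separated test.txt such as

alice:10:red
alice:20:red
alice:30:blue

the command keeps one line per (column 1, column 3) combination, the first one seen for each pair when using GNU sort:

sort -u -t : -k 1,1 -k 3,3 test.txt
alice:30:blue
alice:10:red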

Solution 4

If you want to use uniq:

<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2

Here tr squeezes each comma into a space (the timestamp already contains one, so the date and time become separate fields), awk reorders what is left so the e-mail address ends up last (the host and IP columns are dropped), and uniq -c -f2 skips the first two fields when comparing and prefixes each kept line with a count of its duplicates. This gives:

1 01:05:47.893000000 2009-11-27 [email protected]
2 00:58:29.793000000 2009-11-27 [email protected]
1
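
To see how uniq -f skips leading fields, here is a tiny, hypothetical example: with -f1 the first field is ignored, so the first two lines compare equal and only the first of them is kept:

printf 'a x\nb x\nb y\n' | uniq -f1
a x
b y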

Solution 5

If you want to retain the last one of the duplicates, you could use

tac a.csv | sort -u -t, -r -k1,1 | tac

which was my requirement here.

tac reverses the file line by line, so for each address the last occurrence in the original file is the one sort -u keeps; the final tac turns the reverse-sorted result back into ascending order.
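
Assuming the question's sample rows are saved as a.csv, the line kept for [email protected] is now the last occurrence in the file (the 256.255.255.0 row) rather than the first; the relative order of the two addresses again depends on your locale's collation:

tac a.csv | sort -u -t, -r -k1,1 | tac
[email protected],2009-11-27 00:58:29.646465785,example.net,256.255.255.0
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1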


Author: Eno

Updated on April 27, 2020

Comments

  • Eno
    Eno about 4 years

    I have a .csv file like this:

    [email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1
    [email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0
    [email protected],2009-11-27 00:58:29.646465785,example.net,256.255.255.0
    ...
    

    I have to remove duplicate e-mails (the entire line) from the file (i.e. one of the lines containing [email protected] in the above example). How do I use uniq on only field 1 (separated by commas)? According to man, uniq doesn't have options for columns.

    I tried something with sort | uniq but it doesn't work.

  • Carl Smotricz
    Carl Smotricz over 14 years
    I'd like to point out a possible simplification: You can dump the cat! Rather than piping into tr, just let tr read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had.
  • Javid Jamae
    Javid Jamae over 9 years
    This isn't unique by column as asked for in the question. This is just unique for the entire line. Also, you don't have to do a sort to do a uniq. The two are mutually exclusive.
  • Mikael S
    Mikael S over 9 years
    Yes, you are right. The last example does what the question asked for though, even though the accepted answer is a lot cleaner. Regarding sort, then uniq, sort needs to be done before doing uniq otherwise it doesn't work (but you can skip the second command and just use sort -u). From uniq(1): "Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output)."
  • hello_there_andy
    hello_there_andy over 9 years
    why do you need the ,1 in -k1,1? why not just -k1?
  • Serrano
    Serrano over 9 years
    @hello_there_andy: This is explained in the manual (man sort). It stands for the start and stop position.
  • Alex Bitek
    Alex Bitek about 9 years
    This approach is two times faster than sort
  • AffluentOwl
    AffluentOwl about 9 years
    This also has the additional benefit of keeping the lines in the original order!
  • Sukima
    Sukima over 8 years
    If you need the last uniq instead of the first then this awk script will help: awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }' file
  • ingyhere
    ingyhere over 8 years
    He does not want to purge lines, he wants to retain a single copy of a line with a specific string. Uniq is the right use case.
  • Geremia
    Geremia about 8 years
    How does it decide which line with a duplicate field to output? Is it the first occurrence of the duplicate before sorting?
  • Geremia
    Geremia about 8 years
    @CarlSmotricz: I tested it and it confirmed what sort's manpage says: "-u, --unique with -c, check for strict ordering; without -c, output only the first of an equal run." So, it is indeed "the first occurrence of the duplicate before sorting."
  • Soham Chowdhury
    Soham Chowdhury over 7 years
    @eshwar just add more fields to the dictionary index! For instance, !_[$1][$2]++ can be used to sort by the first two fields. My awk-fu isn't strong enough to be able to unique on a range of fields, though. :(
  • marek094
    marek094 almost 6 years
    Note: you'll want to use this solution if you are reading from stdin: <cmnd> | awk -F"," '!_[$1]++' -
  • rkachach
    rkachach about 5 years
    this changes the order of the lines as well, doesn't it?
  • rkachach
    rkachach about 5 years
    Brilliant! this option is better than the answer because it keeps the lines order
  • Corentin Limier
    Corentin Limier almost 5 years
    This solution has pros but also cons : it keeps the first column in memory for each line : could be greedier than sort -u for big files.
  • Hielke Walinga
    Hielke Walinga almost 5 years
    The reversing of fields can be simplified with rev.
  • Max Waterman
    Max Waterman over 4 years
    It does answer the specific question, but the title doesn't reflect that - ie there are other options to 'uniq' that 'sort -u' doesn't apply to - eg simply reporting which lines are duplicated (and not produce output for lines that are unique). I wonder why 'uniq' has a '--skip-fields=N' option, but does not have an option to select which field to compare...it seems like an obvious thing to have.
  • Fixee
    Fixee over 2 years
    @HielkeWalinga I thought rev reverses the characters in each line, not fields?!
  • Hielke Walinga
    Hielke Walinga over 2 years
    @Fixee Yes but in that way also the order of the fields, and it doesn't matter for the uniqueness the fields that the characters are reversed. So like this: <mycsv.cvs tr -s , ' ' | rev | uniq -f 4 | rev