Is there a way to 'uniq' by column?
Solution 1
sort -u -t, -k1,1 file
-u for unique
-t, so comma is the delimiter
-k1,1 for the key field 1
Test result:
[email protected],2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
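As a rough illustration of why the key is written -k1,1 rather than -k1, the sketch below runs both forms on made-up data. The addresses, timestamps, and variable names are hypothetical and not taken from the question. With -k1 the key extends from field 1 to the end of the line, so lines that differ anywhere are all kept.

```shell
# Hypothetical sample data; the two alice lines share field 1 but differ elsewhere.
data=$(mktemp)
printf '%s\n' \
  'bob@example.com,2009-11-27 01:05:47,example.net,127.0.0.1' \
  'alice@example.com,2009-11-27 00:58:29,example.net,255.255.255.0' \
  'alice@example.com,2009-11-27 00:58:30,example.net,10.0.0.1' > "$data"

# Key is field 1 only: one line per e-mail address survives.
by_field=$(sort -u -t, -k1,1 "$data")

# Key runs from field 1 to the end of the line: the alice lines differ, so both survive.
by_line=$(sort -u -t, -k1 "$data")

printf 'with -k1,1:\n%s\n\nwith -k1:\n%s\n' "$by_field" "$by_line"
rm -f "$data"
```

Note that the output is sorted by the key, so the original line order is not preserved.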
Solution 2
awk -F"," '!_[$1]++' file
-F sets the field separator.
$1 is the first field.
_[val] looks up val in the hash _ (a regular variable).
++ increments, and returns the old value.
! returns logical not.
There is an implicit print at the end.
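The one-liner can be easier to follow when spelled out. The sketch below compares it with a longer equivalent on made-up data. The array name seen and the sample lines are assumptions for illustration only. Both forms print a line only the first time its field 1 value appears, so the first occurrence is kept and the input order is preserved.

```shell
# Hypothetical sample data; alice appears twice.
data=$(mktemp)
printf '%s\n' \
  'bob@example.com,2009-11-27 01:05:47,example.net,127.0.0.1' \
  'alice@example.com,2009-11-27 00:58:29,example.net,255.255.255.0' \
  'alice@example.com,2009-11-27 00:58:30,example.net,10.0.0.1' > "$data"

# Short form, as in the answer.
short=$(awk -F, '!_[$1]++' "$data")

# Same logic spelled out: print only if this key has not been counted yet,
# then count it.
long=$(awk -F, '{ if (seen[$1] == 0) print; seen[$1]++ }' "$data")

printf '%s\n' "$short"
rm -f "$data"
```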
Solution 3
To consider multiple columns:
Sort and give a unique list based on column 1 and column 3:
sort -u -t : -k 1,1 -k 3,3 test.txt
-t : colon is the separator
-k 1,1 -k 3,3 based on column 1 and column 3
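A small sketch of the two-key form on made-up colon-separated data (the file contents and variable names are hypothetical). Lines count as duplicates only when both column 1 and column 3 match, and column 2 is ignored for the comparison.

```shell
# Hypothetical data: a:1:x and a:2:x agree on columns 1 and 3.
data=$(mktemp)
printf '%s\n' 'a:1:x' 'a:2:x' 'a:3:y' 'b:4:x' > "$data"

# One line per distinct (column 1, column 3) pair.
result=$(sort -u -t : -k 1,1 -k 3,3 "$data")

printf '%s\n' "$result"
rm -f "$data"
```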
Solution 4
If you want to use uniq:
<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2
gives:
1 01:05:47.893000000 2009-11-27 [email protected]
2 00:58:29.793000000 2009-11-27 [email protected]
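To make the pipeline above concrete, here is a sketch on made-up data (addresses and the temporary file are hypothetical). It relies on two things: uniq -f2 skips the first two whitespace-separated fields when comparing, which is why the e-mail is moved to the end, and uniq only collapses adjacent lines, so the duplicates must already sit next to each other.

```shell
# Hypothetical sample data; the duplicate alice lines are adjacent.
data=$(mktemp)
printf '%s\n' \
  'bob@example.com,2009-11-27 01:05:47,example.net,127.0.0.1' \
  'alice@example.com,2009-11-27 00:58:29,example.net,255.255.255.0' \
  'alice@example.com,2009-11-27 00:58:30,example.net,10.0.0.1' > "$data"

# Commas become spaces, the e-mail is moved to the third position,
# and uniq compares everything after the first two fields.
result=$(<"$data" tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2)

printf '%s\n' "$result"
rm -f "$data"
```

If the input is not grouped by e-mail, a sort on that field would be needed before uniq.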
Solution 5
If you want to retain the last one of the duplicates, you could use
tac a.csv | sort -u -t, -r -k1,1 | tac
which was my requirement.
Here, tac reverses the file line by line.
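The same reverse, deduplicate, reverse idea can be sketched with awk in place of sort. This is a swapped-in variant, not the command from the answer. It keeps the last occurrence of each key and also leaves the surviving lines in their original order. The sample data and the array name seen are hypothetical, and tac is assumed to be available (it is part of GNU coreutils).

```shell
# Hypothetical sample data; the later alice line is the one to keep.
data=$(mktemp)
printf '%s\n' \
  'bob@example.com,2009-11-27 01:05:47,example.net,127.0.0.1' \
  'alice@example.com,2009-11-27 00:58:29,example.net,255.255.255.0' \
  'alice@example.com,2009-11-27 00:58:30,example.net,10.0.0.1' > "$data"

# Reverse the file, keep the first line seen per key (the last in the
# original), then reverse back to restore the original order.
result=$(tac "$data" | awk -F, '!seen[$1]++' | tac)

printf '%s\n' "$result"
rm -f "$data"
```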
Author: Eno
Updated on April 27, 2020
Comments
-
Eno about 4 years
I have a .csv file like this:
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1
[email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0
[email protected],2009-11-27 00:58:29.646465785,example.net,256.255.255.0
...
I have to remove duplicate e-mails (the entire line) from the file (i.e. one of the lines containing [email protected] in the above example). How do I use uniq on only field 1 (separated by commas)? According to man, uniq doesn't have options for columns.
I tried something with sort | uniq but it doesn't work.
-
Carl Smotricz over 14 years
I'd like to point out a possible simplification: You can dump the cat! Rather than piping into tr, just let tr read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had.
-
Javid Jamae over 9 years
This isn't unique by column as asked for in the question. This is just unique for the entire line. Also, you don't have to do a sort to do a uniq. The two are mutually exclusive.
-
Mikael S over 9 years
Yes, you are right. The last example does what the question asked for though, even though the accepted answer is a lot cleaner. Regarding sort, then uniq: sort needs to be done before doing uniq, otherwise it doesn't work (but you can skip the second command and just use sort -u). From uniq(1): "Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output)."
-
hello_there_andy over 9 years
Why do you need the ,1 in -k1,1? Why not just -k1?
-
Serrano over 9 years
@hello_there_andy: This is explained in the manual (man sort). It stands for the start and stop position.
-
Alex Bitek about 9 years
This approach is two times faster than sort
-
AffluentOwl about 9 years
This also has the additional benefit of keeping the lines in the original order!
-
Sukima over 8 years
If you need the last uniq instead of the first then this awk script will help:
awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }' file
-
ingyhere over 8 years
He does not want to purge lines, he wants to retain a single copy of a line with a specific string. Uniq is the right use case.
-
Geremia about 8 years
How does it decide which line with a duplicate field to output? Is it the first occurrence of the duplicate before sorting?
-
Geremia about 8 years
@CarlSmotricz: I tested it and it confirmed what sort's manpage says: "-u, --unique with -c, check for strict ordering; without -c, output only the first of an equal run." So, it is indeed "the first occurrence of the duplicate before sorting."
-
Soham Chowdhury over 7 years
@eshwar just add more fields to the dictionary index! For instance, !_[$1][$2]++ can be used to make lines unique by the first two fields. My awk-fu isn't strong enough to be able to unique on a range of fields, though. :(
-
marek094 almost 6 years
Note: you want to use this solution if you are reading from stdin:
<cmnd> | awk -F"," '!_[$1]++' -
-
rkachach about 5 years
This changes the order of the lines as well, doesn't it?
-
rkachach about 5 years
Brilliant! This option is better than the answer because it keeps the line order.
-
Corentin Limier almost 5 years
This solution has pros but also cons: it keeps the first column in memory for each line, so it could use more memory than sort -u for big files.
-
Hielke Walinga almost 5 years
The reversing of fields can be simplified with rev.
-
Max Waterman over 4 years
It does answer the specific question, but the title doesn't reflect that, i.e. there are other options to 'uniq' that 'sort -u' doesn't apply to, e.g. simply reporting which lines are duplicated (and not producing output for lines that are unique). I wonder why 'uniq' has a '--skip-fields=N' option, but does not have an option to select which field to compare... it seems like an obvious thing to have.
-
Fixee over 2 years
@HielkeWalinga I thought rev reverses the characters in each line, not fields?!
-
Hielke Walinga over 2 years
@Fixee Yes, but that also reverses the order of the fields, and for uniqueness it doesn't matter that the characters within each field are reversed. So like this:
<mycsv.cvs tr -s , ' ' | rev | uniq -f 4 | rev