Is there a way to 'uniq' by column?


Solution 1

sort -u -t, -k1,1 file
  • -u for unique
  • -t, so comma is the delimiter
  • -k1,1 to use field 1 only as the key

Test result:

[email protected],2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
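
Running the same command on the sample data from the question (assuming it is saved as file) keeps the first occurrence of each address; the relative order of the two addresses in the output may vary with your locale's collation:

sort -u -t, -k1,1 file
[email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1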

Solution 2

awk -F"," '!_[$1]++' file
  • -F sets the field separator.
  • $1 is the first field.
  • _[val] looks up val in the hash _ (a regular variable).
  • ++ increments and returns the old value.
  • ! is logical not, so the expression is true only the first time a given $1 is seen.
  • there is an implicit print of the whole line at the end, as the example below shows.
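
For example, with the sample data from the question saved as file, this keeps the first line seen for each address and, unlike sort, preserves the original line order:

awk -F"," '!_[$1]++' file
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1
[email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0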

Solution 3

To consider multiple columns, sort and give a unique list based on column 1 and column 3:

sort -u -t : -k 1,1 -k 3,3 test.txt
  • -t : colon is separator
  • -k 1,1 -k 3,3 based on column 1 and column 3
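
As a quick illustration, given a hypothetical colon-separated test.txt such as

alice:10:red
alice:20:red
alice:30:blue

the command keeps one line per (column 1, column 3) combination, the first one seen for each pair when using GNU sort:

sort -u -t : -k 1,1 -k 3,3 test.txt
alice:30:blue
alice:10:red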

Solution 4

If you want to use uniq:

<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2

Here tr squeezes each comma into a space (the timestamp already contains one, so the date and time become separate fields), awk reorders what is left so the e-mail address ends up last (the host and IP columns are dropped), and uniq -c -f2 skips the first two fields when comparing and prefixes each kept line with a count of its duplicates. This gives:

1 01:05:47.893000000 2009-11-27 [email protected]
2 00:58:29.793000000 2009-11-27 [email protected]
1
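
To see how uniq -f skips leading fields, here is a tiny, hypothetical example: with -f1 the first field is ignored, so the first two lines compare equal and only the first of them is kept:

printf 'a x\nb x\nb y\n' | uniq -f1
a x
b y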

Solution 5

If you want to retain the last one of the duplicates, you could use

tac a.csv | sort -u -t, -r -k1,1 | tac

which was my requirement here.

tac reverses the file line by line, so for each address the last occurrence in the original file is the one sort -u keeps; the final tac turns the reverse-sorted result back into ascending order.
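
Assuming the question's sample rows are saved as a.csv, the line kept for [email protected] is now the last occurrence in the file (the 256.255.255.0 row) rather than the first; the relative order of the two addresses again depends on your locale's collation:

tac a.csv | sort -u -t, -r -k1,1 | tac
[email protected],2009-11-27 00:58:29.646465785,example.net,256.255.255.0
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1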


Author: Eno

Updated on April 27, 2020

Comments

  • Eno
    Eno about 4 years

    I have a .csv file like this:

    [email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1
    [email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0
    [email protected],2009-11-27 00:58:29.646465785,example.net,256.255.255.0
    ...
    

    I have to remove duplicate e-mails (the entire line) from the file (i.e. one of the lines containing [email protected] in the above example). How do I use uniq on only field 1 (separated by commas)? According to man, uniq doesn't have options for columns.

    I tried something with sort | uniq but it doesn't work.

  • Carl Smotricz
    Carl Smotricz over 14 years
    I'd like to point out a possible simplification: You can dump the cat! Rather than piping into tr, just let tr read the file using <. Piping through cat is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had.
  • Javid Jamae
    Javid Jamae over 9 years
    This isn't unique by column as asked for in the question. This is just unique for the entire line. Also, you don't have to do a sort to do a uniq. The two are mutually exclusive.
  • Mikael S
    Mikael S over 9 years
    Yes, you are right. The last example does what the question asked for though, even though the accepted answer is a lot cleaner. Regarding sort, then uniq, sort needs to be done before doing uniq otherwise it doesn't work (but you can skip the second command and just use sort -u). From uniq(1): "Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output)."
  • hello_there_andy
    hello_there_andy over 9 years
    why do you need the ,1 in -k1,1? why not just -k1?
  • Serrano
    Serrano over 9 years
    @hello_there_andy: This is explained in the manual (man sort). It stands for the start and stop position.
  • Alex Bitek
    Alex Bitek about 9 years
    This approach is two times faster than sort
  • AffluentOwl
    AffluentOwl about 9 years
    This also has the additional benefit of keeping the lines in the original order!
  • Sukima
    Sukima over 8 years
    If you need the last uniq instead of the first then this awk script will help: awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }' file
  • ingyhere
    ingyhere over 8 years
    He does not want to purge lines, he wants to retain a single copy of a line with a specific string. Uniq is the right use case.
  • Geremia
    Geremia about 8 years
    How does it decide which line with a duplicate field to output? Is it the first occurrence of the duplicate before sorting?
  • Geremia
    Geremia about 8 years
    @CarlSmotricz: I tested it and it confirmed what sort's manpage says: "-u, --unique with -c, check for strict ordering; without -c, output only the first of an equal run." So, it is indeed "the first occurrence of the duplicate before sorting."
  • Soham Chowdhury
    Soham Chowdhury over 7 years
    @eshwar just add more fields to the dictionary index! For instance, !_[$1][$2]++ can be used to sort by the first two fields. My awk-fu isn't strong enough to be able to unique on a range of fields, though. :(
  • marek094
    marek094 almost 6 years
    Note: you'll want to use this solution if you are reading from stdin: <cmnd> | awk -F"," '!_[$1]++' -
  • rkachach
    rkachach about 5 years
    this changes the order of the lines as well, doesn't it?
  • rkachach
    rkachach about 5 years
    Brilliant! this option is better than the answer because it keeps the lines order
  • Corentin Limier
    Corentin Limier almost 5 years
    This solution has pros but also cons : it keeps the first column in memory for each line : could be greedier than sort -u for big files.
  • Hielke Walinga
    Hielke Walinga almost 5 years
    The reversing of fields can be simplified with rev.
  • Max Waterman
    Max Waterman over 4 years
    It does answer the specific question, but the title doesn't reflect that - ie there are other options to 'uniq' that 'sort -u' doesn't apply to - eg simply reporting which lines are duplicated (and not produce output for lines that are unique). I wonder why 'uniq' has a '--skip-fields=N' option, but does not have an option to select which field to compare...it seems like an obvious thing to have.
  • Fixee
    Fixee over 2 years
    @HielkeWalinga I thought rev reverses the characters in each line, not fields?!
  • Hielke Walinga
    Hielke Walinga over 2 years
    @Fixee Yes but in that way also the order of the fields, and it doesn't matter for the uniqueness the fields that the characters are reversed. So like this: <mycsv.cvs tr -s , ' ' | rev | uniq -f 4 | rev