sort | uniq | xargs grep ... where lines contain spaces


Solution 1

  1. sort -k5,6 will do the sort on the timestamp fields and avoid the cut (the space inside the stamp makes the date field 5 and the time field 6);
  2. uniq -f 4 will ignore the first 4 fields for the uniq;
  3. plus a -D on the uniq will get you all of the repeated lines (vs. -d, which gets you just one line per group);
  4. but sort and uniq split fields on whitespace, not commas, so tr ',' '\t' first to fix that.

The problem is if you have fields after the timestamp that differ. Are your dates all the same length? If so, you might be able to add a -w 17 (to include the time; the text uniq compares starts at the tab before the stamp) or a -w 11 (for just the date) to the uniq, so it stops comparing there.

So:

tr ',' '\t' < myfile.csv | sort -k5,6 | uniq -f 4 -D -w 17
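
If you need the matching rows back in comma-delimited form, translate the tabs back at the end of the pipe, as kmkaplan points out in the comments below (this assumes the data contains no tabs of its own):

tr ',' '\t' < myfile.csv | sort -k5,6 | uniq -f 4 -D -w 17 | tr '\t' ','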

Solution 2

The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:

tr '\n' '\000'

That gets you NUL-separated rows. sort, uniq and xargs all have options (-z, -z and -0 respectively) to handle that delimiter. Try something like:

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

Edit: the position of tr in the pipe was wrong.
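
To sanity-check the duplicate list itself before adding the grep stage, translate the NULs back into newlines at the end (a quick inspection variant of the same pipe):

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | tr '\000' '\n'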

Solution 3

You can tell xargs to treat each whole line as a single argument with its -d option (a GNU extension). Try:

cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
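
Note that this runs grep over the file once per duplicate date. With lots of duplicates, a single pass may be quicker: GNU grep can read the whole list as patterns from stdin via -f -, and -F makes it treat each line as a fixed string rather than a regex. A sketch of that variant:

cut -d, -f 5 myfile.csv | sort | uniq -d | grep -F -f - myfile.csv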

Solution 4

This is a good candidate for awk:

BEGIN { FS="," }
{ split($5, A, " "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
  1. Set the field separator to ',' (CSV).
  2. Split the fifth field on the space and stick the result in A (awk arrays from split start at index 1, so A[1] is the date).
  3. Concatenate the line number to the list of what we have already stored for that date.
  4. Print out the line numbers for each date.
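
To run it, save the program to a file and point awk at the CSV (dups.awk is just a placeholder name here):

awk -f dups.awk myfile.csv

Each output line is a date followed by the line numbers it was seen on, so any date listed with more than one line number is a duplicate.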

Comments

  • Sukotto (almost 2 years)

    I have a comma-delimited file "myfile.csv" where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).

    I'm using a bash shell via Cygwin on WinXP.

    $ cut -d, -f 5 myfile.csv | sort | uniq -d 
    

    correctly returns a list of the duplicate dates

    01/01/2005 00:22
    01/01/2005 00:37
    [snip]    
    02/29/2009 23:54
    

    But I cannot figure out how to feed this to grep to give me all the rows. Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d, but for some reason combining those flags causes uniq to (apparently) return nothing.

    So, given that

     $ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
    

    doesn't work... what can I do?

    I know that I could do this in Perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard command-line tools like sort, uniq, find, grep, cut, etc.

    Teach me, oh bash gurus. How can I get the list of rows I need using typical CLI tools?

  • kmkaplan (about 15 years)
    Yes, +1. And tr '\t' ',' at the end if the CSV format is important.
  • Felipe Alvarez (almost 13 years)
    tr '\n' '\000' is exactly what I was looking for.