sort | uniq | xargs grep ... where lines contain spaces
Solution 1
- sort -k5,5 will do the sort on fields and avoid the cut;
- uniq -f 4 will ignore the first 4 fields for the uniq;
- Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one);
- but uniq splits fields on blanks (spaces or tabs), not commas, so run tr ',' '\t' first to fix that.
Problem is if you have fields after #5 that are different. Are your dates all the same length? You might be able to add a -w 16 (to include time), or -w 10 (for just dates), to the uniq. Note that uniq counts the separator in front of field 5 as part of the compared text, so those widths may need to be one larger.
So:
tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16
Solution 2
The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:
tr '\n' '\000'
to get NUL-separated rows. Then sort, uniq and xargs have options to handle that. Try something like:
cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
Edit: the position of tr in the pipe was wrong.
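To see why the NUL separation matters, here is a small sketch (not part of the original answer) that feeds one date/time stamp to xargs both ways. The variable names split_count and whole_count are made up for illustration.

```shell
# Default xargs splits on whitespace, so the stamp becomes two arguments.
split_count=$(printf '01/01/2005 00:22\n' | xargs -n 1 echo | wc -l)

# With NUL-terminated input and -0, the stamp stays one argument.
whole_count=$(printf '01/01/2005 00:22\n' | tr '\n' '\000' | xargs -0 -n 1 echo | wc -l)

echo "split on whitespace: $split_count argument(s)"
echo "NUL separated:       $whole_count argument(s)"
```

The first count comes out as 2 and the second as 1, which is the difference the tr step is there to make.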
Solution 3
You can tell xargs to use each line as an argument in its entirety using the -d option. Try:
cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
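A possible variation, not from the original answer: grep can read the patterns itself with -f -, which sidesteps xargs altogether, and -F makes it treat each stamp as a fixed string rather than a regular expression. The function name dup_rows is hypothetical, and this assumes a grep that accepts - as "read patterns from standard input" (GNU grep does).

```shell
# Hypothetical helper: print every row of the CSV in "$1" whose 5th field
# also appears in at least one other row.
dup_rows() {
  cut -d, -f 5 "$1" |   # pull out the date/time column
    sort |              # uniq needs sorted input
    uniq -d |           # keep one copy of each repeated stamp
    grep -F -f - "$1"   # use those stamps as fixed-string patterns
}

# Usage (assumes the file from the question exists):
#   dup_rows myfile.csv
```

One caveat: grep matches the stamp anywhere in the row, not only in column 5, so this is only safe if the same text cannot show up in another column.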
Solution 4
This is a good candidate for awk:
BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }  # split() fills A starting at index 1
END { for (i in date) print i ":" date[i] }
- Set field separator to ',' (CSV).
- Split fifth field on the space, stick result in A.
- Concatenate the line number to the list of what we have already stored for that date.
- Print out the line numbers for each date.
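The steps above can be wrapped so they run straight from the shell. This is a sketch under a couple of assumptions: the function name dup_dates is made up, and it goes one step further than the answer by printing only dates that occur on more than one line.

```shell
# Hypothetical helper: for each date (not date/time) that occurs on more
# than one line of the CSV in "$1", print the date and its line numbers.
dup_dates() {
  awk -F, '
    {
      split($5, a, " ")                 # a[1] = date, a[2] = time
      count[a[1]]++                     # how many rows carry this date
      lines[a[1]] = lines[a[1]] " " NR  # remember the line numbers
    }
    END {
      for (d in count)
        if (count[d] > 1) print d ":" lines[d]
    }
  ' "$1"
}

# Usage (assumes the file from the question exists):
#   dup_dates myfile.csv
```

The order in which awk walks the array in the END block is unspecified, so pipe the result through sort if the order matters.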
Sukotto
Updated on June 12, 2022
Comments
-
Sukotto almost 2 years
I have a comma-delimited file "myfile.csv" where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).
I'm using a bash shell via Cygwin for WinXP.
$ cut -d, -f 5 myfile.csv | sort | uniq -d
correctly returns a list of the duplicate dates:
01/01/2005 00:22
01/01/2005 00:37
[snip]
02/29/2009 23:54
But I cannot figure out how to feed this to grep to give me all the rows. Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.
So, given that
$ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
doesn't work... what can I do?
I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard commandline tools like sort, uniq, find, grep, cut, etc.
Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?
-
kmkaplan about 15 years
Yes, +1. And tr '\t' ',' at the end if the CSV format is important.
-
Felipe Alvarez almost 13 years
tr '\n' '\000': exactly what I was looking for