Can sort | uniq or sort | awk count repeated values from one column?


Solution 1

Your question isn't quite clear, so I reverse-engineered your output back to your input (assuming there is a typo in your output, since you mention counting the number of 1's, 2's and 3's in column 2 but show 2 R,2). You'll probably need to explain your question a little better -

sort -t "," -k2 < p1.csv |   
awk -F, '!z[$2]++{ a[$2]=$0; } END {for (i in a) print z[i], a[i]}' |  
sort -k1

Explanation:

 - !z[$2]++ is true only for the first line seen with a given column 2 value, 
   so the action runs once per value; z[$2] keeps a running count of the lines 
   with that value as awk progresses through the file.
 - a[$2]=$0 stores that first (non-duplicate) line in an array.
 - END {..} walks over all the keys and prints the stored values. From array a 
   it prints the first line seen for each unique column 2 value (as in your 
   desired output); from array z it prints the number of lines seen with that 
   column 2 value.
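For readability, here is the same awk program written out with comments. This is just a reformatted sketch of the one-liner above; the behavior is identical:

sort -t "," -k2 < p1.csv |
awk -F, '
    # z[$2]++ keeps a running count of lines per column 2 value;
    # the leading ! makes the action run only for the first line of each value
    !z[$2]++ {
        a[$2] = $0          # remember the first line seen for this value
    }
    END {
        # print the count followed by the representative line, one per value
        for (i in a) print z[i], a[i]
    }' |
sort -k1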

Test:

[jaypal:~/temp] cat file
R,3
R,4
S,1
S,2
S,3
R,2
T,4
R,3
ST,4
RST,2
RSTR,4

[jaypal:~/temp] sort -t "," -k2 < file |   
awk -F, '!z[$2]++{ a[$2]=$0; } END {for (i in a) print z[i], a[i]}' |  
sort -k1
1 S,1
3 R,2
3 R,3
4 R,4

Sort with -u option

For finding unique entries based on a column, you can try sort with the -u option (though it won't give you the counts).

From the man page:

-u, --unique
              with -c, check for strict ordering; 
              without -c, output only the first of an equal run

You can try something like this -

sort -t, -k2 p1.csv | sort -u -t, -k2
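If all you need are the counts and not a representative line, a simpler sketch is to cut out column 2 before counting (cut, sort, and uniq -c are all standard; p1.csv is the file from the question):

cut -d, -f2 p1.csv | sort | uniq -c

On the sample data this prints something like:

   1 1
   3 2
   3 3
   4 4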

Using Uniq

I am not sure uniq can operate on a column separated by a delimiter other than a blank. At least on my Mac it can't. Here is the man page reference:

-f num  Ignore the first num fields in each input line when doing comparisons.  
        A field is a string of non-blank characters separated
        from adjacent fields by blanks.  Field numbers are one based, 
        i.e., the first field is field one.

So if you replace the , delimiter with a space and run the following command, you should get your desired result.

sort -k2 test | uniq -c -f1

Test:

[jaypal:~/temp] cat test
R 3
R 4
S 1
S 2
S 3
R 2
T 4
R 3
ST 4
RST 2
RSTR 4

[jaypal:~/temp] sort -k2 test | uniq -c -f1
   1 S 1
   3 R 2
   3 R 3
   4 R 4
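If you'd rather not keep a separate space-delimited copy of the file, the comma can be translated to a space on the fly. A sketch using tr on the original p1.csv (same data, so it gives the same counts as the test above):

tr ',' ' ' < p1.csv | sort -k2 | uniq -c -f1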

Solution 2

I ran into an interesting case of repeated values in the data set that was easy to resolve, and I wanted to share it for anyone who might be interested.

I'm adding two rows (marked with *) to the dataset:

$cat p1.space
  R 3
  R 4
  S 1
  S 2
  S 3
  R 2
  T 4
  R 3
  ST 4
  RST 2
  RSTR 4
  R* 5
  R* 5

The addition introduces a repeated entry, R* 5. Sorting on column two, we see the new rows at the bottom:

$sort -k2 p1.space
  S 1
  R 2
  RST 2
  S 2
  R 3
  R 3
  S 3
  R 4
  RSTR 4
  ST 4
  T 4
  R* 5
  R* 5

Notice how the repeated item (R* 5) is counted as two entries:

$sort -k2 p1.space | uniq -f1 -c
  1 S 1
  3 R 2
  3 R 3
  4 R 4
  2 R* 5

To get correct counts, we have to deduplicate the dataset with uniq before counting the rows per value:

$sort -k2 p1.space | uniq | uniq -f1 -c
  1 S 1
  3 R 2
  2 R 3
  4 R 4
  1 R* 5
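The same dedup-then-count idea can be expressed for a comma-separated file as well. A sketch, assuming a comma-separated equivalent of p1.space (here called p1b.csv, a hypothetical name) that includes the duplicated R*,5 rows:

# p1b.csv is a hypothetical comma-separated version of p1.space
sort -t, -k2 p1b.csv |
awk -F, '
    !seen[$0]++ {           # run the block only for the first copy of each identical line
        count[$2]++         # count the remaining lines per column 2 value
        if (!($2 in first))
            first[$2] = $0  # keep the first line seen for that value
    }
    END { for (v in count) print count[v], first[v] }' |
sort -t, -k2                # awk array order is unspecified, so sort by column 2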

I hope that helps anyone who might have a similar counting requirement. Good luck! And thank you @Jaypal.


Comments

  • blehman almost 2 years

    Using the following data:

    $cat p1.csv
    R,3
    R,4
    S,1
    S,2
    S,3
    R,2
    T,4
    R,3
    ST,4
    RST,2
    RSTR,4
    

    First sort based on column 2:

    $cat p1.csv | sort -t "," -k2
    S,1
    R,2
    RST,2
    S,2
    R,3
    R,3
    S,3
    R,4
    RSTR,4
    ST,4
    T,4
    

    I want to count the number of 1's, 2's, 3's, and 4's in column two. Something like $cat p1.csv | sort -t "," -k2 | uniq -f2 -c ... Can uniq be applied to one column? The -f2 doesn't apply uniq to the correct field. The output should take the first line for each unique value in column two and count how many times that value occurs, hence the data must first be sorted on column two. A correct output would look like:

    1 S,1
    3 R,2
    3 R,3
    4 R,4
    

    Suggestions?

  • blehman almost 11 years
    $cat p1.csv | sort -t "," -k2 | uniq -f2 -c ... Can uniq be applied to one column? The -f2 doesn't properly apply uniq to the correct field.
  • blehman almost 11 years
    @Jaypal: if you have a moment, could you possibly explain the parts of your awk function??
  • jaypal singh almost 11 years
    @simplyclimb Added the explanation and answer to your first question. Hope that helps.