uniq command not working properly?

Solution 1

You need to use sort before uniq:

find . -type f -exec md5sum {} ';' | sort | uniq -w 33

uniq only removes repeated lines that are adjacent. It does not reorder the lines to look for repeats; sort does that part.
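A minimal sketch of that behaviour, using the five lines from the question as fixed input instead of running find (the variable names are only for illustration, and -w assumes GNU uniq):

```shell
# Sample data in the same order as in the question (hash, two spaces, path).
input='657cf4512a77bf47c39a0482be8e41e0  ./dupes2.txt
657cf4512a77bf47c39a0482be8e41e0  ./dupes.txt
8d60a927ce0f411ec94ac26a4785f749  ./derpina.txt
15f63928b8a1d5337137c38b5d66eed3  ./foo.txt
8d60a927ce0f411ec94ac26a4785f749  ./derp.txt'

# Without sort: the two 8d60... lines are not adjacent, so both survive.
unsorted=$(printf '%s\n' "$input" | uniq -w 33)

# With sort: equal hashes end up next to each other, so uniq collapses them.
sorted=$(printf '%s\n' "$input" | sort | uniq -w 33)

printf 'without sort: %s lines\n' "$(printf '%s\n' "$unsorted" | wc -l)"
printf 'with sort:    %s lines\n' "$(printf '%s\n' "$sorted" | wc -l)"
```

The unsorted pipeline leaves four lines, as in the question, while the sorted one leaves three, one per distinct hash.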

This is documented in man uniq:

Note: uniq does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use sort -u without uniq.

Solution 2

The input for uniq needs to be sorted. So for the example case,

find . -type f -exec md5sum '{}' ';' | sort | uniq -w 33

would work. The -w (--check-chars=N) option makes uniq compare only the first N characters of each line, here the hash column, so it works for this case. However, uniq offers few ways to select the relevant part of a line. For example, there is no option to compare columns 3 and 5 while ignoring column 4.

The sort command has its own option for unique output lines (-u), and lines are treated as duplicates when the keys used for sorting are equal. This means we can use the powerful key syntax of sort to define which part of each line has to be unique.

For the example,

find . -type f -exec md5sum '{}' ';' | sort -k 1,1 -u

also gives one line per hash (which of the duplicate file names is kept may differ), but the sort part is more flexible for other uses.
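As a sketch of that flexibility, the following deduplicates on columns 3 and 5 while ignoring column 4, which uniq alone cannot express. The records are hypothetical data made up for illustration, not output from the question:

```shell
# Hypothetical whitespace-separated records; columns 3 and 5 form the key.
records='a1 x red 10 north
a2 y red 20 north
a3 z red 30 south
a4 w blue 40 north'

# -k 3,3 and -k 5,5 restrict the comparison to those two fields;
# -u keeps one line for each distinct (column 3, column 5) pair.
deduped=$(printf '%s\n' "$records" | sort -k 3,3 -k 5,5 -u)

printf '%s\n' "$deduped"
```

The first two records share the key (red, north), so only one of them is kept and three lines remain.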

Author: user2127726

Updated on September 18, 2022

Comments

  • user2127726
    user2127726 almost 2 years

    So I'm checking the md5 hash of my files with this as my output:

    657cf4512a77bf47c39a0482be8e41e0  ./dupes2.txt
    657cf4512a77bf47c39a0482be8e41e0  ./dupes.txt
    8d60a927ce0f411ec94ac26a4785f749  ./derpina.txt
    15f63928b8a1d5337137c38b5d66eed3  ./foo.txt
    8d60a927ce0f411ec94ac26a4785f749  ./derp.txt
    

    However, after running find . -type f -exec md5sum '{}' ';' | uniq -w 33 to find the unique hashes I get this:

    657cf4512a77bf47c39a0482be8e41e0  ./dupes2.txt
    8d60a927ce0f411ec94ac26a4785f749  ./derpina.txt
    15f63928b8a1d5337137c38b5d66eed3  ./foo.txt
    8d60a927ce0f411ec94ac26a4785f749  ./derp.txt
    

    From my understanding, only one of either derpina.txt or derp.txt should be showing up since their hashes are the same. Am I missing something? Can anyone enlighten me as to why it outputs like this?

  • Devaroop
    Devaroop about 5 years
    uniq should be aliased to sort -u by default on all systems, if it always needs sort to work properly anyway.
  • John1024
    John1024 about 5 years
    That change would lessen some confusion. On the other hand, uniq has many features not available with sort -u. Also, there are cases where one wants to use uniq without sort.