How to delete duplicate lines in a file without sorting it in Unix

Solution 1

awk '!seen[$0]++' file.txt

seen is an associative array keyed by the text of each line ($0). The first time a line is read it has no entry in the array, so seen[$0] evaluates to false. The ! is the logical NOT operator and inverts that false to true. AWK prints the lines for which the expression evaluates to true.

The ++ is a post-increment: the expression uses the value seen[$0] had before the increment, and afterwards seen[$0] == 1 the first time a line is found, then seen[$0] == 2, and so on. AWK treats everything but 0 and "" (empty string) as true. When a duplicate line is read, seen[$0] is already nonzero, so !seen[$0] evaluates to false and the line is not written to the output.
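
For illustration, here is a minimal sketch of the one-liner on a small made-up input. The sample lines and the dedupe helper name are assumptions, not part of the original answer. It shows that the first occurrence of each line is kept and that the original order is preserved:

```shell
# Hypothetical helper wrapping the one-liner; reads the named files or stdin.
dedupe() {
    awk '!seen[$0]++' "$@"
}

# Made-up sample input with nonconsecutive duplicates.
result=$(printf 'b\na\nb\nc\na\n' | dedupe)

printf '%s\n' "$result"
# Prints:
# b
# a
# c
```

Note that if several files are passed at once, the seen array is shared across all of them, so a line is kept only the first time it appears in any of the files.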

Solution 2

From http://sed.sourceforge.net/sed1line.txt: (Please don't ask me how this works ;-) )

 # delete duplicate, consecutive lines from a file (emulates "uniq").
 # First line in a set of duplicate lines is kept, rest are deleted.
 sed '$!N; /^\(.*\)\n\1$/!P; D'

 # delete duplicate, nonconsecutive lines from a file. Beware not to
 # overflow the buffer size of the hold space, or else use GNU sed.
 sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
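
One way to check that the first one-liner really behaves like uniq is to run both on the same input. This is only a sketch with made-up sample lines (the variable names are hypothetical). It was reasoned through against GNU sed, and other sed implementations may treat N at end of input differently:

```shell
# Made-up sample: "a" is repeated consecutively and also reappears later.
sample=$(printf 'a\na\nb\na\n')

# The sed one-liner and uniq, fed the same input.
sed_out=$(printf '%s\n' "$sample" | sed '$!N; /^\(.*\)\n\1$/!P; D')
uniq_out=$(printf '%s\n' "$sample" | uniq)

printf '%s\n' "$sed_out"
# Only the consecutive repeat is removed; the later "a" survives:
# a
# b
# a
```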

Solution 3

Perl one-liner similar to jonas's AWK solution:

perl -ne 'print if ! $x{$_}++' file

This variation removes trailing white space before comparing:

perl -lne 's/\s*$//; print if ! $x{$_}++' file

This variation edits the file in-place:

perl -i -ne 'print if ! $x{$_}++' file

This variation edits the file in-place, and makes a backup file.bak:

perl -i.bak -ne 'print if ! $x{$_}++' file

Solution 4

An alternative way using Vim (Vi compatible):

Delete duplicate, consecutive lines from a file:

vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq file

Delete duplicate, nonconsecutive and nonempty lines from a file:

vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq file

Solution 5

The one-liner that Andre Miller posted works, except with recent versions of sed when the input file ends with a blank line and no characters. On my Mac, the CPU just spins.

This is an infinite loop if the last line is blank and doesn't have any characters:

sed '$!N; /^\(.*\)\n\1$/!P; D'

It doesn't hang, but you lose the last line:

sed '$d;N; /^\(.*\)\n\1$/!P; D'

The explanation is at the very end of the sed FAQ:

The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
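
The line loss with the $d;N form is easy to reproduce. A minimal sketch, using a made-up two-line input with no duplicates at all (the input and the variable name are assumptions). The last line is deleted by $d before it can be printed, which was reasoned through for GNU sed:

```shell
# Made-up two-line sample; neither line is a duplicate.
result=$(printf 'a\nb\n' | sed '$d;N; /^\(.*\)\n\1$/!P; D')

printf '%s\n' "$result"
# Only "a" is printed; the final line "b" is dropped.
```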

Author: Vijay

http://theunixshell.blogspot.com/

Updated on October 29, 2021

Comments

  • Vijay over 2 years

    Is there a way to delete duplicate lines in a file in Unix?

    I can do it with sort -u and uniq commands, but I want to use sed or awk.

    Is that possible?

    • Michael Krelin - hacker almost 15 years
      if you mean consecutive duplicates then uniq alone is enough.
    • Michael Krelin - hacker almost 15 years
      and otherwise, I believe it's possible with awk, but will be quite resource consuming on bigger files.
    • tripleee over 5 years
      Duplicates stackoverflow.com/q/24324350 and stackoverflow.com/q/11532157 have interesting answers which should ideally be migrated here.
  • Michael Krelin - hacker almost 15 years
    geekery;-) +1, but resource consumption is unavoidable.
  • Beta almost 15 years
    '$!N; /^(.*)\n\1$/!P; D' means "If you're not at the last line, read in another line. Now look at what you have and if it ISN'T stuff followed by a newline and then the same stuff again, print out the stuff. Now delete the stuff (up to the newline)."
  • Beta almost 15 years
    'G; s/\n/&&/; /^([ -~]*\n).*\n\1/d; s/\n//; h; P' means, roughly, "Append the whole hold space to this line, then if you see a duplicated line throw the whole thing out, otherwise copy the whole mess back into the hold space and print the first part (which is the line you just read)."
  • eddi almost 12 years
    Is the $! part necessary? Doesn't sed 'N; /^\(.*\)\n\1$/!P; D' do the same thing? I can't come up with an example where the two are different on my machine (fwiw I did try an empty line at the end with both versions and they were both fine).
  • amichair over 11 years
    The second solution doesn't work for me (on GNU sed 4.2.1), on a test file with only lowercase English letters and spaces. However, replacing [ -~] with . or [^\n] or even [ -z{|}~] (the exact same set of characters) does the job. If anyone can explain the difference, that would be nice...
  • Vijay about 10 years
    This will disturb the order of the lines.
  • Alexander Lubyagin over 6 years
    What about a 20 GB text file? Too slow.
  • Akash Kandpal almost 6 years
    To save it in a file we can do this awk '!seen[$0]++' merge_all.txt > output.txt
  • tripleee over 5 years
    As ever, the cat is useless. Anyway, uniq already does this by itself, and doesn't require the input to be exactly one word per line.
  • Nick K9 over 5 years
    An important caveat here: if you need to do this for multiple files, and you tack more files on the end of the command, or use a wildcard… the 'seen' array will fill up with duplicate lines from ALL the files. If you instead want to treat each file independently, you'll need to do something like for f in *.txt; do gawk -i inplace '!seen[$0]++' "$f"; done
  • B Layer over 4 years
    Almost 7 years later and no one answered @amichair ... <sniff> makes me sad. ;) Anyways, [ -~] represents a range of ASCII characters from 0x20 (space) to 0x7E (tilde). These are considered the printable ASCII characters (linked page also has 0x7F/delete but that doesn't seem right). That makes the solution broken for anyone not using ASCII or anyone using, say, tab characters.. The more portable [^\n] includes a whole lot more characters...all of 'em except one, in fact.
  • amichair over 4 years
    Thanks for caring, @BLayer :-) I think I may have been asking about the second case - [ -z{|}~] and [ -~] seem to select the same range of ASCII characters, yet one worked and the other did not...
  • B Layer over 4 years
    @amichair You'll never walk aloooone. :D Alas, I think I mistakenly read "space" as "whitespace" and assumed you had a Tab somewhere in there. Maybe it was a bug in sed. Can you still reproduce? I can't with gnu sed 4.4. Only other thing that comes to mind is [..] ranges being non-portable across different locales (i.e. LC_COLLATE, fixed by setting LC_ALL=C) but that seems like a stretch esp. since it sounds like you know what you're doing. Sorry for raising false hopes. ;)
  • amichair over 4 years
    @BLayer Nope, on GNU sed 4.4 on Ubuntu 18.04 [ -~] works for me but [ -z{|}~] does not in the second command (non-consecutive lines, e.g. pipe echo -e "1\n2\n3\n1\n4\n3\n" into the command).
  • sfscs over 4 years
    @NickK9 that de-duping cumulatively across multiple files is awesome in itself. Nice tip
  • scavenger over 4 years
    same command on Windows with busybox: busybox echo -e "1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5" | busybox sed -nr "$!N;/^(.*)\n\1$/!P;D"
  • honzajde over 3 years
    It also works thanks to the fact that the result of '++' operator is not the value after increment, but the previous value.
  • Esmu Igors over 3 years
    I think awk is overkill here.
  • einpoklum over 3 years
    This will only remove consecutive duplicates.
  • Al Bundy over 3 years
    How to redirect output to stdout? Piping does not work with this approach.
  • Chris Koknat over 3 years
    My original answer outputs to stdout, as well as the first variation