How Does One Remove Duplicate Text Lines From Files Larger Than 4GB?


Solution 1

sort -u file > outfile

A handy Win32-native port of sort is included in UnxUtils.

For more complicated meanings of "remove duplicates" there is Perl (et al).
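As one example of a more complicated case: `sort -u` reorders the file, but a common awk idiom (not part of UnxUtils itself; the filenames below are placeholders) removes duplicates while preserving the original line order:

```shell
# Order-preserving dedup: awk prints a line only the first time it is seen.
# The seen[] array holds one entry per *unique* line, so memory use scales
# with the number of distinct lines, not with the total file size.
printf 'b\na\nb\nc\na\n' > in.txt    # tiny demo input
awk '!seen[$0]++' in.txt > out.txt   # real case: awk '!seen[$0]++' file > outfile
```

Unlike the sort-based approaches, this needs no temporary disk space, but it will struggle if the set of unique lines itself cannot fit in RAM.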

Solution 2

If you have Cygwin or MinGW you could probably accomplish this with

sort file | uniq > outfile

assuming you want unique lines. I don't know how this will perform: sorting a dataset that large will probably take a long time (though if the file is already sorted, you can skip that step), nor exactly how these commands behave, e.g. whether they will try to hold all 4 GB in RAM.
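On the RAM question: GNU coreutils sort (the version Cygwin ships) performs an external merge sort, spilling sorted runs to temporary files rather than loading everything into memory. Assuming that version, its `-S` and `-T` flags let you cap the buffer size and choose the temp directory; the filenames below are placeholders:

```shell
# GNU sort spills sorted runs to temp files, so it need not hold the whole
# 4 GB file in RAM. -S caps the in-memory buffer; -T picks the temporary
# directory (point it at a disk with enough free space for the spill files).
printf 'b\na\nb\n' > in.txt             # tiny demo input
sort -u -S 1M -T /tmp in.txt > out.txt  # real case: sort -u -S 512M -T /tmp file > outfile
```

With `-u` doing the deduplication, the separate `uniq` step is unnecessary.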

Solution 3

You can remove duplicate lines in a huge file with PilotEdit.

Solution 4

I found a tool called PilotEdit which was able to do it.



Author: darkAsPitch

Updated on September 17, 2022

Comments

  • darkAsPitch
    darkAsPitch almost 2 years

    I am looking for an open source (possibly 64 bit) windows text editor that will allow me to remove duplicate lines from an extremely large (4GB+) text file.

    What do you use to remove duplicate lines from your large text files?

    • Admin
      Admin over 13 years
      duplicates of .. what? words? lines of words? provide a sample (considerably shorter than 4gb)
    • Admin
      Admin over 13 years
      Added the Windows tag, since this is a Windows-specific question.
  • darkAsPitch
    darkAsPitch over 13 years
    Thank you! CygWin and the sort command was exactly what I needed!
  • darkAsPitch
    darkAsPitch over 13 years
    Thank you for the reply, but UnxUtils was unavailable for download when I attempted it.
  • Raz3rt
    Raz3rt almost 11 years
It doesn't seem to work for large files, unfortunately; there is a bug (I think) in UnxUtils and it complains about not being able to read from /tmp/<temp_file>...
  • user5249203
    user5249203 almost 11 years
    @Gordon: Interesting. How large is "large" in MB or GBytes? and what O/S and filesystem, how much free space?
  • Raz3rt
    Raz3rt almost 11 years
The OS was Windows 2008 R2 Datacenter running on Amazon Web Services. The file was about 2 GB. It's only a smallish instance, so it might have been down to limited RAM/disk space. Maybe the error message is misleading. I gave up and got it sorted using a Cygwin port on the same instance.
  • Rishi Dua
    Rishi Dua about 10 years
    Was already posted by @Dracoder