How Does One Remove Duplicate Text Lines From Files Larger Than 4GB?
Solution 1
A handy Win32 native port of sort is in UnxUtils.
For more complicated meanings of "remove duplicates" there is Perl (et al).
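As one illustration of a "more complicated meaning", here is a minimal Python sketch (not the Perl the answer has in mind) that treats lines as duplicates when they match after trimming whitespace and ignoring case, and keeps the first occurrence in its original position. The function name `dedupe_by_key` and the default normalization rule are assumptions made for this example. It stores a 16-byte digest per distinct line rather than the line itself, so memory grows with the number of distinct lines, not with the file size.

```python
import hashlib


def dedupe_by_key(src_path, dst_path, key=lambda line: line.strip().lower()):
    """Copy src to dst, keeping only the first line for each distinct key.

    `key` is a hypothetical normalization hook: by default, two lines count
    as duplicates if they are equal after stripping whitespace and lowercasing.
    Returns the number of lines written.
    """
    seen = set()  # digests of the keys seen so far
    kept = 0
    # newline="" preserves the original line endings on read and on write.
    with open(src_path, "r", encoding="utf-8", errors="surrogateescape",
              newline="") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape",
              newline="") as dst:
        for line in src:  # streams the file one line at a time
            raw_key = key(line).encode("utf-8", "surrogateescape")
            digest = hashlib.blake2b(raw_key, digest_size=16).digest()
            if digest not in seen:
                seen.add(digest)
                dst.write(line)
                kept += 1
    return kept
```

Passing `key=lambda line: line` would give plain exact-match deduplication while still preserving the original line order, which `sort` followed by `uniq` does not.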
Solution 2
If you have Cygwin or MinGW you could probably accomplish this with
cat file | sort | uniq >> outfile
assuming you want unique lines. I don't know how well this will perform: sorting a dataset that large will probably take a long time (if the file is already sorted, you can leave out the sort step). I also don't know exactly how these commands work internally, in particular whether they will need 4 GB of RAM.
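On the memory question, one common way to sort and deduplicate a file that does not fit in RAM is an external merge sort: sort fixed-size chunks, write each to a temporary file, then merge the sorted chunks while skipping repeated lines. The Python sketch below illustrates that general technique only; it is not a description of how any particular `sort` port is implemented. The name `external_sort_unique` and the `lines_per_chunk` parameter are assumptions made for this example.

```python
import heapq
import itertools
import os
import tempfile

ENC = {"encoding": "utf-8", "errors": "surrogateescape"}


def _with_newline(lines):
    # Give a final unterminated line a newline so it compares equal to others.
    for line in lines:
        yield line if line.endswith("\n") else line + "\n"


def external_sort_unique(src_path, dst_path, lines_per_chunk=1_000_000):
    """Write the sorted, de-duplicated lines of src_path to dst_path.

    Only about `lines_per_chunk` lines are held in memory at once; the rest
    lives in temporary chunk files, so free disk space is needed instead.
    """
    chunk_paths = []
    try:
        # Pass 1: split the input into sorted, locally de-duplicated chunks.
        with open(src_path, "r", **ENC) as src:
            lines = _with_newline(src)
            while True:
                chunk = sorted(set(itertools.islice(lines, lines_per_chunk)))
                if not chunk:
                    break
                fd, path = tempfile.mkstemp(suffix=".chunk")
                chunk_paths.append(path)
                with os.fdopen(fd, "w", **ENC) as tmp:
                    tmp.writelines(chunk)

        # Pass 2: k-way merge of the chunks, dropping adjacent duplicates.
        # Note: this opens every chunk at once, so a very small chunk size
        # on a huge file could hit the OS limit on open files.
        files = [open(p, "r", **ENC) for p in chunk_paths]
        try:
            with open(dst_path, "w", **ENC) as dst:
                previous = None
                for line in heapq.merge(*files):
                    if line != previous:
                        dst.write(line)
                        previous = line
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

The trade-off is that memory use is bounded by the chunk size, but the temporary directory needs roughly as much free space as the input, and the output is in sorted order rather than the original order.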
Solution 3
You can remove duplicate lines in a huge file with PilotEdit.
Solution 4
I found a tool called PilotEdit which was able to do it.
darkAsPitch
Updated on September 17, 2022
Comments
-
darkAsPitch almost 2 years
I am looking for an open-source (possibly 64-bit) Windows text editor that will allow me to remove duplicate lines from an extremely large (4GB+) text file.
What do you use to remove duplicate lines from your large text files?
-
Admin over 13 years
Duplicates of... what? Words? Lines of words? Please provide a sample (considerably shorter than 4 GB).
-
Admin over 13 years
Added the Windows tag, since this is a Windows-specific question.
-
darkAsPitch over 13 years
Thank you! Cygwin and the sort command was exactly what I needed!
-
darkAsPitch over 13 years
Thank you for the reply, but UnxUtils was unavailable for download when I attempted it.
-
user5249203 over 13 years
You can download UnxUtils from sourceforge.net/projects/unxutils/files/unxutils/current/…
-
Raz3rt almost 11 years
Unfortunately, it doesn't seem to work for large files. There is a bug (I think) in UnxUtils, and it complains about not being able to read from /tmp/<temp_file>...
-
user5249203 almost 11 years
@Gordon: Interesting. How large is "large" in MB or GBytes? And what O/S and filesystem, and how much free space?
-
Raz3rt almost 11 years
The OS was Windows 2008 R2 Datacenter running on Amazon Web Services. The file was about 2 GB. It's only a smallish instance, so it might have been down to limited RAM or disk space. Maybe the error message is misleading. I gave up and got it sorted using a Cygwin port on the same instance.
-
Rishi Dua about 10 years
Was already posted by @Dracoder