Deleting duplicate lines in a file using Java
Solution 1
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.
It would be something like this (ignoring exceptions):
import java.io.*;
import java.util.HashSet;
import java.util.Set;

public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // initial capacity; maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();

    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of the extra linked list should be insignificant compared to the actual amount of data.
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a plain HashSet instead of a LinkedHashSet. But I doubt you'd notice the difference on an I/O-bound operation like this one.
Solution 2
Okay, most answers are a bit silly and slow since they involve adding lines to some hashset or whatever and then moving everything back out of that set again. Let me show the optimal solution in pseudocode:
Create a hashset for just strings.
Open the input file.
Open the output file.
While not EOF(input)
    Read Line.
    If not (Line in hashset)
        Add Line to hashset.
        Write Line to output.
    End If.
End While.
Free hashset.
Close input.
Close output.
Please guys, don't make it more difficult than it needs to be. :-) Don't even bother with sorting; you don't need to.
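For the Java-minded, here's a minimal sketch of that pseudocode (the method name is illustrative; a real version would want try/finally or try-with-resources around the streams):

import java.io.*;
import java.util.HashSet;
import java.util.Set;

public static void stripDuplicates(String inFile, String outFile) throws IOException {
    Set<String> seen = new HashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(inFile));
    BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
    String line;
    while ((line = in.readLine()) != null) {
        if (seen.add(line)) { // add() returns false if the line was already present
            out.write(line);
            out.newLine();
        }
    }
    out.close();
    in.close();
}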
Solution 3
A similar approach, using IOUtils from Apache Commons IO:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.LinkedHashSet;
import org.apache.commons.io.IOUtils;

public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
        "\n", new FileOutputStream(filename + ".uniq"));
}
Solution 4
Something like this, perhaps:
BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added

PrintWriter out = ...;
for (String line : lines)
    out.println(line);
LinkedHashSet keeps the insertion order, as opposed to HashSet, which (while being slightly faster for lookup/insert) will reorder all lines.
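A quick, hypothetical illustration of the difference (the HashSet output order can vary between JVM runs):

import java.util.*;

public class OrderDemo {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("banana", "apple", "banana", "cherry");
        // Insertion order preserved, duplicates dropped: [banana, apple, cherry]
        System.out.println(new LinkedHashSet<String>(input));
        // Same elements, but in whatever order the hash table yields
        System.out.println(new HashSet<String>(input));
    }
}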
Solution 5
If the order does not matter, the simplest way is shell scripting:
<infile sort | uniq > outfile

(sort -u <infile > outfile does the same thing in a single step.)
Updated on July 09, 2022

Comments
-
Monster almost 2 years
As part of a project I'm working on, I'd like to clean duplicate line entries out of a file I generate. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process will take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
-
Peter Perháč almost 15 years
9 answers and no votes up? This is a perfectly valid and well-formulated question.
-
David Johnstone almost 15 years
You're better off with some sort of set rather than a map.
-
z - almost 15 years
Yeah, 40 megs is nothing: read the whole thing into memory, dump it into a hashset to keep only unique lines, and write it back to disk.
-
fmunshi almost 15 years
Depending on the questioner's requirements, you may need to keep track of the line number, because iterating over a HashSet will return the lines in a pretty arbitrary order.
-
Philipp almost 15 years
You could initialize the hashset with a capacity of about #lines / 0.75, because HashSet will make a new table and rehash everything once it reaches its default load factor of 75%. Another possibility would be to create the HashSet with a load factor of 1.0f (100%) and a size that is a bit bigger than your data count, e.g. new HashSet(300000, 1.0f). This way you can avoid expensive rehashing. (A short sketch of this appears after the comments.)
-
Jonik almost 15 years
You could simplify this code by using readLines() and writeLines() from Commons IO's FileUtils, commons.apache.org/io/api-release/org/apache/commons/io/…. (I'm not sure whether that would affect the scalability though.)
-
Monster almost 15 years
Hmmm, I tried implementing this, but I get the error "java.lang.OutOfMemoryError: Java heap space". I tried increasing the HashSet size, but no good. Ideas? Thanks!
-
Wim ten Brink almost 15 years
I've done something similar in Delphi once, although I had to write my own HashSet class to do this. The only drawback is that you need lots of memory with huge files, which is fine if you do this client-side but not on a server. Basically, the project that needed this managed to read a file of 500k lines and delete all duplicates within two minutes.
-
Wim ten Brink almost 15 years
However, I just read a line, checked if it was in the hash-set, and if it wasn't, I would add it and write it to file. Otherwise, I'd just skip to the next line. That way, I'm not reading back from the hashset and, best of all, I kept all lines in the same order.
-
palantus almost 15 years
Not a bad idea, but since the input file is only 40 megabytes I don't think that it will be a problem.
-
mikek almost 15 years
I guess. But parallelizing stuff is phun! :3
-
Philipp almost 15 years
He will most likely need more than 64 MB of RAM. Why? A 40 MB US-ASCII text file becomes roughly 80 MB as Java strings (chars are two bytes each), plus HashSet overhead, plus object overhead... I'd go with 512 MB or so :)
-
palantus almost 15 years
Ah, but he's not storing duplicate lines, so it depends on how many duplicates there are. (You're likely right, though, and it doesn't hurt to have a larger allocation on a short-running program like this one.)
-
Monster almost 15 years
Ah, I set -Xmx to 512 and it seemed to work. Great fix! Duplicates gone! Thanks guys!
-
Monster almost 15 years
And as a final aside, I did end up using LinkedHashSet. While order isn't hugely important, it makes tracking things a lot easier. And the overhead is nil. Thanks again everyone!
-
gustafc almost 15 years
+1 for stating the bleeding obvious I should've seen when writing my answer. D'oh! :)
-
palantus almost 15 years
True; I was doing it without a temporary file, but it might be a little more efficient with one (no LinkedHashSet necessary). But I'd venture a guess that CPU isn't going to be the bottleneck anyway.
-
palantus almost 15 years
Er, my comment was directed at Workshop Alex, not gustafc.
-
Wim ten Brink almost 15 years
Of course, instead of using an output file, you could output to an unsorted string list in memory. Then, when you're done adding the input without duplicates, write the string list over the old input file. It does mean you'll be using twice as much memory as with other solutions, but it's still extremely fast.
-
palantus almost 15 years
@Workshop Alex: That's basically what I did. Why do you say it uses twice as much memory?
-
Wim ten Brink almost 15 years
That's because it stores the strings twice: once in the hash table and once in the string list. (Then again, chances are that both the hashset and string list only store references to strings, in which case it won't eat up that much.)
-
palantus almost 15 years
Yes, they do store references. The extra overhead is probably not even enough to notice, at 8 bytes per unique string.
-
Wim ten Brink almost 15 years
Simple calculation: 225k lines, times 8 bytes for every reference, makes 1.8 megabytes. With two lists, that doubles to 3.6 megabytes. Then again, if 90% are duplicates, you can reduce those numbers by 90% again...
-
Jonik almost 15 years
Shouldn't the latter FileInputStream actually be a FileOutputStream? Other than that, +1 for simplicity and "knowing and using the libraries".
-
Jonik almost 15 years
Also, it's worth mentioning that IOUtils is from Apache Commons IO (commons.apache.org/io); that probably isn't obvious to every reader.
-
Vladimir Stazhilov over 7 years
This exact implementation in Scala: blog.cyberwhale.tech/2017/01/09/…
-
Vaibs about 7 years
Sets are meant for exactly that. (y)
-
HopeKing over 6 years
Super efficient! I had to process 30,000 files with 100 lines each and remove duplicates. This took 10 mins while the other solution took 3 hrs.
-
phihag about 5 years
@nanosoft That would be a UUOC (useless use of cat).
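As referenced in Philipp's comment above, here is a hypothetical sketch of the pre-sizing idea (the figures are illustrative, taken from the question's ~225k line count; the class name is made up):

import java.util.HashSet;
import java.util.Set;

public class PreSizedSet {
    public static void main(String[] args) {
        int expectedLines = 225000;   // roughly the question's line count
        float loadFactor = 0.75f;     // HashSet's default load factor
        // Capacity chosen so the set stays below its load factor and never rehashes.
        int capacity = (int) (expectedLines / loadFactor) + 1;
        Set<String> lines = new HashSet<String>(capacity, loadFactor);
        System.out.println("initial capacity: " + capacity);
    }
}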