Deleting duplicate lines in a file using Java


Solution 1

Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.

It would be something like this (ignoring exceptions):

public void stripDuplicatesFromFile(String filename) throws IOException {
    // Read every line into a set; duplicates are dropped automatically.
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();
    // Write the unique lines back over the original file.
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}

If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.

Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.

Solution 2

Okay, most answers are a bit silly and slow, since they involve adding lines to some hash set or whatever and then moving them back out of that set again. Let me show the most efficient solution in pseudocode:

Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
  Read Line.
  If not(Line in hashSet)
    Add Line to hashset.
    Write Line to output.
  End If.
End While.
Free hashset.
Close input.
Close output.

Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.
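For reference, a minimal Java sketch of this streaming approach (the method name is illustrative; imports are omitted, as in the snippets above):

public void stripDuplicates(String inFile, String outFile) throws IOException {
    Set<String> seen = new HashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(inFile));
    BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
    String line;
    while ((line = in.readLine()) != null) {
        if (seen.add(line)) { // add() returns false if the line was already seen
            out.write(line);
            out.newLine();
        }
    }
    in.close();
    out.close();
}

Because each line is written the moment it is first seen, the output also keeps the original line order without needing a LinkedHashSet.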

Solution 3

A similar approach, using IOUtils from Apache Commons IO:

public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
        "\n", new FileOutputStream(filename + ".uniq"));
}

Solution 4

Something like this, perhaps:

BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);

LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.
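A quick illustration of the difference (the HashSet iteration order shown is just one possible arbitrary order):

Set<String> hashed = new HashSet<String>(Arrays.asList("banana", "apple", "cherry"));
Set<String> linked = new LinkedHashSet<String>(Arrays.asList("banana", "apple", "cherry"));
System.out.println(hashed); // e.g. [cherry, banana, apple] -- order not guaranteed
System.out.println(linked); // [banana, apple, cherry] -- insertion order preserved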

Solution 5

If the order does not matter, the simplest way is shell scripting:

<infile sort | uniq > outfile


Comments

  • Monster
    Monster almost 2 years

    As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text-heavy (about 225k lines of text, and around 40 megs). I estimate my current process would take 63 hours! This is definitely not acceptable.

    I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!

    • Peter Perháč
      Peter Perháč almost 15 years
      9 answers and no votes up? This is a perfectly valid and well-formulated question.
  • David Johnstone
    David Johnstone almost 15 years
    you're better off with some sort of set rather than a map
  • z  -
    z - almost 15 years
    yea, 40 megs is nothing, read the whole thing into memory, dump it to a hashset to keep only unique lines, write it back to disk.
  • fmunshi
    fmunshi almost 15 years
    Depending on the questioner's requirements, you may need to keep track of the line number, because iterating over a HashSet will return the lines in a pretty arbitrary order.
  • Philipp
    Philipp almost 15 years
    You could initialize the HashSet with a capacity of about #lines / 0.75, because a HashSet creates a new table and rehashes everything once it reaches its default load factor of 75%. Another possibility would be to create the HashSet with a load factor of 1.0f (100%) and a capacity that is a bit bigger than your data count -> "new HashSet(300000, 1.0f)". This way you can avoid expensive rehashing (see the sizing sketch after these comments).
  • Jonik
    Jonik almost 15 years
    You could simplify this code by using readLines() and writeLines() from Commons IO's FileUtils, commons.apache.org/io/api-release/org/apache/commons/io/…. (I'm not sure whether that would affect the scalability though.)
  • Monster
    Monster almost 15 years
    Hmmm, I tried implementing this, but I get the error "java.lang.OutOfMemoryError: Java heap space". I tried increasing the HashSet size, but no good. Ideas? Thanks!
  • Wim ten Brink
    Wim ten Brink almost 15 years
    I've done something similar in Delphi once, although I had to write my own HashSet class to do this. The only drawback is that you need lots of memory with huge files, which is fine if you do this client-side but not on a server. Basically, the project that needed this managed to read a file of 500k lines and delete all duplicates within two minutes.
  • Wim ten Brink
    Wim ten Brink almost 15 years
    However, I just read a line, checked if it was in the hash-set and if it wasn't, I would add it and write it to file. Otherwise, I'd just skip to the next line. That way, I'm not reading back from the hashset and best of all: I kept all lines in the same order.
  • palantus
    palantus almost 15 years
    Not a bad idea, but since the input file is only 40 megabytes I don't think that it will be a problem.
  • mikek
    mikek almost 15 years
    I guess. But parallelizing stuff is phun! :3
  • Philipp
    Philipp almost 15 years
    He will most likely need more than 64M of RAM. Why? A 40MB US-ASCII text file -> 80MB as strings + HashSet overhead + object overhead + .... I'd go with 512 MB or so :)
  • palantus
    palantus almost 15 years
    Ah- but he's not storing duplicate lines, so it depends on how many duplicates there are. (You're likely right, though, and it doesn't hurt to have a larger allocation on a short-running program like this one.)
  • Monster
    Monster almost 15 years
    Ah, I set the xmx to 512 and it seemed to work. Great fix! Duplicates gone! Thanks guys!
  • Monster
    Monster almost 15 years
    And as a final aside, I did end up using LinkedHashSet. While order isn't hugely important, it makes tracking things a lot easier. And the overhead is nil. Thanks again everyone!
  • gustafc
    gustafc almost 15 years
    +1 for stating the bleeding obvious I should've seen when writing my answer. D'oh! :)
  • palantus
    palantus almost 15 years
    True; I was doing it without a temporary file, but it might be a little more efficient with one (no LinkedHashSet necessary). But I'd venture a guess that CPU isn't going to be the bottleneck anyway.
  • palantus
    palantus almost 15 years
    Er, my comment was directed at Workshop Alex, not gustafc.
  • Wim ten Brink
    Wim ten Brink almost 15 years
    Of course, instead of using an output file, you could output to an unsorted string list, in memory. Then, when you're done adding the input without duplicates, write the string list over the old input file. It does mean you'll be using twice as much memory as with other solutions, but it's still extremely fast.
  • palantus
    palantus almost 15 years
    @Workshop Alex: That's basically what I did. Why do you say it uses twice as much memory?
  • Wim ten Brink
    Wim ten Brink almost 15 years
    That's because it stores the strings twice: once in the hash table and once in the string list. (Then again, chances are that both the hashset and string list only store references to strings, in which case it won't eat up that much.)
  • palantus
    palantus almost 15 years
    Yes, they do store references. The extra overhead is probably not even enough to notice, at 8 bytes per unique string.
  • Wim ten Brink
    Wim ten Brink almost 15 years
    Simple calculation: 225k lines, times 8 bytes for every reference, makes 1.8 megabytes. With two lists, that doubles to 3.6 megabytes. Then again, if 90% are duplicates you can reduce these numbers by 90% again...
  • Jonik
    Jonik almost 15 years
    Shouldn't the latter FileInputStream actually be a FileOutputStream? Other than that, +1 for a simplicity and "knowing and using the libraries".
  • Jonik
    Jonik almost 15 years
    Also, it's worth mentioning that IOUtils is from Apache Commons IO (commons.apache.org/io); that probably isn't obvious to every reader.
  • Vladimir Stazhilov
    Vladimir Stazhilov over 7 years
    This exact implementation in Scala blog.cyberwhale.tech/2017/01/09/…
  • Vaibs
    Vaibs about 7 years
    Sets are meant for exactly that. (y)
  • HopeKing
    HopeKing over 6 years
    super efficient ! I had to process 30,000 files with 100 lines each and remove duplicates. This took 10 mins while the other solution took 3 hrs.
  • phihag
    phihag about 5 years
    @nanosoft That would be a UUOC.
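
As a footnote to Philipp's sizing comment above, a minimal sketch (the line count is illustrative; HashSet(int, float) is the standard constructor taking an initial capacity and a load factor):

int expectedLines = 225000; // roughly the size of the file in the question
// A capacity of expectedLines / 0.75 means the default load factor never triggers a rehash.
Set<String> lines = new HashSet<String>((int) (expectedLines / 0.75f) + 1);
// Alternatively, a load factor of 1.0 with a capacity a bit above the expected count:
Set<String> alt = new HashSet<String>(300000, 1.0f);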