Deleting duplicate lines in a file using Java


Solution 1

Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.

It would be something like this (ignoring exceptions):

public void stripDuplicatesFromFile(String filename) throws IOException {
    // Read every line into a set; duplicates are dropped automatically.
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();
    // Write the unique lines back over the original file.
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}

If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.

Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.

Solution 2

Okay, most answers are a bit silly and slow, since they involve adding lines to some hash set or whatever and then moving them back out of that set again. Let me show the most efficient solution in pseudocode:

Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
  Read Line.
  If not(Line in hashSet)
    Add Line to hashset.
    Write Line to output.
  End If.
End While.
Free hashset.
Close input.
Close output.

Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.
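For reference, a minimal Java sketch of this streaming approach (the method name is illustrative; imports are omitted, as in the snippets above):

public void stripDuplicates(String inFile, String outFile) throws IOException {
    Set<String> seen = new HashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(inFile));
    BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
    String line;
    while ((line = in.readLine()) != null) {
        if (seen.add(line)) { // add() returns false if the line was already seen
            out.write(line);
            out.newLine();
        }
    }
    in.close();
    out.close();
}

Because each line is written the moment it is first seen, the output also keeps the original line order without needing a LinkedHashSet.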

Solution 3

A similar approach, using IOUtils from Apache Commons IO:

public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
        "\n", new FileOutputStream(filename + ".uniq"));
}

Solution 4

Something like this, perhaps:

BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);

LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.
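A quick illustration of the difference (the HashSet iteration order shown is just one possible arbitrary order):

Set<String> hashed = new HashSet<String>(Arrays.asList("banana", "apple", "cherry"));
Set<String> linked = new LinkedHashSet<String>(Arrays.asList("banana", "apple", "cherry"));
System.out.println(hashed); // e.g. [cherry, banana, apple] -- order not guaranteed
System.out.println(linked); // [banana, apple, cherry] -- insertion order preserved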

Solution 5

If the order does not matter, the simplest way is shell scripting:

<infile sort | uniq > outfile


Comments

  • Monster
    Monster almost 2 years

    As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text-heavy (about 225k lines of text, and around 40 megs). I estimate my current process would take 63 hours! This is definitely not acceptable.

    I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!

    • Peter Perháč
      Peter Perháč almost 15 years
      9 answers and no votes up? This is a perfectly valid and well-formulated question.
  • David Johnstone
    David Johnstone almost 15 years
    you're better off with some sort of set rather than a map
  • z  -
    z - almost 15 years
    yea, 40 megs is nothing, read the whole thing into memory, dump it to a hashset to keep only unique lines, write it back to disk.
  • fmunshi
    fmunshi almost 15 years
    Depending on the questioner's requirements, you may need to keep track of the line number, because iterating over a HashSet will return the lines in a pretty arbitrary order.
  • Philipp
    Philipp almost 15 years
    You could initialize the HashSet with a capacity of about #lines / 0.75, because a HashSet creates a new table and rehashes everything once it reaches its default load factor of 75%. Another possibility would be to create the HashSet with a load factor of 1.0f (100%) and a capacity that is a bit bigger than your data count -> "new HashSet(300000, 1.0f)". This way you can avoid expensive rehashing (see the sizing sketch after these comments).
  • Jonik
    Jonik almost 15 years
    You could simplify this code by using readLines() and writeLines() from Commons IO's FileUtils, commons.apache.org/io/api-release/org/apache/commons/io/…. (I'm not sure whether that would affect the scalability though.)
  • Monster
    Monster almost 15 years
    Hmmm, I tried implementing this, but I get the error "java.lang.OutOfMemoryError: Java heap space". I tried increasing the HashSet size, but no good. Ideas? Thanks!
  • Wim ten Brink
    Wim ten Brink almost 15 years
    I've done something similar in Delphi once, although I had to write my own HashSet class to do this. The only drawback is that you need lots of memory with huge files, which is fine if you do this client-side but not on a server. Basically, the project that needed this managed to read a file of 500k lines and delete all duplicates within two minutes.
  • Wim ten Brink
    Wim ten Brink almost 15 years
    However, I just read a line, checked if it was in the hash-set and if it wasn't, I would add it and write it to file. Otherwise, I'd just skip to the next line. That way, I'm not reading back from the hashset and best of all: I kept all lines in the same order.
  • palantus
    palantus almost 15 years
    Not a bad idea, but since the input file is only 40 megabytes I don't think that it will be a problem.
  • mikek
    mikek almost 15 years
    I guess. But parallelizing stuff is phun! :3
  • Philipp
    Philipp almost 15 years
    He will most likely need more than 64M of RAM. Why? A 40MB US-ASCII text file -> 80MB as strings + HashSet overhead + object overhead + .... I'd go with 512 MB or so :)
  • palantus
    palantus almost 15 years
    Ah- but he's not storing duplicate lines, so it depends on how many duplicates there are. (You're likely right, though, and it doesn't hurt to have a larger allocation on a short-running program like this one.)
  • Monster
    Monster almost 15 years
    Ah, I set the xmx to 512 and it seemed to work. Great fix! Duplicates gone! Thanks guys!
  • Monster
    Monster almost 15 years
    And as a final aside, I did end up using LinkedHashSet. While order isn't hugely important, it makes tracking things a lot easier. And the overhead is nil. Thanks again everyone!
  • gustafc
    gustafc almost 15 years
    +1 for stating the bleeding obvious I should've seen when writing my answer. D'oh! :)
  • palantus
    palantus almost 15 years
    True; I was doing it without a temporary file, but it might be a little more efficient with one (no LinkedHashSet necessary). But I'd venture a guess that CPU isn't going to be the bottleneck anyway.
  • palantus
    palantus almost 15 years
    Er, my comment was directed at Workshop Alex, not gustafc.
  • Wim ten Brink
    Wim ten Brink almost 15 years
    Of course, instead of using an output file, you could output to an unsorted string list, in memory. Then, when you're done adding the input without duplicates, write the string list over the old input file. It does mean you'll be using twice as much memory as with other solutions, but it's still extremely fast.
  • palantus
    palantus almost 15 years
    @Workshop Alex: That's basically what I did. Why do you say it uses twice as much memory?
  • Wim ten Brink
    Wim ten Brink almost 15 years
    That's because it stores the strings twice: once in the hash table and once in the string list. (Then again, chances are that both the hashset and string list only store references to strings, in which case it won't eat up that much.)
  • palantus
    palantus almost 15 years
    Yes, they do store references. The extra overhead is probably not even enough to notice, at 8 bytes per unique string.
  • Wim ten Brink
    Wim ten Brink almost 15 years
    Simple calculation: 225k lines, times 8 bytes for every reference, makes 1.8 megabytes. With two lists, that doubles to 3.6 megabytes. Then again, if 90% are duplicates you can reduce these numbers by 90% again...
  • Jonik
    Jonik almost 15 years
    Shouldn't the latter FileInputStream actually be a FileOutputStream? Other than that, +1 for a simplicity and "knowing and using the libraries".
  • Jonik
    Jonik almost 15 years
    Also, it's worth mentioning that IOUtils is from Apache Commons IO (commons.apache.org/io); that probably isn't obvious to every reader.
  • Vladimir Stazhilov
    Vladimir Stazhilov over 7 years
    This exact implementation in Scala blog.cyberwhale.tech/2017/01/09/…
  • Vaibs
    Vaibs about 7 years
    Sets are meant for exactly that. (y)
  • HopeKing
    HopeKing over 6 years
    super efficient ! I had to process 30,000 files with 100 lines each and remove duplicates. This took 10 mins while the other solution took 3 hrs.
  • phihag
    phihag about 5 years
    @nanosoft That would be a UUOC.
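
As a footnote to Philipp's sizing comment above, a minimal sketch (the line count is illustrative; HashSet(int, float) is the standard constructor taking an initial capacity and a load factor):

int expectedLines = 225000; // roughly the size of the file in the question
// A capacity of expectedLines / 0.75 means the default load factor never triggers a rehash.
Set<String> lines = new HashSet<String>((int) (expectedLines / 0.75f) + 1);
// Alternatively, a load factor of 1.0 with a capacity a bit above the expected count:
Set<String> alt = new HashSet<String>(300000, 1.0f);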