Find the duplicated line in a large file Java

java performance duplicates treemap

14,096

Solution 1

The most efficient implementation really depends on your requirements.

From what you've written: So, I have a large file containing 3 million lines of words. And I need to see if there is any duplicates., I assume you're only looking to check whether there is a duplicate line.

In such case you don't need to count how many duplicates there are and using the HashSet and the old, good string hashing function might be good enough (or even better).

Here's the example:

boolean hasDuplicate = false;
Set<String> lines = new HashSet<String>();
while ( (line = list.readLine()) != null && !hasDuplicate )
    {
        if (lines.contains(line)) {
            hasDuplicate = true;
        }
        lines.add(line);
    }

    if (hasDuplicate){
        System.out.print("NOT UNIQUE");
    } else {
        System.out.print("UNIQUE");
    }
    list.close();
}

Solution 2

This is a well known problem called Count-Distinct Problem There are various algorithms :

In Java you can use BitSet

14,096

Author by

walterudoing

Updated on June 04, 2022

Comments

walterudoing almost 2 years
So, I have a large file containing 3 million lines of words. And I need to see if there is any duplicates.

I put the lines in a TreeMap so that they are sorted, put "lines" into key and give "1" to their value. When there is a duplicate, the value of the line stacks up. Then I will have to see if there is any value that is not 1.

Here is my code:
```
    BufferedReader list = new BufferedReader( new FileReader( args[0] ) );
    String line;
    TreeMap<String,Integer> map  = new TreeMap<String,Integer>();

    while ( (line = list.readLine()) != null )
    {
        if (!map.containsKey(line)) 
        {
            map.put(line, 0);
        }
        map.put(line, map.get(line) + 1);   
    }

    if ( !map.containsKey(1)  )
    {
        System.out.print("NOT UNIQUE");
    }
    else
    {
        System.out.print("UNIQUE");
    }
    list.close();
}
```
Question:
1. Will the use of TreeMap speed up the process? Or using HashMap will have the same/faster speed?
2. The output:
  
  Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer at java.lang.Integer.compareTo(Integer.java:52) at java.util.TreeMap.getEntry(TreeMap.java:346) at java.util.TreeMap.containsKey(TreeMap.java:227) at Lab10.main(Lab10.java:22)
which is if ( !map.containsKey(1) ) , but I don't know what went wrong.
Girish007 over 6 years

for huge files, how to take care the memory in this case?