Find the duplicated line in a large file Java

14,096

Solution 1

The most efficient implementation really depends on your requirements.

From what you've written: So, I have a large file containing 3 million lines of words. And I need to see if there is any duplicates., I assume you're only looking to check whether there is a duplicate line.

In such case you don't need to count how many duplicates there are and using the HashSet and the old, good string hashing function might be good enough (or even better).

Here's the example:

boolean hasDuplicate = false;
Set<String> lines = new HashSet<String>();
while ( (line = list.readLine()) != null && !hasDuplicate )
    {
        if (lines.contains(line)) {
            hasDuplicate = true;
        }
        lines.add(line);
    }

    if (hasDuplicate){
        System.out.print("NOT UNIQUE");
    } else {
        System.out.print("UNIQUE");
    }
    list.close();
}

Solution 2

This is a well known problem called Count-Distinct Problem There are various algorithms :

In Java you can use BitSet

Share:
14,096
walterudoing
Author by

walterudoing

Updated on June 04, 2022

Comments

  • walterudoing
    walterudoing almost 2 years

    So, I have a large file containing 3 million lines of words. And I need to see if there is any duplicates.

    I put the lines in a TreeMap so that they are sorted, put "lines" into key and give "1" to their value. When there is a duplicate, the value of the line stacks up. Then I will have to see if there is any value that is not 1.

    Here is my code:

        BufferedReader list = new BufferedReader( new FileReader( args[0] ) );
        String line;
        TreeMap<String,Integer> map  = new TreeMap<String,Integer>();
    
        while ( (line = list.readLine()) != null )
        {
            if (!map.containsKey(line)) 
            {
                map.put(line, 0);
            }
            map.put(line, map.get(line) + 1);   
        }
    
        if ( !map.containsKey(1)  )
        {
            System.out.print("NOT UNIQUE");
        }
        else
        {
            System.out.print("UNIQUE");
        }
        list.close();
    }
    

    Question:

    1. Will the use of TreeMap speed up the process? Or using HashMap will have the same/faster speed?

    2. The output:

      Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer at java.lang.Integer.compareTo(Integer.java:52) at java.util.TreeMap.getEntry(TreeMap.java:346) at java.util.TreeMap.containsKey(TreeMap.java:227) at Lab10.main(Lab10.java:22)

    which is if ( !map.containsKey(1) ) , but I don't know what went wrong.

  • Girish007
    Girish007 over 6 years
    for huge files, how to take care the memory in this case?