Counting frequency of words from a .txt file in Java


Solution 1

Let me combine all the good answers here.

1) Split up your methods so that each handles one thing: one to read the files into a String[], one to process the String[], and one to call the first two.

2) When you split, think carefully about how you want to split. As @m0skit0 suggests, you should likely split with \b for this problem.

3) As @jas suggested, you should first check whether your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.

4) To print out the map in the way you likely expect, take a look at the following:

Map<String, Integer> test = new HashMap<>();

for (Map.Entry<String, Integer> entry : test.entrySet()) {
  System.out.println(entry.getKey() + " " + entry.getValue());
}
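The four points above could be put together roughly as follows. This is only a sketch under assumptions: the class and method names are made up for illustration, the files are assumed to be UTF-8 and named 1.txt through nFiles.txt as described in the question, and the split pattern (runs of non-letter characters) is one possible choice rather than the only one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class and method names, chosen for this sketch only.
final class WordFrequencySketch {

    // Step 1: read one file into a list of lines (assumes UTF-8 text).
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    // Step 2: count every word in the given lines.
    static Map<String, Integer> countInLines(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Assumption: a "word" is a run of letters; everything else separates words.
            for (String word : line.split("[^\\p{L}]+")) {
                if (word.isEmpty()) {
                    continue; // split can yield an empty first token
                }
                if (counts.containsKey(word)) {
                    counts.put(word, counts.get(word) + 1);
                } else {
                    counts.put(word, 1);
                }
            }
        }
        return counts;
    }

    // Step 3: call the first two for files 1..nFiles (inclusive) in the directory.
    static Map<String, Integer> countInDirectory(String directory, int nFiles) throws IOException {
        Map<String, Integer> total = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) {
            List<String> lines = readLines(Paths.get(directory, k + ".txt"));
            for (Map.Entry<String, Integer> entry : countInLines(lines).entrySet()) {
                // Add this file's count to the running total for the word.
                total.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return total;
    }
}
```

Keeping the counting step separate from the file reading means it can be tried out on a few hand-written lines before any files are involved.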

Solution 2

I would have expected something more like this. Does it make sense?

if (wordCount.containsKey(words[i])) { 
  int n = wordCount.get(words[i]);    
  wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
  wordCount.put(words[i], 1);
}

If the word is already in the HashMap, we want to get the current count, add 1 to it, and store the new count under that word in the HashMap.

If the word is not yet in the HashMap, we simply put it in the map with a count of 1 to start with. The next time we see the same word, we'll up the count to 2, and so on.
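For comparison, here is a small self-contained sketch of that update rule next to an equivalent one-liner using Map.merge (available since Java 8). The class name, method names, and sample words are hypothetical and only serve the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names, for illustration only.
final class CountUpdateSketch {

    // Explicit form: look up the current count, add 1, store it back.
    static Map<String, Integer> countExplicit(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            if (wordCount.containsKey(word)) {
                wordCount.put(word, wordCount.get(word) + 1);
            } else {
                wordCount.put(word, 1);
            }
        }
        return wordCount;
    }

    // Same logic with Map.merge: store 1 if absent, otherwise add 1 to the existing count.
    static Map<String, Integer> countWithMerge(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String[] words = {"le", "chat", "et", "le", "chien"};
        System.out.println(countExplicit(words).equals(countWithMerge(words))); // prints true
        System.out.println(countWithMerge(words).get("le")); // prints 2
    }
}
```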

Solution 3

If you split by space only, then other signs (parentheses, punctuation marks, etc.) will be included in the words. For example, if you split "This phrase, contains... funny stuff" by space, you get: "This", "phrase,", "contains...", "funny" and "stuff".

You can avoid this by splitting by word boundary (\b) instead. Note that this also returns the separators between the words (spaces, punctuation) as tokens of their own, so you still need to skip those when counting.

line.split("\\b");
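To see the difference, the example phrase can be run through a few split patterns side by side. This is a sketch with hypothetical names. The third pattern, which splits on runs of non-letter characters, is not from the answer above; it is an alternative included here for comparison.

```java
import java.util.Arrays;

// Hypothetical names, for illustration only.
final class SplitComparisonSketch {

    // Splitting on single spaces keeps punctuation attached to the words.
    static String[] bySpace(String line) {
        return line.split(" ");
    }

    // Splitting on word boundaries separates words from punctuation,
    // but the separators themselves are returned as tokens too.
    static String[] byWordBoundary(String line) {
        return line.split("\\b");
    }

    // Alternative (assumption, not from the answer): treat every run of
    // non-letter characters as one separator, so only words remain.
    // A line that starts with a non-letter yields an empty first token.
    static String[] byNonLetters(String line) {
        return line.split("[^\\p{L}]+");
    }

    public static void main(String[] args) {
        String line = "This phrase, contains... funny stuff";
        // [This, phrase,, contains..., funny, stuff]
        System.out.println(Arrays.toString(bySpace(line)));
        // The five words plus " ", ", ", "... " and " " as separate tokens
        System.out.println(Arrays.toString(byWordBoundary(line)));
        // [This, phrase, contains, funny, stuff]
        System.out.println(Arrays.toString(byNonLetters(line)));
    }
}
```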

Btw, your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.

And pro tip: always print/log the full stack trace for exceptions.
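As a rough illustration of that tip, the catch blocks could call printStackTrace() instead of printing a fixed message or only the exception's toString(). The class name, method name, and path below are hypothetical.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical names, for illustration only.
final class StackTraceSketch {

    // Returns true if the file could be opened, false otherwise.
    static boolean tryOpen(String path) {
        try (FileReader reader = new FileReader(path)) {
            return true;
        } catch (FileNotFoundException fnfe) {
            // Prints the exception type, its message (including the path),
            // and the call chain that led here, to System.err.
            fnfe.printStackTrace();
            return false;
        } catch (IOException ioe) {
            // Closing the reader can also fail; report that in full as well.
            ioe.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        tryOpen("docs/train/eng/does-not-exist.txt");
    }
}
```

The stack trace shows which file name was actually requested and where the call came from, which a bare "File not found." message hides.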

Author: Kommander Kitten

Updated on June 14, 2022

Comments

  • Kommander Kitten almost 2 years

    I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words that appear in a .txt file.

    I have a set of text files in both English and French in their respective folders, labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for the number of files the program should go through (there are 20 files in each folder). Then it reads each file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times it appeared in the file (Key = word, Value = frequency).

    This is the code I came up with for the method:

    public static HashMap<String, Integer> countWords(String directory, int nFiles) {
    // Declare the HashMap
    HashMap<String, Integer> wordCount = new HashMap();
    
    // this large 'for' loop will go through each file in the specified directory.
    for (int k = 1; k < nFiles; k++) {
      // Puts together the string that the FileReader will refer to.
      String learn = directory + k + ".txt";
    
    try {
      FileReader reader = new FileReader(learn);
      BufferedReader br = new BufferedReader(reader);
      // The BufferedReader reads the lines
    
      String line = br.readLine();
    
    
      // Split the line into a String array to loop through
      String[] words = line.split(" ");
      int freq = 0;
    
      // for loop goes through every word
      for (int i = 0; i < words.length; i++) {
        // Case if the HashMap already contains the key.
        // If so, just increments the value
    
        if (wordCount.containsKey(words[i])) {         
          wordCount.put(words[i], freq++);
        }
        // Otherwise, puts the word into the HashMap
        else {
          wordCount.put(words[i], freq++);
        }
      }
      // Catching the file not found error
      // and any other errors
    }
    catch (FileNotFoundException fnfe) {
      System.err.println("File not found.");
    }
    catch (Exception e) {
      System.err.print(e);
       }
     }
    return wordCount;
    }
    

    The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.

    If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.