count number of distinct words

java text-processing

18,901

Solution 1

I might not be understanding correctly, but if all you need is the number of distinct words in a given text, then, depending on where and how you get the words from, you could use a java.util.Scanner and add each word to an ArrayList, skipping any word that is already in the list. The size of the list is then the number of distinct words. Something like the example below:

import java.util.ArrayList;
import java.util.Scanner;

public ArrayList<String> makeWordList(){
    Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
    ArrayList<String> listOfWords = new ArrayList<String>();

    while(scan.hasNext()){ //keep reading until there are no more words
        String word = scan.next(); //scanner splits on whitespace by default
        if(!listOfWords.contains(word)){ //add the word if it isn't added already
            listOfWords.add(word);
        }
    }

    return listOfWords; //return the list you made of distinct words
}

public int getDistinctWordCount(ArrayList<String> list){
    return list.size();
}

Now, if you actually have to check the number of characters in a word before you add it to the list, you just need to add a statement that tests the length of the word string first. For example:

if(word.length() <= someNumber){
//do whatever you need to
}
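To show where that length check sits in the loop, here is a minimal sketch. It is not the code above verbatim: it swaps the ArrayList for a LinkedHashSet (so duplicates are dropped automatically), reads from a String instead of a file, and the class name, method name and `maxLength` parameter are hypothetical stand-ins for `someNumber`.

```java
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

class WordLengthFilter {
    // Collects the distinct words that have at most maxLength characters.
    // maxLength is a hypothetical parameter playing the role of someNumber.
    static Set<String> distinctWordsUpTo(String text, int maxLength) {
        Set<String> words = new LinkedHashSet<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNext()) {
                String word = scan.next(); // whitespace-delimited token
                if (word.length() <= maxLength) {
                    words.add(word); // a Set silently ignores duplicates
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = distinctWordsUpTo("the cat saw the other cat", 3);
        System.out.println(words + " -> " + words.size()); // [the, cat, saw] -> 3
    }
}
```

Using a set also avoids the linear `contains` scan that an ArrayList performs for every word.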

Sorry if I'm not understanding the question and just gave some crappy unrelated answer =P but I hope it helps in some way!

If you need to keep track of how often you see the same word, even though you only want to count it once, you could keep a frequency counter in a second list, so that the index of each count matches the index of its word in the ArrayList and you know which word the frequency corresponds to. Better yet, use a HashMap where the key is the distinct word and the value is its frequency. It is basically the same code as above, but with a HashMap instead of the ArrayList and a variable to count the frequency:

 public HashMap<String, Integer> makeWordList(){
        Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
        HashMap<String, Integer> listOfWords = new HashMap<String, Integer>();
        while(scan.hasNext())
        {
            String word = scan.next(); //scanner splits on whitespace by default
            if(!listOfWords.containsKey(word))
            {                             //add word if it isn't added already
                listOfWords.put(word, 1); //first occurrence of this word
            }
            else
            {
                int countWord = listOfWords.get(word) + 1; //get current count and increment
                //put() replaces the old value for an existing key, so no remove() is needed
                listOfWords.put(word, countWord);
            }
        }
        return listOfWords; //return the HashMap you made of distinct words
    }

public int getDistinctWordCount(HashMap<String, Integer> list){
       return list.size();
}

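//get the frequency of the given word (0 if the word was never seen)
public int getFrequencyForWord(String word, HashMap<String, Integer> list){
    return list.getOrDefault(word, 0); //plain get() returns null for unknown words, which would throw on unboxing
}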
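Once the frequencies are in a map like the HashMap above, the number of distinct words that occur at least k times is a filter over the map's values. The sketch below is one way to do it, assuming the text is already available as a String; the class and method names are made up for illustration, and `Map.merge` is used as a shorter alternative to the containsKey/put pair.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class FrequencyThresholds {
    // Builds word -> frequency. Map.merge stores 1 for a new word,
    // otherwise adds 1 to the existing count.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNext()) {
                counts.merge(scan.next(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Number of distinct words whose frequency is at least minFrequency.
    static long distinctWordsWithAtLeast(Map<String, Integer> counts, int minFrequency) {
        return counts.values().stream().filter(c -> c >= minFrequency).count();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("a b a c b a");
        for (int k = 1; k <= 5; k++) {
            System.out.println(">= " + k + ": " + distinctWordsWithAtLeast(counts, k));
        }
    }
}
```

For the sample text, `a` occurs three times, `b` twice and `c` once, so the thresholds 1 to 5 give 3, 2, 1, 0 and 0 distinct words. If the keys were bigram or trigram phrases instead of single tokens, the same map and filter would apply unchanged.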

Solution 2

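You can use a Multiset (Guava's HashMultiset in the snippet below):

split the string on spaces
create a new multiset from the result

Something like this, where wordCounts.elementSet().size() is the number of distinct words and wordCounts.count(word) is the frequency of a single word: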

String[] words = string.split(" ");
Multiset<String> wordCounts = HashMultiset.create(Arrays.asList(words));
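If adding Guava is not an option, a similar word-to-count mapping can be sketched with the JDK's stream collectors. This is a substitute technique, not the Multiset API; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamWordCounts {
    // Groups equal words together and counts each group.
    static Map<String, Long> wordCounts(String text) {
        return Arrays.stream(text.split(" "))
                .filter(w -> !w.isEmpty()) // consecutive spaces produce empty tokens
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCounts("to be or not to be");
        System.out.println("distinct words: " + counts.size()); // distinct words: 4
        System.out.println("count of 'to': " + counts.get("to")); // count of 'to': 2
    }
}
```

Here `counts.size()` plays the role of the multiset's distinct element count.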

Solution 3

There can be many solutions to this problem, but one that helped me was as simple as the one below:

public static int countDistinctWords(String str){
    Set<String> noOWoInString = new HashSet<String>(); //a Set keeps each word only once
    String[] words = str.split(" ");
    //noOWoInString.addAll(Arrays.asList(words)); would do the same as the loop
    for(String wrd:words){
        noOWoInString.add(wrd);
    }
    return noOWoInString.size();
}
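Note that splitting on a single space treats "Cat", "cat" and "cat," as three different words and yields empty strings when spaces repeat. Whether that matters depends on the input. A hedged variant that lower-cases the text and splits on anything that is not a letter, digit or apostrophe could look like this (the class name and the choice of delimiter pattern are assumptions, not part of the original answer):

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class NormalizedDistinctWords {
    static int countDistinctWords(String str) {
        Set<String> distinct = new HashSet<>();
        // Lower-case first so "The" and "the" count as the same word,
        // then split on runs of characters that are not letters, digits or apostrophes.
        for (String word : str.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (!word.isEmpty()) { // a leading delimiter produces one empty token
                distinct.add(word);
            }
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinctWords("The cat, the  CAT and the dog.")); // 4
    }
}
```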

Thanks, Sagar


Author by

Admin

Updated on June 06, 2022

Comments

Admin almost 2 years

I am trying to count the number of distinct words in the text, using Java.

A word can be a unigram, bigram or trigram noun. These three have already been identified using the Stanford POS tagger, but I'm not able to calculate the words whose frequency is greater than or equal to one, two, three, four and five, and their counts.