To count the frequency of each word

Solution 1

Here is a solution that should count all the word frequencies in a file:

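    // Requires: using System.Collections.Generic; using System.IO; using System.Text.RegularExpressions;
    private void countWordsInFile(string file, Dictionary<string, int> words)
    {
        var content = File.ReadAllText(file);

        // \w+ matches runs of word characters (letters, digits, underscore).
        var wordPattern = new Regex(@"\w+");

        foreach (Match match in wordPattern.Matches(content))
        {
            // TryGetValue leaves currentCount at 0 if the word has not been seen yet.
            int currentCount = 0;
            words.TryGetValue(match.Value, out currentCount);

            currentCount++;
            words[match.Value] = currentCount;
        }
    }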

You can call this code like this:

        var words = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);

        countWordsInFile("file1.txt", words);

After this call, words contains every word in the file together with its frequency (e.g. words["test"] returns the number of times that "test" occurs in the file content). If you need to accumulate the results from more than one file, simply call the method for each file with the same dictionary. If you need separate results for each file, create a new dictionary each time and use a structure like the one @DarkGray suggested.
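For the separate-results case, one possible shape is a dictionary keyed by file path whose values are the per-file word counts. The sketch below is only an assumption about what such a structure might look like; it is not taken from @DarkGray's answer. The `WordCounter` class and the `CountWords` and `CountWordsPerFile` names are hypothetical, and the `*.txt` filter assumes the text files use that extension.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class WordCounter
{
    private static readonly Regex WordPattern = new Regex(@"\w+");

    // Counts the words of one text into a fresh, case-insensitive dictionary.
    public static Dictionary<string, int> CountWords(string content)
    {
        var words = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);
        foreach (Match match in WordPattern.Matches(content))
        {
            int currentCount = 0;
            words.TryGetValue(match.Value, out currentCount);
            words[match.Value] = currentCount + 1;
        }
        return words;
    }

    // Maps each *.txt file in the directory to its own word-frequency dictionary.
    public static Dictionary<string, Dictionary<string, int>> CountWordsPerFile(string directory)
    {
        var result = new Dictionary<string, Dictionary<string, int>>();
        foreach (var file in Directory.EnumerateFiles(directory, "*.txt"))
        {
            result[file] = CountWords(File.ReadAllText(file));
        }
        return result;
    }
}
```

The outer dictionary is keyed by the paths that `Directory.EnumerateFiles` returns, so a lookup goes through the file path first and the word second. Indexing a word that never occurred throws `KeyNotFoundException`; use `TryGetValue` if a missing word should count as zero.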

Solution 2

There is a LINQ-based alternative which, in my opinion, is simpler. The key is to use the framework's built-in File.ReadLines (which reads the file lazily, line by line) and string.Split.

    private Dictionary<string, int> GetWordFrequency(string file)
    {
        return File.ReadLines(file)
                   .SelectMany(x => x.Split())
                   .Where(x => x != string.Empty)
                   .GroupBy(x => x)
                   .ToDictionary(x => x.Key, x => x.Count());
    }

To get frequencies from many files, you can have an overload based on params.

    private Dictionary<string, int> GetWordFrequency(params string[] files)
    {
        return files.SelectMany(x => File.ReadLines(x))
                    .SelectMany(x => x.Split())
                    .Where(x => x != string.Empty)
                    .GroupBy(x => x)
                    .ToDictionary(x => x.Key, x => x.Count());
    }
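
Note that string.Split() with no arguments splits on whitespace only, so punctuation stays attached to the words ("test," and "test" are counted separately) and the grouping is case-sensitive. If a word should instead be a run of letters, digits and underscores, as in the question, the same pipeline can be fed from a regex. This is a rough sketch under that assumption, not a drop-in replacement: the `LinqWordCounter` class and the `GetWordFrequencyFromLines` helper are hypothetical names, and the choice of `StringComparer.OrdinalIgnoreCase` is an assumption as well.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinqWordCounter
{
    // Counts \w+ matches across the given lines, ignoring case.
    public static Dictionary<string, int> GetWordFrequencyFromLines(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => Regex.Matches(line, @"\w+").Cast<Match>())
            .Select(match => match.Value)
            .GroupBy(word => word, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
    }

    // File-based wrapper; File.ReadLines keeps the read lazy.
    public static Dictionary<string, int> GetWordFrequency(string file)
    {
        return GetWordFrequencyFromLines(File.ReadLines(file));
    }
}
```

Taking the lines as an `IEnumerable<string>` keeps the counting logic separate from file access, so it can be tried on in-memory strings before pointing it at real files.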
Author by

Admin

Updated on July 02, 2022

Comments

  • Admin
    Admin almost 2 years

    There's a directory with a few text files. How do I count the frequency of each word in each file? A word is a sequence of characters that may contain letters, digits and underscores.

  • Admin
    Admin about 12 years
    Does this regex match only sequences of letters, digits and underscores? And which generic container should I use to store the words, their frequencies and the files?
  • Serj-Tm
    Serj-Tm about 12 years
    @Grienders Check current variant
  • Admin
    Admin about 12 years
    What does your code do? It does not do what I need! Does it count the frequency of each word, or does it count the total number of words?
  • Mayank Singh
    Mayank Singh almost 5 years
    Pass each file name to this piece of code in turn to get the frequencies for each file.