Count of most popular words in a pandas DataFrame

python-3.x csv pandas dataframe

13,762

Here is an NLTK solution, which ignores English stopwords (for example: in, on, of, the):

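```
import matplotlib.pyplot as plt
import pandas as pd
import nltk

# One-time setup if the NLTK data is missing: fetch the 'stopwords' corpus
# and the Punkt tokenizer data with nltk.download(...).

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title','plot_keywords'])

# lower-case, turn the '|' separators into spaces, join all rows into one string
# (regex=False treats '|' literally; the default for regex changed in pandas 2.0)
txt = df.plot_keywords.str.lower().str.replace('|', ' ', regex=False).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

# a set makes the membership test below fast
stopwords = set(nltk.corpus.stopwords.words('english'))
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)

print('All frequencies, including STOPWORDS:')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

# top N words without stopwords, indexed by word for plotting
rslt = pd.DataFrame(words_except_stop_dist.most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')

plt.style.use('ggplot')

rslt.plot.bar(rot=0)
plt.show()
```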

Output:

```
All frequencies, including STOPWORDS:
============================================================
     Word  Frequency
0      in        339
1  female        301
2   title        289
3  nudity        259
4    love        248
5      on        240
6  school        238
7  friend        228
8      of        222
9     the        212
============================================================
```
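
A note on the step that turns the `|` separators into spaces: how `Series.str.replace` reads its pattern depends on the `regex` argument. Since pandas 2.0 the pattern is taken as a literal string unless `regex=True` is passed, so an escaped pattern such as `r'\|'` only matches when the flag is given explicitly. The sketch below is a minimal illustration, using the two sample rows quoted in the question as stand-in data; the variable names are only for this example.

```python
import pandas as pd

# Stand-in data: the two sample rows quoted in the question.
keywords = pd.Series([
    'blood|book|love|potion|professor',
    'blackbeard|captain|pirate|revenge|soldier',
])

# Variant 1: treat '|' as a plain character (no regex involved).
literal = keywords.str.replace('|', ' ', regex=False)

# Variant 2: keep the escaped regex pattern, but say so explicitly.
escaped = keywords.str.replace(r'\|', ' ', regex=True)

print(literal.tolist())
print(literal.equals(escaped))  # both variants give the same result
```

Passing `regex` explicitly in either form keeps the behaviour the same across pandas versions.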

A pandas solution, which uses the stopwords from the NLTK module:

```
import re
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import nltk

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title','plot_keywords'])

stopwords = nltk.corpus.stopwords.words('english')
# RegEx matching any stopword as a whole word
RE_stopwords = r'\b(?:{})\b'.format('|'.join(map(re.escape, stopwords)))
# replace '|'-->' ' and drop all stopwords
words = (df.plot_keywords
           .str.lower()
           .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
           .str.cat(sep=' ')
           .split()
)

# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)

# plot
rslt.plot.bar(rot=0, figsize=(16,10), width=0.8)
plt.show()
```

Output:

```
        Frequency
Word
female        301
title         289
nudity        259
love          248
school        238
friend        228
police        210
male          205
death         195
sex           192
```
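
Both solutions above count individual words. If whole keywords (everything between two `|` separators) should be counted instead, one possible variant is to split each row and count the pieces directly. The sketch below assumes pandas 0.25 or later (for `Series.explode`) and uses the two sample rows from the question plus one hypothetical row as stand-in data; it is an illustration, not a drop-in replacement for the code above.

```python
import pandas as pd

# Stand-in data: the two sample rows quoted in the question,
# one hypothetical row so that some keywords repeat, and a missing value.
df = pd.DataFrame({
    'plot_keywords': [
        'blood|book|love|potion|professor',
        'blackbeard|captain|pirate|revenge|soldier',
        'love|pirate|ship',   # hypothetical row
        None,                 # missing keywords, dropped below
    ]
})

top_N = 3

keyword_counts = (
    df['plot_keywords']
      .dropna()            # skip rows without keywords
      .str.lower()
      .str.split('|')      # one list of keywords per row
      .explode()           # one keyword per row
      .value_counts()      # sorted by frequency, highest first
      .head(top_N)
)

print(keyword_counts)
```

On the real dataset, `keyword_counts.plot.bar(rot=0)` would then draw the bar chart. No stopword filtering is applied here, since each keyword is kept as a whole phrase.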

Author by

Admin

Updated on July 05, 2022

Comments

Admin almost 2 years
I use a CSV data file containing movie data. In this dataset there is a column named plot_keywords. I want to find the 10 or 20 most popular keywords and the number of times they show up, and plot them in a bar chart. To be more specific, I copied 2 instances as they show up when I print the DataFrame:

9 blood|book|love|potion|professor

18 blackbeard|captain|pirate|revenge|soldier

I open the CSV file as a pandas DataFrame. Here is the code I have so far:
```
import pandas as pd
data=pd.read_csv('data.csv')
pd.Series(' '.join(data['plot_keywords']).lower().split()).value_counts()[:10]
```
None of the other posts have helped me so far. Thanks in advance.

https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset/kernels