How to calculate number of words in a string in DataFrame?

44,832

Solution 1

IIUC then you can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Here we use the vectorised str.split to split on spaces, and then apply len to get the count of the number of elements, we can then call value_counts to aggregate the frequency count.

We then rename the index and sort it to get the desired output

UPDATE

This can also be done using str.len rather than apply which should scale better:

In [41]:
count = df['fruits'].str.split().str.len()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
0 words:    2
1 words:    1
2 words:    3
3 words:    4
4 words:    2
5 words:    1
Name: fruits, dtype: int64

Timings

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop

For a 6K df:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop

Solution 2

You could use str.count with space ' ' as delimiter.

In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

In [1717]: count.index = count.index.astype('str') + ' words:'

In [1718]: count
Out[1718]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Timings

str.count is marginally faster

Small

In [1724]: df.shape
Out[1724]: (6, 1)

In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 µs per loop

In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 µs per loop

Medium

In [1728]: df.shape
Out[1728]: (6000, 1)

In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop

In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop

Large

In [1732]: df.shape
Out[1732]: (60000, 1)

In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop

In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop
Share:
44,832

Related videos on Youtube

Sergei
Author by

Sergei

Updated on October 22, 2020

Comments

  • Sergei
    Sergei over 3 years

    Suppose we have simple Dataframe

    df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
    df.columns = ['fruits']
    

    how to calculate number of words in keywords, similar to:

    1 word: 2
    2 words: 2
    3 words: 1
    4 words: 1
    
  • Lucecpkn
    Lucecpkn about 3 years
    At first I was confused how could we use str.len after the str.split which turns each element into a list instead of str. So, I checked the doc and realized the function works as long as the element is of list-like types. Hope this clarification might be helpful for laymen like me who had the same question.