how to use word_tokenize in data frame

Solution 1

You can use the apply method of the DataFrame API:

import pandas as pd
import nltk

df = pd.DataFrame({'sentences': [
    'This is a very good site. I will recommend it to others.',
    'Can you please give me a call at 9983938428. have issues with the listings.',
    'good work! keep it up',
]})
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)

Output:

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  
0  [This, is, a, very, good, site, ., I, will, re...  
1  [Can, you, please, give, me, a, call, at, 9983...  
2                      [good, work, !, keep, it, up]

To find the length of each text, use apply with a lambda function again:

df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  sents_length  
0  [This, is, a, very, good, site, ., I, will, re...            14  
1  [Can, you, please, give, me, a, call, at, 9983...            15  
2                      [good, work, !, keep, it, up]             6  
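
As a side note, the tokenizing and length steps can also be applied to the column directly rather than row-wise; here is a minimal sketch, reusing the df built above:

# Same imports and df as above; apply to the 'sentences' Series directly
df['tokenized_sents'] = df['sentences'].apply(nltk.word_tokenize)
df['sents_length'] = df['tokenized_sents'].apply(len)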

Solution 2

pandas.Series.apply is faster than pandas.DataFrame.apply

import time

import pandas as pd
import nltk

df = pd.read_csv("/path/to/file.csv")

# Tokenize via Series.apply
start = time.time()
df["unigrams"] = df["verbatim"].apply(nltk.word_tokenize)
print("series.apply", (time.time() - start))

# Tokenize via DataFrame.apply
start = time.time()
df["unigrams2"] = df.apply(lambda row: nltk.word_tokenize(row["verbatim"]), axis=1)
print("dataframe.apply", (time.time() - start))

On a sample 125 MB CSV file:

series.apply 144.428858995

dataframe.apply 201.884778976

Edit: You might think that the DataFrame df is larger after series.apply(nltk.word_tokenize), which could affect the runtime of the subsequent dataframe.apply(nltk.word_tokenize) call.

Pandas optimizes under the hood for such a scenario; I got a similar runtime of about 200 seconds when performing dataframe.apply(nltk.word_tokenize) on its own.
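
If you want to check that yourself, one way is a sketch like the following, which times dataframe.apply in a fresh run with no prior series.apply call (assuming the same hypothetical CSV with a "verbatim" column):

import time

import pandas as pd
import nltk

# Same hypothetical CSV with a "verbatim" text column as above
df = pd.read_csv("/path/to/file.csv")

# Time dataframe.apply on its own, so any growth of df caused by an
# earlier series.apply call cannot influence the measurement
start = time.time()
df["unigrams2"] = df.apply(lambda row: nltk.word_tokenize(row["verbatim"]), axis=1)
print("dataframe.apply (isolated)", time.time() - start)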

Solution 3

Here is an example. Suppose you have a DataFrame named twitter_df in which you have stored sentiment and text. First, extract the text column into a Series:

 tweetText = twitter_df['text']

Then, to tokenize:

 from nltk.tokenize import word_tokenize

 tweetText = tweetText.apply(word_tokenize)
 tweetText.head()
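
If you also want the tokens and their counts back on the DataFrame (as the original question asks), here is a small sketch, still assuming the hypothetical twitter_df with a 'text' column:

 from nltk.tokenize import word_tokenize

 # Assumes twitter_df has a 'text' column, as in the example above
 twitter_df['tokens'] = twitter_df['text'].apply(word_tokenize)
 twitter_df['num_tokens'] = twitter_df['tokens'].apply(len)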

I hope this helps.

Solution 4

You may need to add str() to convert pandas' object dtype values to plain strings.

Keep in mind that a faster way to count words is often to count spaces.

Interestingly, the tokenizer counts periods as tokens. You may want to remove those first, and perhaps numbers as well. The cleanup line below does exactly that; with it applied, both counting methods give equal counts, at least in this case.

import nltk
import pandas as pd

sentences = pd.Series([ 
    'This is a very good site. I will recommend it to others.',
    'Can you please give me a call at 9983938428. have issues with the listings.',
    'good work! keep it up',
    'not a very helpful site in finding home decor. '
])

# remove anything but letters and spaces, then collapse repeated spaces
sentences = sentences.str.replace('[^A-Za-z ]', '', regex=True).str.replace(' +', ' ', regex=True).str.strip()

splitwords = [ nltk.word_tokenize( str(sentence) ) for sentence in sentences ]
print(splitwords)
    # output: [['This', 'is', 'a', 'very', 'good', 'site', 'I', 'will', 'recommend', 'it', 'to', 'others'], ['Can', 'you', 'please', 'give', 'me', 'a', 'call', 'at', 'have', 'issues', 'with', 'the', 'listings'], ['good', 'work', 'keep', 'it', 'up'], ['not', 'a', 'very', 'helpful', 'site', 'in', 'finding', 'home', 'decor']]

wordcounts = [ len(words) for words in splitwords ]
print(wordcounts)
    # output: [12, 13, 5, 9]

wordcounts2 = [ sentence.count(' ') + 1 for sentence in sentences ]
print(wordcounts2)
    # output: [12, 13, 5, 9]

If you aren't using Pandas, you might not need str().
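
For instance, a minimal sketch without Pandas, where the values are already plain Python strings:

import nltk

plain_sentences = [
    'This is a very good site. I will recommend it to others.',
    'good work! keep it up',
]

# Plain str values need no explicit conversion before tokenizing
tokens = [nltk.word_tokenize(sentence) for sentence in plain_sentences]
print(tokens)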

Comments

  • eclairs
    eclairs almost 2 years

    I have recently started using the nltk module for text analysis. I am stuck at one point: I want to use word_tokenize on a dataframe, so as to obtain all the words used in a particular row of the dataframe.

    data example:
           text
    1.   This is a very good site. I will recommend it to others.
    2.   Can you please give me a call at 9983938428. have issues with the listings.
    3.   good work! keep it up
    4.   not a very helpful site in finding home decor. 
    
    expected output:
    
    1.   'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
    2.   'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
    3.   'good','work','!','keep','it','up'
    4.   'not','a','very','helpful','site','in','finding','home','decor'
    

    Basically, I want to separate all the words and find the length of each text in the dataframe.

    I know word_tokenize works on a string, but how do I apply it to the entire dataframe?

    Please help!

    Thanks in advance...

  • eclairs
    eclairs over 8 years
    How can we do this when there are multiple rows in the dataframe?
  • ilyakhov
    ilyakhov over 8 years
    @eclairs, what do you mean?
  • eclairs
    eclairs over 8 years
    I am getting this error message when trying to tokenize: TypeError: ('expected string or buffer', u'occurred at index 1')
  • ilyakhov
    ilyakhov over 8 years
    I don't have enough information about your case; please state in the question exactly which dataframe you are using. The data in your question are in the wrong format. Have you tried running my code step by step? Did it work on your machine?
  • eclairs
    eclairs over 8 years
    There is one basic difference between your steps and mine: I have made a duplicate of another dataframe. My actual data frame is comment = pd.DataFrame(feedbacks, columns=['date','id', 'rating','comment','photos', 'home_info', 'neighbourhood', 'other_comment', 'uid', 'sid']). From this I created a duplicate, comments = comment[['comment']], and then tokenized it with df['tokenized_words'] = comments.apply(lambda row: nltk.word_tokenize(row['comment']), axis=1). I get the error message at the last step.
  • ilyakhov
    ilyakhov over 8 years
    You should modify the current question ("how to use word_tokenize in data frame") or ask a new one, because the subject of your last comment is outside its scope.