how to use word_tokenize in data frame
Solution 1
You can use the `apply` method of the DataFrame API:
import pandas as pd
import nltk
df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.', 'Can you please give me a call at 9983938428. have issues with the listings.', 'good work! keep it up']})
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)
Output:
>>> df
sentences \
0 This is a very good site. I will recommend it ...
1 Can you please give me a call at 9983938428. h...
2 good work! keep it up
tokenized_sents
0 [This, is, a, very, good, site, ., I, will, re...
1 [Can, you, please, give, me, a, call, at, 9983...
2 [good, work, !, keep, it, up]
To find the length of each text, use `apply` with a lambda again:
df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)
>>> df
sentences \
0 This is a very good site. I will recommend it ...
1 Can you please give me a call at 9983938428. h...
2 good work! keep it up
tokenized_sents sents_length
0 [This, is, a, very, good, site, ., I, will, re... 14
1 [Can, you, please, give, me, a, call, at, 9983... 15
2 [good, work, !, keep, it, up] 6
Solution 2
pandas.Series.apply is faster than pandas.DataFrame.apply
import time

import pandas as pd
import nltk

df = pd.read_csv("/path/to/file.csv")

start = time.time()
df["unigrams"] = df["verbatim"].apply(nltk.word_tokenize)
print("series.apply", time.time() - start)

start = time.time()
df["unigrams2"] = df.apply(lambda row: nltk.word_tokenize(row["verbatim"]), axis=1)
print("dataframe.apply", time.time() - start)
On a sample 125 MB csv file,
series.apply 144.428858995
dataframe.apply 201.884778976
Edit: You might suspect that the DataFrame df is larger after series.apply(nltk.word_tokenize), and that this inflates the runtime of the subsequent dataframe.apply(nltk.word_tokenize). Pandas optimizes under the hood for such a scenario; I got a similar runtime of 200s by running dataframe.apply(nltk.word_tokenize) on its own.
Solution 3
I will show you an example. Suppose you have a data frame named twitter_df in which you have stored sentiment and text. First, extract the text data into a Series as follows:
tweetText = twitter_df['text']
then, to tokenize:
from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
tweetText.head()
I hope this helps.
Solution 4
You may need to wrap values in str() to convert pandas' object dtype to a string.
Keep in mind that a faster way to count words is often to count spaces.
It is interesting that the tokenizer counts periods as tokens. You may want to remove those first, and perhaps numbers as well. Applying the cleanup line below results in equal counts between the two methods, at least in this case.
import nltk
import pandas as pd
sentences = pd.Series([
'This is a very good site. I will recommend it to others.',
'Can you please give me a call at 9983938428. have issues with the listings.',
'good work! keep it up',
'not a very helpful site in finding home decor. '
])
# remove anything but letters and spaces
# (note: [A-z] is a common bug; it also matches [ \ ] ^ _ and backtick)
sentences = sentences.str.replace('[^A-Za-z ]', '', regex=True).str.replace(' +', ' ', regex=True).str.strip()
splitwords = [ nltk.word_tokenize( str(sentence) ) for sentence in sentences ]
print(splitwords)
# output: [['This', 'is', 'a', 'very', 'good', 'site', 'I', 'will', 'recommend', 'it', 'to', 'others'], ['Can', 'you', 'please', 'give', 'me', 'a', 'call', 'at', 'have', 'issues', 'with', 'the', 'listings'], ['good', 'work', 'keep', 'it', 'up'], ['not', 'a', 'very', 'helpful', 'site', 'in', 'finding', 'home', 'decor']]
wordcounts = [ len(words) for words in splitwords ]
print(wordcounts)
# output: [12, 13, 5, 9]
wordcounts2 = [ sentence.count(' ') + 1 for sentence in sentences ]
print(wordcounts2)
# output: [12, 13, 5, 9]
If you aren't using pandas, you might not need str().
eclairs
Updated on July 09, 2022
Comments
-
eclairs almost 2 years
I have recently started using the nltk module for text analysis. I am stuck at a point. I want to use word_tokenize on a dataframe, so as to obtain all the words used in a particular row of the dataframe.
data example:
text
1. This is a very good site. I will recommend it to others.
2. Can you please give me a call at 9983938428. have issues with the listings.
3. good work! keep it up
4. not a very helpful site in finding home decor.
expected output:
1. 'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2. 'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3. 'good','work','!','keep','it','up'
4. 'not','a','very','helpful','site','in','finding','home','decor'
Basically, I want to separate all the words and find the length of each text in the dataframe.
I know word_tokenize can do it for a string, but how do I apply it to the entire dataframe?
Please help!
Thanks in advance...
-
eclairs over 8 yearshow can we do this when there are multiple rows in the dataframe?
-
ilyakhov over 8 years@eclairs, what do you mean?
eclairs over 8 yearsI am getting this error message when trying to tokenize: TypeError: ('expected string or buffer', u'occurred at index 1')
-
ilyakhov over 8 yearsI don't have enough information about your case; edit the question to show exactly which dataframe you use. The data in your question has the wrong format. Have you tried running my code step by step? Did it work on your machine?
-
eclairs over 8 yearsThere is one basic difference between your steps and mine: I have made a duplicate of another dataframe. My actual dataframe is comment=pd.DataFrame(feedbacks, columns=['date','id', 'rating','comment','photos', 'home_info', 'neighbourhood', 'other_comment', 'uid', 'sid']). From this I created a duplicate, comments=comment[['comment']], and then tokenized it as df['tokenized_words'] = comments.apply(lambda row: nltk.word_tokenize(row['comment']), axis=1). I am getting the error message at the last step.
-
ilyakhov over 8 yearsYou should edit the current question ("how to use word_tokenize in data frame") or ask a new one, because the subject of your last comment is outside the scope of this question.