How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?


In short:

df['Text'].apply(word_tokenize)

Or if you want to add another column to store the tokenized list of strings:

df['tokenized_text'] = df['Text'].apply(word_tokenize) 
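
For a self-contained run, something like this works (a minimal sketch; the column name Text and the sample tweets are placeholders, and it assumes the NLTK punkt tokenizer data has been downloaded):

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # word_tokenize needs the punkt model; download it once

# Placeholder data; any DataFrame with a text column works the same way
df = pd.DataFrame({'Text': ["This is a tweet!", "Another tweet :)"]})

df['tokenized_text'] = df['Text'].apply(word_tokenize)
print(df['tokenized_text'])
# 0    [This, is, a, tweet, !]
# 1     [Another, tweet, :, )]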

There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

To use nltk.tokenize.TweetTokenizer:

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
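
The difference shows on Twitter-specific tokens: TweetTokenizer keeps mentions, hashtags, and emoticons intact, where word_tokenize splits them apart (a small sketch with a made-up tweet):

from nltk.tokenize import TweetTokenizer, word_tokenize

text = "@user loving #NLTK :)"
tt = TweetTokenizer()

print(word_tokenize(text))   # ['@', 'user', 'loving', '#', 'NLTK', ':', ')']
print(tt.tokenize(text))     # ['@user', 'loving', '#NLTK', ':)']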


Comments

  • Vic13 over 3 years

    This is the code I am using for semantic analysis of Twitter data:

    import pandas as pd
    import datetime
    import numpy as np
    import re
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.stem.porter import PorterStemmer
    
    df = pd.read_csv('twitDB.csv', header=None,
                     sep=',', error_bad_lines=False, encoding='utf-8')
    
    # Combine the first four columns into one lower-cased tweet string per row.
    hula = df[[0, 1, 2, 3]]
    hula = hula.fillna(0)
    hula['tweet'] = (hula[0].astype(str) + hula[1].astype(str)
                     + hula[2].astype(str) + hula[3].astype(str))
    hula["tweet"] = hula.tweet.str.lower()
    
    # Collapse runs of whitespace and repeated dots.
    ho = hula["tweet"]
    ho = ho.replace(r'\s+', ' ', regex=True)
    ho = ho.replace(r'\.+', '.', regex=True)
    
    # Strip punctuation (str.replace works on substrings; Series.replace
    # without regex=True only matches whole cell values).
    special_char_list = [':', ';', '?', '}', ')', '{', '(']
    for special_char in special_char_list:
        ho = ho.str.replace(re.escape(special_char), '', regex=True)
    print(ho)
    
    # Replace URLs with a placeholder, strip '#' from hashtags, drop quotes.
    ho = ho.replace(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', regex=True)
    ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)
    ho = ho.replace(r'[\'"]', '', regex=True)
    
    lem = WordNetLemmatizer()
    stem = PorterStemmer()
    
    eng_stopwords = stopwords.words('english')
    
    # Flatten the whole Series into one big string, stem it, then tokenize.
    ho = ho.to_frame(name=None)
    a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                     index=True, na_rep='NaN', formatters=None, float_format=None,
                     sparsify=False, index_names=True, justify=None, line_width=None,
                     max_rows=None, max_cols=None, show_dimensions=False)
    fg = stem.stem(a)  # moved here: 'a' must be defined before stemming it
    wordList = word_tokenize(fg)
    wordList = [word for word in wordList if word not in eng_stopwords]
    print(wordList)
    

    Input (i.e. a):

                                                  tweet
    0     1495596971.6034188::automotive auto ebc greens...
    1     1495596972.330948::new free stock photo of cit...
    

    I am getting the output (wordList) in this format:

    tweet
     0
    1495596971.6034188
    :
    :automotive
    auto
    

    I want the output for each row kept in row format, i.e. one token list per row. How can I do that? If you have better code for semantic analysis of Twitter data, please share it with me.
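
    Applying the tokenizer per row, as in the answer above, keeps one token list per row instead of one flat list. A sketch, reusing word_tokenize, stem, and eng_stopwords from the code above and the cleaned ho Series before the to_frame()/to_string() step:

    def tokenize_row(text):
        # Tokenize one tweet, drop stopwords, stem what remains.
        tokens = word_tokenize(text)
        return [stem.stem(tok) for tok in tokens if tok not in eng_stopwords]

    token_lists = ho.apply(tokenize_row)  # one token list per row
    print(token_lists)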

  • alvas almost 7 years
    I'm glad the answer helped.
  • alvas almost 7 years
    Your questions are going to get closed easily if you don't strip the irrelevant parts of your code and only post information crucial to your question. Make edits to the new question you ask ;P
  • Vic13 almost 7 years
    Sure, will do that and ask again. Thanks :)
  • bernando_vialli over 5 years
    @alvas, do you know why I am getting TypeError: expected string or bytes-like object when running your code above on my pandas dataframe column with text? My only difference is that I am using sent_tokenize to just split into sentences as opposed to words.
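
    That TypeError usually means the column contains non-string values such as NaN; casting to str (or dropping missing rows) before applying the tokenizer is a common fix. A sketch, with a hypothetical text column:

    import pandas as pd
    from nltk.tokenize import sent_tokenize

    # Hypothetical frame with a missing value that would trigger the TypeError.
    df = pd.DataFrame({'text': ["First sentence. Second one.", None]})

    # astype(str) turns None into the string 'None'; use dropna() to skip it instead.
    df['sentences'] = df['text'].astype(str).apply(sent_tokenize)
    print(df['sentences'])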