How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?
In short:
df['Text'].apply(word_tokenize)
Or if you want to add another column to store the tokenized list of strings:
df['tokenized_text'] = df['Text'].apply(word_tokenize)
There are also tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual
To use nltk.tokenize.TweetTokenizer:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
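As a minimal sketch of the TweetTokenizer approach, the snippet below stores the tokens in a new column. The sample tweets are made up for illustration, and the constructor options shown (preserve_case, strip_handles, reduce_len) are optional choices for this example, not something the answer above prescribes.

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

# Hypothetical sample data; the 'Text' column name follows the answer above.
df = pd.DataFrame({"Text": [
    "@alice loving the new #nltk release!!! :)",
    "Sooooooo goooood http://example.com",
]})

# strip_handles drops @mentions, reduce_len shortens runs of a repeated
# character to three, and preserve_case=False lowercases ordinary tokens.
tt = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

# Each cell of the new column holds a list of token strings.
df["tokenized_text"] = df["Text"].apply(tt.tokenize)
print(df["tokenized_text"].tolist())
```

Hashtags such as #nltk stay together as single tokens, which is the main reason to prefer this tokenizer over word_tokenize for tweets.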
Similar to:
How to apply pos_tag_sents() to pandas dataframe efficiently
Comments
-
Vic13 over 3 years
This is the code that I am using for semantic analysis of Twitter:
import pandas as pd
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
df=pd.read_csv('twitDB.csv',header=None, sep=',',error_bad_lines=False,encoding='utf-8')
hula=df[[0,1,2,3]]
hula=hula.fillna(0)
hula['tweet'] = hula[0].astype(str) +hula[1].astype(str)+hula[2].astype(str)+hula[3].astype(str)
hula["tweet"]=hula.tweet.str.lower()
ho=hula["tweet"]
ho = ho.replace('\s+', ' ', regex=True)
ho=ho.replace('\.+', '.', regex=True)
special_char_list = [':', ';', '?', '}', ')', '{', '(']
for special_char in special_char_list:
    ho=ho.replace(special_char, '')
print(ho)
ho = ho.replace('((www\.[\s]+)|(https?://[^\s]+))','URL',regex=True)
ho =ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho =ho.replace('\'"',regex=True)
lem = WordNetLemmatizer()
stem = PorterStemmer()
fg=stem.stem(a)
eng_stopwords = stopwords.words('english')
ho = ho.to_frame(name=None)
a=ho.to_string(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=False, index_names=True, justify=None, line_width=None, max_rows=None, max_cols=None, show_dimensions=False)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print (wordList)
Input, i.e. a:
tweet
0 1495596971.6034188::automotive auto ebc greens...
1 1495596972.330948::new free stock photo of cit...
I am getting the output (wordList) in this format:
tweet 0 1495596971.6034188 : :automotive auto
I want the output for each row to stay in its own row. How can I do that? If you have better code for semantic analysis of Twitter, please share it with me.
-
alvas almost 7 years
I'm glad the answer helped.
-
alvas almost 7 years
Your questions are going to get closed easily if you don't strip the irrelevant parts of your code and only post information crucial to your question. Make edits to the new question you ask ;P
-
Vic13 almost 7 years
Sure, will do that and ask again. Thanks :)
-
bernando_vialli over 5 years
@alvas, do you know why I am getting "TypeError: expected string or bytes-like object" when running your code above on my pandas dataframe column with text? My only difference is that I am using sent_tokenize to split into sentences instead of words.
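One common cause of the TypeError in the comment above is a non-string value in the column, for example NaN from a missing cell. That is an assumption here, since the commenter's data is not shown. A minimal sketch of a guard follows; the helper name tokenize_or_empty and the sample rows are hypothetical, and TweetTokenizer from the answer is used only as the example tokenizer.

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

# Hypothetical data: the middle row is missing and holds a float NaN.
df = pd.DataFrame({"Text": ["first tweet here", float("nan"), "another one"]})

tt = TweetTokenizer()

def tokenize_or_empty(value):
    # Hypothetical helper: only strings reach the tokenizer,
    # any other value (NaN, None, numbers) becomes an empty list.
    return tt.tokenize(value) if isinstance(value, str) else []

df["tokenized_text"] = df["Text"].apply(tokenize_or_empty)
print(df["tokenized_text"].tolist())
```

The same isinstance check can wrap sent_tokenize or word_tokenize. Whether empty lists, dropped rows, or str() conversion is the right choice depends on the data.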