How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?


In short:

df['Text'].apply(word_tokenize)

Or if you want to add another column to store the tokenized list of strings:

df['tokenized_text'] = df['Text'].apply(word_tokenize) 
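
For a self-contained run, something like this works (a minimal sketch; the column name Text and the sample tweets are placeholders, and it assumes the NLTK punkt tokenizer data has been downloaded):

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # word_tokenize needs the punkt model; download it once

# Placeholder data; any DataFrame with a text column works the same way
df = pd.DataFrame({'Text': ["This is a tweet!", "Another tweet :)"]})

df['tokenized_text'] = df['Text'].apply(word_tokenize)
print(df['tokenized_text'])
# 0    [This, is, a, tweet, !]
# 1     [Another, tweet, :, )]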

There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

To use nltk.tokenize.TweetTokenizer:

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
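
The difference shows on Twitter-specific tokens: TweetTokenizer keeps mentions, hashtags, and emoticons intact, where word_tokenize splits them apart (a small sketch with a made-up tweet):

from nltk.tokenize import TweetTokenizer, word_tokenize

text = "@user loving #NLTK :)"
tt = TweetTokenizer()

print(word_tokenize(text))   # ['@', 'user', 'loving', '#', 'NLTK', ':', ')']
print(tt.tokenize(text))     # ['@user', 'loving', '#NLTK', ':)']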


Comments

  • Vic13 over 3 years

    This is the code I am using for semantic analysis of Twitter data:

    import pandas as pd
    import datetime
    import numpy as np
    import re
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.stem.porter import PorterStemmer
    
    df = pd.read_csv('twitDB.csv', header=None,
                     sep=',', error_bad_lines=False, encoding='utf-8')
    
    # Combine the first four columns into one lower-cased tweet string per row.
    hula = df[[0, 1, 2, 3]]
    hula = hula.fillna(0)
    hula['tweet'] = (hula[0].astype(str) + hula[1].astype(str)
                     + hula[2].astype(str) + hula[3].astype(str))
    hula["tweet"] = hula.tweet.str.lower()
    
    # Collapse runs of whitespace and repeated dots.
    ho = hula["tweet"]
    ho = ho.replace(r'\s+', ' ', regex=True)
    ho = ho.replace(r'\.+', '.', regex=True)
    
    # Strip punctuation (str.replace works on substrings; Series.replace
    # without regex=True only matches whole cell values).
    special_char_list = [':', ';', '?', '}', ')', '{', '(']
    for special_char in special_char_list:
        ho = ho.str.replace(re.escape(special_char), '', regex=True)
    print(ho)
    
    # Replace URLs with a placeholder, strip '#' from hashtags, drop quotes.
    ho = ho.replace(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', regex=True)
    ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)
    ho = ho.replace(r'[\'"]', '', regex=True)
    
    lem = WordNetLemmatizer()
    stem = PorterStemmer()
    
    eng_stopwords = stopwords.words('english')
    
    # Flatten the whole Series into one big string, stem it, then tokenize.
    ho = ho.to_frame(name=None)
    a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                     index=True, na_rep='NaN', formatters=None, float_format=None,
                     sparsify=False, index_names=True, justify=None, line_width=None,
                     max_rows=None, max_cols=None, show_dimensions=False)
    fg = stem.stem(a)  # moved here: 'a' must be defined before stemming it
    wordList = word_tokenize(fg)
    wordList = [word for word in wordList if word not in eng_stopwords]
    print(wordList)
    

    Input (i.e. a):

                                                  tweet
    0     1495596971.6034188::automotive auto ebc greens...
    1     1495596972.330948::new free stock photo of cit...
    

    I am getting the output (wordList) in this format:

    tweet
     0
    1495596971.6034188
    :
    :automotive
    auto
    

    I want the output for each row kept in row format, i.e. one token list per row. How can I do that? If you have better code for semantic analysis of Twitter data, please share it with me.
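
    Applying the tokenizer per row, as in the answer above, keeps one token list per row instead of one flat list. A sketch, reusing word_tokenize, stem, and eng_stopwords from the code above and the cleaned ho Series before the to_frame()/to_string() step:

    def tokenize_row(text):
        # Tokenize one tweet, drop stopwords, stem what remains.
        tokens = word_tokenize(text)
        return [stem.stem(tok) for tok in tokens if tok not in eng_stopwords]

    token_lists = ho.apply(tokenize_row)  # one token list per row
    print(token_lists)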

  • alvas almost 7 years
    I'm glad the answer helped.
  • alvas almost 7 years
    Your questions are going to get closed easily if you don't strip the irrelevant parts of your code and only post information crucial to your question. Make edits to the new question you ask ;P
  • Vic13 almost 7 years
    Sure, will do that and ask again. Thanks :)
  • bernando_vialli over 5 years
    @alvas, do you know why I am getting TypeError: expected string or bytes-like object when running your code above on my pandas dataframe column with text? My only difference is that I am using sent_tokenize to just split into sentences as opposed to words.
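
    That TypeError usually means the column contains non-string values such as NaN; casting to str (or dropping missing rows) before applying the tokenizer is a common fix. A sketch, with a hypothetical text column:

    import pandas as pd
    from nltk.tokenize import sent_tokenize

    # Hypothetical frame with a missing value that would trigger the TypeError.
    df = pd.DataFrame({'text': ["First sentence. Second one.", None]})

    # astype(str) turns None into the string 'None'; use dropna() to skip it instead.
    df['sentences'] = df['text'].astype(str).apply(sent_tokenize)
    print(df['sentences'])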