Stopword removal with NLTK and Pandas

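You are expecting `apply` to modify the column in place, but it returns a new Series and leaves the original untouched. Assign the result back to the column: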

    df['Title'] = df['Title'].apply(lambda x: [item for item in x if item not in stop])
    df['Body'] = df['Body'].apply(lambda x: [item for item in x if item not in stop])
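
To see the difference between calling `apply` and assigning its result, here is a minimal sketch. It assumes a tiny made-up DataFrame and a hand-written `stop` set standing in for `stopwords.words('english')`, so it runs without downloading the NLTK corpus; neither is from the original question.

```python
import pandas as pd

# Hypothetical stand-in for stopwords.words('english'); a set also makes
# the "not in stop" membership test faster than a list would.
stop = {"the", "a", "is", "of"}

# Made-up sample data, tokenized the same way as in the question.
df = pd.DataFrame({"Title": ["The art of war", "A tale"]})
df["Title"] = df["Title"].str.lower().str.split()

# apply returns a new Series; the result is discarded here,
# so the column still contains the stopwords.
df["Title"].apply(lambda words: [w for w in words if w not in stop])
unchanged = df["Title"].tolist()

# Assigning the result back stores the filtered lists in the column.
df["Title"] = df["Title"].apply(lambda words: [w for w in words if w not in stop])
filtered = df["Title"].tolist()

print(unchanged)
print(filtered)
```

With the real NLTK list, wrapping it once as `set(stopwords.words('english'))` should give the same filtering with cheaper lookups on a file of this size.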
Author by

slm

Updated on July 19, 2022

Comments

  • slm
    slm almost 2 years

    I have some issues with Pandas and NLTK. I am new to programming, so excuse me if I ask questions that might be easy to solve. I have a CSV file with 3 columns (Id, Title, Body) and about 15,000 rows.

    My goal is to remove the stopwords from this CSV file. The lowercase and split operations work well, but I cannot find the mistake that keeps the stopwords from being removed. What am I missing?

        import pandas as pd
        from nltk.corpus import stopwords
    
        pd.read_csv("test10in.csv", encoding="utf-8") 
    
        df = pd.read_csv("test10in.csv") 
    
        df.columns = ['Id','Title','Body']
        df['Title'] = df['Title'].str.lower().str.split()  
        df['Body'] = df['Body'].str.lower().str.split() 
    
    
        stop = stopwords.words('english')
    
        df['Title'].apply(lambda x: [item for item in x if item not in stop])
        df['Body'].apply(lambda x: [item for item in x if item not in stop])
    
        df.to_csv("test10out.csv")
    
  • slm
    slm over 8 years
    Thank you very much. My lack of programming experience, I guess :) You are the hero of the day.
  • AbtPst
    AbtPst over 8 years
    You're welcome. I recently started using NLTK for text processing, so I have made these mistakes myself :) Glad to help. Also, check out Kaggle.com for some cool introductory tutorials on text analytics. All the best :)