Python stemming (with pandas dataframe)

You have to apply the stemmer to each word individually and store the resulting list of stems in a new "stemmed" column.

df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
df = df.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
df # Print dataframe.

+----+--------------------------------------------------------------+
|    | stemmed                                                      |
|----+--------------------------------------------------------------|
|  0 | ['programm', 'program', 'with', 'program', 'languag']        |
|  1 | ['my', 'code', 'is', 'work', 'so', 'there', 'must',          |
|    |  'be', 'a', 'bug', 'in', 'the', 'interpret']                 |
+----+--------------------------------------------------------------+
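For reference, here is a minimal end-to-end sketch of the same approach, assuming the column starts out as whole sentences as in the question. The helper name stem_tokens and the extra stemmed_text column are made up for illustration and are not part of the original answer.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def stem_tokens(tokens):
    """Stem each word in a list of words (hypothetical helper)."""
    return [stemmer.stem(word) for word in tokens]


# Same sentences as in the question.
data = [
    "programmers program with programming languages",
    "my code is working so there must be a bug in the interpreter",
]
df = pd.DataFrame(data, columns=["unstemmed"])

# Split each sentence into a list of words, then stem word by word.
df["stemmed"] = df["unstemmed"].str.split().apply(stem_tokens)

# Optional: join the stems back into one string per row, in case a later
# step (for example a text vectorizer) expects strings rather than lists.
df["stemmed_text"] = df["stemmed"].str.join(" ")

print(df[["stemmed", "stemmed_text"]])
```

Whether you keep the lists or the joined strings depends on what your classifier's feature extraction step takes as input.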
Author by

Chiel

Updated on May 25, 2021

Comments

  • Chiel
    Chiel almost 3 years

    I created a dataframe with sentences to be stemmed. I would like to use NLTK's SnowballStemmer to improve the accuracy of my classification algorithm. How can I achieve this?

    import pandas as pd
    from nltk.stem.snowball import SnowballStemmer
    
    # Use English stemmer.
    stemmer = SnowballStemmer("english")
    
    # Sentences to be stemmed.
    data = ["programmers program with programming languages", "my code is working so there must be a bug in the interpreter"] 
        
    # Create the pandas DataFrame.
    df = pd.DataFrame(data, columns=['unstemmed'])
    
    # Split the sentences to lists of words.
    df['unstemmed'] = df['unstemmed'].str.split()
    
    # Make sure we see the full column.
    pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated since pandas 1.0.
    
    # Print dataframe.
    df 
    
    +----+---------------------------------------------------------------+
    |    | unstemmed                                                     |
    |----+---------------------------------------------------------------|
    |  0 | ['programmers', 'program', 'with', 'programming', 'languages']|
    |  1 | ['my', 'code', 'is', 'working', 'so', 'there', 'must',        |  
    |    |  'be', 'a', 'bug', 'in', 'the', 'interpreter']                |
    +----+---------------------------------------------------------------+
    
    • Has QUIT--Anony-Mousse
      Has QUIT--Anony-Mousse almost 8 years
      What is the type of this column? String (= sentences), or array of strings (= words)? Don't apply the stemmer to a whole sentence; apply it to one word at a time.
  • Chiel
    Chiel almost 8 years
    Sorry if I am a little noobish, I am kinda new to Python as well as Stack Overflow.
  • arthur
    arthur almost 8 years
    OK. It's because you did it inside a for loop. Remove the line for w in data[["stemmed"]]: and it should work.
  • arthur
    arthur almost 8 years
    The apply method is designed to apply a function to every row or column of a dataframe, so you don't have to iterate over the rows or columns yourself. For more information, have a look at the docs: link
  • Chiel
    Chiel almost 8 years
    After removing the first for loop I still get the same type of error: imgur.com/AUaaqmM
  • arthur
    arthur almost 8 years
    Can you show me the dataframe just before executing the apply?
  • arthur
    arthur almost 8 years
    I would prefer a screenshot of your Python console after running: print data
  • arthur
    arthur almost 8 years
    Your data['stemmed'] column doesn't have exactly the same format as mine. I'll edit my answer.
  • Chiel
    Chiel almost 8 years
    That's awesome! By the way, I went from ~70% accuracy to 71.17% accuracy (after stemming) using the K-nearest neighbor algorithm, so it clearly helped.
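The two points raised in the comments (stem one word at a time, and let apply do the row iteration instead of an outer for loop) can be sketched as follows. This is only an illustration; the two-word sentence and the variable names are made up, not taken from the thread.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
sentence = "programming languages"

# Passing the whole sentence: the stemmer receives one single string,
# so this is not the same as stemming each word.
as_sentence = stemmer.stem(sentence)

# Passing one word at a time: every word gets stemmed on its own.
as_words = [stemmer.stem(word) for word in sentence.split()]

# With a column of word lists, apply already visits every row,
# so no explicit loop over the rows is needed.
column = pd.Series([["programming", "languages"], ["working", "interpreter"]])
stemmed = column.apply(lambda words: [stemmer.stem(word) for word in words])

print(as_sentence)
print(as_words)
print(stemmed.tolist())
```

If the column still holds plain strings, split them first (for example with .str.split()) so that each row is a list of words before calling apply.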