Python stemming (with pandas dataframe)
You have to apply the stemmer to each word in every row and store the result in a new "stemmed" column.
df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
df = df.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
df # Print dataframe.
+----+--------------------------------------------------------------+
|    | stemmed                                                      |
|----+--------------------------------------------------------------|
| 0  | ['program', 'program', 'with', 'program', 'languag']         |
| 1  | ['my', 'code', 'is', 'work', 'so', 'there', 'must',          |
|    | 'be', 'a', 'bug', 'in', 'the', 'interpret']                  |
+----+--------------------------------------------------------------+
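For reference, the whole flow can be sketched end to end as below. This is only a minimal sketch, assuming the sentences start out as plain strings in a column named `unstemmed`. The helper `stem_tokens` and the extra `stemmed_text` column are illustrative names that do not appear in the original answer, and joining the stems back into one string is an assumption about what a downstream text vectorizer would expect.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def stem_tokens(tokens):
    """Stem every word in a list of tokens (hypothetical helper)."""
    return [stemmer.stem(token) for token in tokens]


df = pd.DataFrame(
    {"unstemmed": ["my code is working", "there must be a bug in the interpreter"]}
)

# Split each sentence into words, then stem word by word.
df["stemmed"] = df["unstemmed"].str.split().apply(stem_tokens)

# Assumption: a text vectorizer would want strings again, so join the stems.
df["stemmed_text"] = df["stemmed"].str.join(" ")

print(df[["stemmed", "stemmed_text"]])
```

Keeping both the list column and the joined string column is a design choice for inspection; in practice one of the two is probably enough.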
Authored by
Chiel
Updated on May 25, 2021
Comments
-
Chiel almost 3 years
I created a dataframe with sentences to be stemmed. I would like to use a SnowballStemmer to obtain higher accuracy with my classification algorithm. How can I achieve this?
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programmers program with programming languages", "my code is working so there must be a bug in the interpreter"]

# Create the Pandas dataFrame.
df = pd.DataFrame(data, columns = ['unstemmed'])

# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()

# Make sure we see the full column (None disables truncation; -1 is rejected by recent pandas versions).
pd.set_option('display.max_colwidth', None)

# Print dataframe.
df

+----+---------------------------------------------------------------+
|    | unstemmed                                                     |
|----+---------------------------------------------------------------|
| 0  | ['programmers', 'program', 'with', 'programming', 'languages'] |
| 1  | ['my', 'code', 'is', 'working', 'so', 'there', 'must',        |
|    | 'be', 'a', 'bug', 'in', 'the', 'interpreter']                 |
+----+---------------------------------------------------------------+
-
Has QUIT--Anony-Mousse almost 8 years
What is the type of this column? A string (= sentence), or an array of strings (= words)? Don't apply the stemmer to a whole sentence, but to one word at a time.
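The distinction raised in this comment can be sketched as follows. This is an illustrative sketch only: `stem_cell` is a hypothetical helper, not something from the thread, and it assumes a cell is either a whitespace-separated string or a list of words.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def stem_cell(cell):
    """Stem one DataFrame cell word by word (hypothetical helper)."""
    # A sentence stored as a string is split on whitespace first.
    words = cell.split() if isinstance(cell, str) else cell
    return [stemmer.stem(word) for word in words]


# Passing a whole sentence to stem() treats it as one long "word":
# only the end of the string is considered for suffix stripping.
whole_sentence = stemmer.stem("my code is working")

as_string = stem_cell("my code is working")
as_list = stem_cell(["my", "code", "is", "working"])

print(whole_sentence)
print(as_string)
print(as_list)
```

Such a helper could then be passed to `apply` on the column, whichever of the two formats the column holds.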
-
Chiel almost 8 years
Sorry if I am a little noobish, I am kinda new to Python as well as Stack Overflow.
-
arthur almost 8 years
OK. It's because you did it inside a for loop. Remove
for w in data[["stemmed"]]:
and it should work.
-
arthur almost 8 years
The apply method is designed to apply a function to all rows/columns of a dataframe, so you don't have to iterate over the rows/columns yourself. For more information, you can have a look at the doc: link
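As a rough illustration of that point, the sketch below compares `apply` on a single column with an explicit loop. The column name `words` and the function `count_words` are made up for the example; on a single column (a Series), `apply` calls the function once per cell.

```python
import pandas as pd

df = pd.DataFrame({"words": [["a", "bug"], ["my", "code", "is", "working"]]})


def count_words(words):
    """Return the number of words in one cell (illustrative function)."""
    return len(words)


# Series.apply calls the function once per cell, so no explicit loop is needed.
df["n_words"] = df["words"].apply(count_words)

# Equivalent explicit loop, shown only for comparison.
looped = [count_words(words) for words in df["words"]]

print(df["n_words"].tolist())  # [2, 4]
print(looped)                  # [2, 4]
```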
-
Chiel almost 8 years
After removing the first for loop I still get the same type of error: imgur.com/AUaaqmM
-
arthur almost 8 years
Can you show me the dataframe just before executing the apply?
-
arthur almost 8 years
I would prefer a screenshot of your Python console after:
print data
-
arthur almost 8 years
Your
data['stemmed']
column does not have exactly the same format as mine. I have edited my answer.
-
Chiel almost 8 years
That's awesome! By the way, I went from ~70% accuracy to 71.17% accuracy (after stemming) using the K-nearest neighbor algorithm, so it clearly helped.