How to stem words in a Python list?

39,894

Solution 1

from stemming.porter2 import stem

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

What we are doing here is using a list comprehension to loop through each string inside the main list, splitting that into a list of words. Then we loop through that list, stemming each word as we go, returning the new list of stemmed words.

Please note I haven't tried this with stemming installed - I have taken that from the comments, and have never used it myself. This is, however, the basic concept for splitting the list into words. Note that this will produce a list of lists of words, keeping the original separation.
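To make the shape of the result concrete, here is a small sketch that writes the same nested comprehension out as explicit loops. The `toy_stem` function is a hypothetical stand-in (it only lowercases and drops a trailing "s") so that the example runs without any stemming package installed; it is not a real stemming algorithm, and a real `stem` function would be swapped in for it.

```python
def toy_stem(word):
    """Hypothetical placeholder for a real stemmer: lowercase, drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

documents = ["The intersection graph of paths in trees",
             "Graph minors A survey"]

# Equivalent to:
# [[toy_stem(word) for word in sentence.split(" ")] for sentence in documents]
stemmed = []
for sentence in documents:            # outer loop: one sentence at a time
    words = []
    for word in sentence.split(" "):  # inner loop: one word at a time
        words.append(toy_stem(word))
    stemmed.append(words)             # one inner list per original sentence

print(stemmed)
# [['the', 'intersection', 'graph', 'of', 'path', 'in', 'tree'],
#  ['graph', 'minor', 'a', 'survey']]
```

The outer list has one entry per sentence, and each entry is itself a list of stemmed words.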

If you do not want this separation, you can instead do:

documents = [stem(word) for sentence in documents for word in sentence.split(" ")]

This leaves you with one continuous list.

If you wish to join the words back together at the end, you can do:

documents = [" ".join(sentence) for sentence in documents]

or to do it in one line:

documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]

if you are keeping the sentence structure, or

documents = " ".join(documents)

if you are ignoring it (that is, starting from the flat list of words).
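The two rejoining options can be sketched side by side as small helper functions. The names `stem_sentences` and `stem_words` are made up for this illustration, and `toy_stem` is again a hypothetical stand-in for a real stemmer so the example runs on its own. The sketch uses `split()` with no argument, which also collapses repeated whitespace, rather than `split(" ")`.

```python
def toy_stem(word):
    """Hypothetical placeholder for a real stemmer: lowercase, drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def stem_sentences(documents, stem):
    """Stem every word and return one string per original sentence."""
    return [" ".join(stem(word) for word in sentence.split())
            for sentence in documents]

def stem_words(documents, stem):
    """Stem every word and return one flat list, discarding sentence boundaries."""
    return [stem(word) for sentence in documents for word in sentence.split()]

documents = ["The EPS user interface management system",
             "Graph minors A survey"]

per_sentence = stem_sentences(documents, toy_stem)
flat = stem_words(documents, toy_stem)

print(per_sentence)    # two strings, one per input sentence
print(" ".join(flat))  # a single string with all stemmed words
```

Passing the stemmer in as an argument means the same helpers should work with any function that maps a word to its stem.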

Solution 2

You might want to have a look at NLTK (the Natural Language Toolkit). Its nltk.stem module contains several different stemmers.

See also this question.

Solution 3

Alright. So, using the stemming package, you'd have something like this:

from stemming.porter2 import stem
from itertools import chain

def flatten(list_of_lists):
    """Flatten one level of nesting."""
    return list(chain.from_iterable(list_of_lists))

def stemall(documents):
    """Return one flat list with the stem of every word in every document."""
    return flatten([stem(word) for word in line.split(" ")] for line in documents)

Solution 4

You can use NLTK:

from nltk.stem import PorterStemmer


ps = PorterStemmer()
final = [[ps.stem(token) for token in sentence.split(" ")] for sentence in documents]

NLTK has many features that are useful for information retrieval (IR) systems, so it is worth a look.

Solution 5

from nltk.stem import PorterStemmer
ps = PorterStemmer()
# `words` must be a flat list of individual words, not whole sentences
list_stem = [ps.stem(word) for word in words]
Author by

ChamingaD

Updated on September 26, 2020

Comments

  • ChamingaD
    ChamingaD over 3 years

    I have a Python list like the one below:

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]
    

    Now I need to stem it (each word) and get another list. How do I do that?

  • ChamingaD
    ChamingaD about 12 years
    Thanks :) can i know how to iterate through whole list and stem all words ?
  • DSM
    DSM about 12 years
    That won't work; each "word" in your listcomp will be a list.
  • Niklas B.
    Niklas B. about 12 years
    @ChamingaD: words = [w for line in documents for w in line.split()]. Or even words = ' '.join(documents).split()
  • ChamingaD
    ChamingaD about 12 years
    Thanks. it stems but splits each word in list. ['comput', 'compil', 'translat', 'sourc', 'code', 'into', 'object', 'code,', 'while', 'interpret', 'execut', 'the', 'program'] ['A', 'compil', 'compil', 'your', 'code', 'into', 'a', '"runable"', 'applic', '(e.g:', 'a', '.ex', 'file)', 'where', 'as', 'an', 'intepret', 'run', 'the', 'sourc', 'code', 'as', 'it', 'goe']
  • ChamingaD
    ChamingaD about 12 years
    How can i stop splitting each word in final list ?
  • Gareth Latty
    Gareth Latty about 12 years
    @ChamingaD Edited to include a way to rejoin the lists.
  • Tuan Anh Hoang-Vu
    Tuan Anh Hoang-Vu about 11 years
    By joining them together using " ".join(list_of_words)
  • Max
    Max about 10 years
    Is stemming no longer a package in Python 3?
  • bernando_vialli
    bernando_vialli over 6 years
    @Max, did you ever figure out whether its a package in python3? I seem to be having issues with it now