Efficient Term Document Matrix with NLTK


Solution 1

Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get in R's tm package. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.

I post it here in the hope that someone else will find it useful.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

def fn_tdm_df(docs, xColNames=None, **kwargs):
    '''Create a term-document matrix as a pandas DataFrame.
    With **kwargs you can pass arguments of CountVectorizer.
    If xColNames is given, the DataFrame gets those column names.'''

    # initialize the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    # create the DataFrame: terms as rows, documents as columns
    df = pd.DataFrame(x1.toarray().transpose(), index=vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames

    return df

To use it on a list of texts in a directory:

DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    '''Create a corpus from a directory.
    Input: directory path
    Output: a dictionary with
             the names of the files ['ColNames']
             the text of the corpus ['docs']'''
    import os
    Res = dict(docs=[open(os.path.join(xDIR, f)).read() for f in os.listdir(xDIR)],
               ColNames=['P_' + f[0:6] for f in os.listdir(xDIR)])  # a list, so it also works on Python 3
    return Res

To create the DataFrame:

d1 = fn_tdm_df(docs=fn_CorpusFromDIR(DIR)['docs'],
               xColNames=fn_CorpusFromDIR(DIR)['ColNames'],
               stop_words=None, charset_error='replace')
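
Note that recent scikit-learn releases removed the charset_error argument (decode_error is the equivalent parameter) and replaced get_feature_names() with get_feature_names_out(). A minimal sketch of the same call adjusted for a current scikit-learn, not part of the original answer:

# decode_error replaces charset_error on recent scikit-learn versions
d1 = fn_tdm_df(docs=fn_CorpusFromDIR(DIR)['docs'],
               xColNames=fn_CorpusFromDIR(DIR)['ColNames'],
               stop_words=None, decode_error='replace')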

Solution 2

I know the OP wanted to create a tdm in NLTK, but the textmining package (pip install textmining) makes it dead simple:

import textmining
    
# Create some very short sample documents
doc1 = 'John and Bob are brothers.'
doc2 = 'John went to the store. The store was closed.'
doc3 = 'Bob went to the store too.'

# Initialize class to create term-document matrix
tdm = textmining.TermDocumentMatrix()

# Add the documents
tdm.add_doc(doc1)
tdm.add_doc(doc2)
tdm.add_doc(doc3)

# Write matrix file -- cutoff=1 means words in 1+ documents are retained
tdm.write_csv('matrix.csv', cutoff=1)

# Instead of writing the matrix, access its rows directly
# (textmining targets Python 2, hence the print statement; on Python 3 use print(row))
for row in tdm.rows(cutoff=1):
    print row

Output:

['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
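
If you would rather work with a DataFrame than the raw CSV, the file written above can be read back with pandas. This is only a sketch; it assumes write_csv emits the same rows that tdm.rows() yields, with the term list as the first (header) row:

import pandas as pd

df = pd.read_csv('matrix.csv')   # one row per document, one column per term (per the assumption above)
print(df.T)                      # transpose for a term-document layout (terms as rows)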

Alternatively, one can use pandas and sklearn [source]:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)

Output:

   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0
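
The frame above is a document-term matrix (documents as rows, terms as columns). Two small adjustments are worth noting, sketched below: transposing gives the term-document orientation of R's tm package, and on scikit-learn 1.0+ get_feature_names() has been replaced by get_feature_names_out():

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)

# get_feature_names_out() is the scikit-learn 1.0+ replacement for get_feature_names()
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

# transpose so terms are rows and documents are columns, like R tm's TermDocumentMatrix
print(df.T)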

Solution 3

An alternative approach using tokens and a DataFrame

import nltk
import pandas as pd
# nltk.download()  # run once to download the tokenizer data
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

tokens = nltk.word_tokenize(raw)
type(tokens)

tokens[1:10]
['Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

tokens2 = pd.DataFrame(tokens)
tokens2.columns = ['Words']
tokens2.head()


       Words
0        The
1    Project
2  Gutenberg
3      EBook
4         of

tokens2.Words.value_counts().head()
,                 16178
.                  9589
the                7436
and                6284
to                 5278
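
The same counts can also come straight from NLTK without the detour through pandas; a minimal sketch using the tokens from above:

import nltk

fd = nltk.FreqDist(tokens)      # frequency distribution over the token list
print(fd.most_common(5))        # mirrors tokens2.Words.value_counts().head()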

Comments

  • user1043144 almost 2 years

    I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

    def fnDTM_Corpus(xCorpus):
        import pandas as pd
        '''to create a Term Document Matrix from a NLTK Corpus'''
        fd_list = []
        for x in range(0, len(xCorpus.fileids())):
            fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
        DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
        DTM.fillna(0,inplace = True)
        return DTM.T
    

    To run it:

    import nltk
    from nltk.corpus import PlaintextCorpusReader
    corpus_root = 'C:/Data/'
    
    newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    
    x = fnDTM_Corpus(newcorpus)
    

    It works well for a few small files in the corpus, but gives me a MemoryError when I try to run it with a corpus of 4,000 files (of about 2 KB each).

    Am I missing something?

    I am using 32-bit Python (on Windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need to use 64-bit Python for a corpus of this size?

  • Duong Trung Nghia over 7 years
    I got an error when running your code: import stemmer raises ImportError: No module named 'stemmer'. How can I fix it? I already tried pip install stemmer.
  • duhaime over 7 years
    What version of Python are you on? It's possible there's a stemmer module import within the textmining package that's failing. I just ran pip install textmining, then ran the code above on 2.7.9 and got the expected output.
  • Duong Trung Nghia over 7 years
    I use Python 3.5, Anaconda, Windows 10. I ran pip install textmining, then copied and ran the code as-is.
  • duhaime over 7 years
    It's possible the textmining module has a hard dependency on Python 2.7. Could you try conda create -n myvirtualenv python=2.7, then source activate myvirtualenv, then repeat the pip install and try the script again inside the conda environment? Once you're done with the environment, just type source deactivate and you'll have access to your system-level Python 3.5 environment again.
  • MERose over 6 years
    Yes, I think there's a problem for Python 3 users. I filed an issue for that.