TFIDF Vectorizer giving error


The tokenizer parameter expects a callable, not a string. Try defining a function that tokenizes your data appropriately. If the data is comma-delimited, then:

def tokens(x):
    return x.split(',')

should work.
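As a quick sanity check before handing the function to the vectorizer, the tokenizer can be called directly on a sample string. The sketch below is plain Python; the name tokens_stripped and its whitespace handling are illustrative additions, not part of the original answer.

```python
def tokens(x):
    # Split a comma-delimited document into one token per word group.
    return x.split(',')


def tokens_stripped(x):
    # Hypothetical variant: trims whitespace around each piece and drops
    # empty pieces, which helps if the data uses ", " or trailing commas.
    return [piece.strip() for piece in x.split(',') if piece.strip()]


sample = 'angel eyes has, each one for, on its own,'
print(tokens(sample))           # keeps leading spaces and a trailing empty token
print(tokens_stripped(sample))  # cleaned three-word groups only
```

Whether the stripped variant is needed depends on how the files are actually delimited; with a bare comma and no surrounding spaces, the simpler function is enough.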

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(tokenizer=tokens, use_idf=True, smooth_idf=True, sublinear_tf=False)

Create a sample string delimited by commas:

a = ['cat on the,angel eyes has,blue red angel,one two blue,blue whales eat,hot tin roof']

tfidf_vect.fit_transform(a)
tfidf_vect.get_feature_names()

returns

Out[73]:

[u'angel eyes has',
 u'blue red angel',
 u'blue whales eat',
 u'cat on the',
 u'hot tin roof',
 u'one two blue']
Author by

Axe

Updated on June 05, 2020

Comments

  • Axe
    Axe almost 4 years

    I am trying to carry out text classification on a set of files using TF-IDF and an SVM. The features are to be selected three words at a time. My data files are already in the format: angel eyes has, each one for, on its own. There are no stop words, and I cannot do lemmatization or stemming. I want the features to be selected as: angel eyes has ... The code I have written is given below:

    import os
    import sys
    import numpy
    from sklearn.svm import LinearSVC
    from sklearn.metrics import confusion_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn import metrics
    from sklearn.datasets import load_files
    from sklearn.cross_validation import train_test_split
    
    dt=load_files('C:/test4',load_content=True)
    d= len(dt)
    print dt.target_names
    X, y = dt.data, dt.target
    print y
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    print y_train
    vectorizer = CountVectorizer()
    z= vectorizer.fit_transform(X_train)
    tfidf_vect= TfidfVectorizer(lowercase= True, tokenizer=',', max_df=1.0, min_df=1, max_features=None, norm=u'l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
    
    
    X_train_tfidf = tfidf_vect.fit_transform(z)
    
    print tfidf_vect.get_feature_names()
    svm_classifier = LinearSVC().fit(X_train_tfidf, y_train)
    

    Unfortunately, I get an error at the line X_train_tfidf = tfidf_vect.fit_transform(z): AttributeError: lower not found.
    If I modify the code to do

    tfidf_vect= TfidfVectorizer( tokenizer=',', use_idf=True, smooth_idf=True, sublinear_tf=False)
    print "okay2"
    #X_train_tfidf = tfidf_transformer.fit_transform(z)
    X_train_tfidf = tfidf_vect.fit_transform(X_train)
    print X_train_tfidf.getfeature_names()
    

    I get the error: TypeError: 'str' object is not callable. Can someone please tell me where I am going wrong?

    • JAB
      JAB over 9 years
      What happens if you remove the tokenizer parameter?
  • Axe
    Axe over 9 years
    Thank you very much. It worked. But I don't understand why it was not working when I set the tokenizer. I am asking just for the sake of knowledge.
  • JAB
    JAB over 9 years
    When you passed the string ',' directly as the tokenizer, it was trying to call the string. You need to pass a function that tokenizes the data. Is this what you mean?
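The difference between passing a function and passing a string can be reproduced without scikit-learn. In the minimal sketch below, apply_tokenizer is a hypothetical stand-in for the step where a vectorizer calls the tokenizer on each document; it is not scikit-learn's actual code.

```python
def tokens(x):
    # A callable tokenizer: splits a comma-delimited document.
    return x.split(',')


def apply_tokenizer(tokenizer, doc):
    # Hypothetical stand-in for what a vectorizer does internally:
    # it calls the tokenizer on each document.
    return tokenizer(doc)


print(callable(tokens))  # a function can be called
print(callable(','))     # a string cannot

print(apply_tokenizer(tokens, 'angel eyes has,each one for'))

try:
    apply_tokenizer(',', 'angel eyes has,each one for')
except TypeError as exc:
    # Same kind of failure as in the question: the string itself gets called.
    error_message = str(exc)
    print(error_message)
```

Passing tokenizer=',' therefore fails as soon as the vectorizer tries to tokenize the first document, while passing the function name (without parentheses) works.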