AttributeError: 'int' object has no attribute 'lower' in TFIDF and CountVectorizer


Solution 1

As the message says, AttributeError: 'int' object has no attribute 'lower' means an integer cannot be lower-cased. Somewhere in your data there is an integer, and the vectorizer tries to lower-case it, which is not possible.

Why does this happen?

The CountVectorizer constructor has a lowercase parameter, which is True by default. When you call .fit_transform(), it tries to lower-case every item in your input, and your input contains at least one integer. E.g., your list contains data similar to:

 corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']

When you pass the above list to CountVectorizer, it throws this exception.
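A minimal reproduction (assuming scikit-learn is installed) shows the vectorizer failing on the integer item:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same kind of corpus as above, with one integer mixed in.
corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']

cv = CountVectorizer()  # lowercase=True by default
err = None
try:
    cv.fit_transform(corpus)
except AttributeError as exc:
    err = exc

print(err)  # 'int' object has no attribute 'lower'
```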

How to fix it?

Here are some possible solutions to avoid this problem:

1) Convert all items in your corpus to string objects:

 corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']
 corpus = [str(item) for item in corpus]

2) Remove the integers from your corpus:

corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']
corpus = [item for item in corpus if not isinstance(item, int)]
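Either fix lets the vectorizer run. A quick sketch of both (note that with fix 1 the number simply becomes the token '12930'):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']

# Fix 1: cast every item to str.
corpus_str = [str(item) for item in corpus]

# Fix 2: keep only the non-integer items.
corpus_no_ints = [item for item in corpus if not isinstance(item, int)]

cv = CountVectorizer()
X = cv.fit_transform(corpus_str)   # no AttributeError any more
print(X.shape)                     # (4, 4): 4 docs, 4 distinct tokens
print(sorted(cv.vocabulary_))      # ['100', '12930', 'sentence', 'sentence1']
```

With the default token pattern, single-character tokens like '2' are dropped, which is why 'sentence 2' contributes only 'sentence' to the vocabulary.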

Solution 2

You can set lowercase=False so the vectorizer never calls .lower():

cv = CountVectorizer(lowercase=False)

Note that this only skips the lower-casing step; the default tokenizer still expects strings, so a non-string item can still fail later with a TypeError. Converting to str (Solution 1) is the more robust fix.
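A small check (a sketch, assuming scikit-learn's default regex tokenizer) shows that lowercase=False alone may not be enough when the corpus still contains an integer:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['sentence1', 'sentence 2', 12930, 'sentence 100']
cv = CountVectorizer(lowercase=False)

err = None
try:
    cv.fit_transform(corpus)
except (AttributeError, TypeError) as exc:
    # The default regex tokenizer expects a string, so the integer
    # can still fail at tokenization even with lowercase=False.
    err = exc

print(type(err).__name__ if err is not None else 'vectorized without error')
```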
Author: hadi javanmard

Updated on June 26, 2022

Comments

  • hadi javanmard
    hadi javanmard almost 2 years

    I tried to predict different classes of the entry messages and I worked on the Persian language. I used Tfidf and Naive-Bayes to classify my input data. Here is my code:

    import pandas as pd
    df=pd.read_excel('dataset.xlsx')
    col=['label','body']
    df=df[col]
    df.columns=['label','body']
    df['class_type'] = df['label'].factorize()[0]
    class_type_df=df[['label','class_type']].drop_duplicates().sort_values('class_type')
    class_type_id = dict(class_type_df.values)
    id_to_class_type = dict(class_type_df[['class_type', 'label']].values)
    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer()
    features=tfidf.fit_transform(df.body).toarray()
    classtype=df.class_type
    print(features.shape)
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB 
    X_train,X_test,y_train,y_test=train_test_split(df['body'],df['label'],random_state=0)
    cv=CountVectorizer()
    X_train_counts=cv.fit_transform(X_train)
    tfidf_transformer=TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    clf = MultinomialNB().fit(X_train_tfidf, y_train)
    print(clf.predict(cv.transform(["خريد و فروش لوازم آرايشي از بانه"])))
    

    But when I run the above code, it throws the following exception, while I expect it to output the "ads" class:

    Traceback (most recent call last):
      File ".../multiclass-main.py", line 27, in
        X_train_counts=cv.fit_transform(X_train)
      File "...\sklearn\feature_extraction\text.py", line 1012, in fit_transform
        self.fixed_vocabulary_)
      File "...sklearn\feature_extraction\text.py", line 922, in _count_vocab
        for feature in analyze(doc):
      File "...sklearn\feature_extraction\text.py", line 308, in
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "...sklearn\feature_extraction\text.py", line 256, in
        return lambda x: strip_accents(x.lower())
    AttributeError: 'int' object has no attribute 'lower'

    How can I use Tfidf and CountVectorizer in this project?

  • hadi javanmard
    hadi javanmard over 5 years
    Thanks for answering. I found out that Persian statements should be encoded to be processed, but I don't know how to fix this problem. @Amir
  • Amir
    Amir over 5 years
    @hadijavanmard In Python 3 that is not needed. Just preprocess your DataFrame as above.
  • hadi javanmard
    hadi javanmard over 5 years
    I can't do solution no. 2 because those are the sentences I want to build my DataFrame with. What do you mean by preprocessing the DataFrame? I don't understand solution no. 1. What should I do? @Amir
  • Amir
    Amir over 5 years
    Replacing corpus in my code with X_train would fix your issue.
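Putting that advice together with Solution 1: the asker's pipeline works once the text column is cast to str before vectorizing. A minimal sketch with made-up data (the DataFrame below is a hypothetical stand-in for dataset.xlsx, and the labels/bodies are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for dataset.xlsx, with one integer in the
# text column to reproduce the original problem.
df = pd.DataFrame({
    'label': ['ads', 'news', 'ads', 'news', 'ads', 'news'],
    'body': ['buy cosmetics now', 'daily weather report', 12930,
             'election results today', 'discount sale offer',
             'sports match summary'],
})

# The one-line fix: make every entry of the text column a string.
df['body'] = df['body'].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    df['body'], df['label'], random_state=0)

cv = CountVectorizer()
X_train_counts = cv.fit_transform(X_train)  # no AttributeError now
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Apply the same transforms to new text before predicting.
new_counts = cv.transform(['discount cosmetics sale'])
pred = clf.predict(tfidf.transform(new_counts))
print(pred)
```

One further detail: the original code trained on tf-idf features but predicted on raw counts; applying the same TfidfTransformer to the new text keeps training and prediction consistent.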