raise ValueError("np.nan is an invalid document, expected byte or "

Solution 1

Replace the NaNs with spaces; this should make CountVectorizer happy:

X, y = df.CONTENT.fillna(' '), df.sentiment

Solution 2

My guess from your question is that some entries in the CONTENT column are empty. You can either use the fillna method or drop the affected rows with df[df["CONTENT"].notnull()]. This gives you a dataset with no NaN values in that column.
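A minimal sketch of the row-dropping approach, assuming a small made-up DataFrame in place of train.csv (the rows below are invented for illustration; only the column names CONTENT and sentiment come from the question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; these rows are made up for illustration.
df = pd.DataFrame({
    "CONTENT": ["great video", np.nan, "please subscribe", None],
    "sentiment": [1, 0, 0, 1],
})

# Option A: boolean mask that keeps rows whose CONTENT is present.
kept = df[df["CONTENT"].notnull()]

# Option B: same result with dropna restricted to the CONTENT column.
kept_alt = df.dropna(subset=["CONTENT"])

# Take X and y from the same filtered frame so the labels stay aligned.
X, y = kept["CONTENT"], kept["sentiment"]
print(len(df), "->", len(kept))
```

Dropping rows discards the matching labels as well, so it is probably the better fit when an empty document carries no useful signal; fillna keeps every row instead.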

Solution 3

You are not handling NaN ("not a number") values properly. Use the pandas fillna() method to replace the missing (NaN) values in your DataFrame column with a value of your choice.

Hence, instead of:

X, y = df.CONTENT, df.sentiment

Use:

X, y = df.CONTENT.fillna(' '), df.sentiment

so that each NaN is replaced by a single space.
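For context, here is a hedged end-to-end sketch of how the fillna fix might slot into the vectorizing step. The DataFrame is a made-up stand-in for train.csv, and the variable names X_train_counts and X_test_counts are hypothetical. It also fits the vectorizer on the training split only and reuses that vocabulary for the test split, rather than fitting a second time on the test data:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv; these rows are made up for illustration.
df = pd.DataFrame({
    "CONTENT": ["good movie", np.nan, "bad movie", "good plot", np.nan, "bad plot"],
    "sentiment": [1, 0, 0, 1, 1, 0],
})

# Replace missing documents with a single space before vectorizing.
X, y = df.CONTENT.fillna(' '), df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

vect = CountVectorizer(ngram_range=(1, 3), analyzer='word')
X_train_counts = vect.fit_transform(X_train)  # learn the vocabulary on training data
X_test_counts = vect.transform(X_test)        # reuse that vocabulary on test data

print(X_train_counts.shape, X_test_counts.shape)
print(sorted(vect.vocabulary_))
```

The learned terms are read from vect.vocabulary_ here because get_feature_names() was removed in recent scikit-learn releases in favour of get_feature_names_out().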

Author: Sadhana Singh

Updated on June 04, 2022

Comments

  • Sadhana Singh, almost 2 years

    I am using CountVectorizer() in scikit-learn for vectorizing the feature sequence. I am receiving an error as below:

    ValueError: np.nan is an invalid document, expected byte or unicode string.
    

    I am using an example csv dataset with two columns CONTENT and sentiment.

    Here's my code:

    df = pd.read_csv("train.csv",encoding = "ISO-8859-1")
    X, y = df.CONTENT, df.sentiment
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    print X_train, y_train
    
    vect = CountVectorizer(ngram_range=(1,3), analyzer='word', encoding = "ISO-8859-1")
    print vect
    X=vect.fit_transform(X_train, y_train)
    y=vect.fit(X_test) 
    print vect.get_feature_names()
    

    Here is the error message in full:

    File "C:/Users/HP/cntVect.py", line 28, in <module>
        X=vect.fit_transform(X_train, y_train)
    
      File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
        self.fixed_vocabulary_)
    
      File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
        for feature in analyze(doc):
    
      File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
    
      File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 121, in decode
        raise ValueError("np.nan is an invalid document, expected byte or "
    
    ValueError: np.nan is an invalid document, expected byte or unicode string.