Solution 1
Replace the NaN values with spaces. CountVectorizer then receives a string for every document and no longer raises the error:
X, y = df.CONTENT.fillna(' '), df.sentiment
Solution 2
My guess from your question is that some entries in the CONTENT column are empty. You can either fill them with fillna, or drop those rows with df[df["CONTENT"].notnull()], which gives you a dataset with no NaN values in that column.
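To make the two options concrete, here is a minimal sketch on a toy DataFrame. The data is hypothetical and only stands in for train.csv; the column names CONTENT and sentiment are taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (not the asker's dataset).
df = pd.DataFrame({
    "CONTENT": ["good movie", np.nan, "bad plot"],
    "sentiment": [1, 0, 0],
})

# Option A: keep every row and replace the missing text with a space.
filled = df.assign(CONTENT=df["CONTENT"].fillna(" "))

# Option B: drop the rows whose CONTENT is missing.
dropped = df[df["CONTENT"].notnull()]

print(len(filled), len(dropped))
```

Filling keeps the labels of the empty rows in the training data, while dropping removes them. Which one is preferable depends on whether an empty document carries any meaning for your task.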
Solution 3
You are not handling the NaN ("not a number") values properly. Use the pandas fillna() method to replace the missing (NaN) values in your DataFrame column with a value of your choice.
Hence, instead of:
X, y = df.CONTENT, df.sentiment
Use:
X, y = df.CONTENT.fillna(' '), df.sentiment
in which the NaN values are replaced by spaces.
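For completeness, a rough end-to-end sketch of the fix in Python 3. The toy data is hypothetical. Calling transform on the test split instead of refitting is an assumption about what the question intended, not something the question's code does. The fitted vocabulary is read from the vocabulary_ attribute.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for train.csv.
df = pd.DataFrame({
    "CONTENT": ["good movie", np.nan, "bad plot",
                "great acting", np.nan, "poor script"],
    "sentiment": [1, 0, 0, 1, 1, 0],
})

# Replace NaN with a space so every document is a string.
X, y = df.CONTENT.fillna(" "), df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

vect = CountVectorizer(ngram_range=(1, 3), analyzer="word")

# Learn the vocabulary on the training split only (assumed intent),
# then reuse it to encode the test split.
X_train_counts = vect.fit_transform(X_train)
X_test_counts = vect.transform(X_test)

print(X_train_counts.shape, X_test_counts.shape)
print(sorted(vect.vocabulary_))
```

Documents that were NaN become all-zero rows in the count matrix, because a single space produces no tokens.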
Sadhana Singh
Updated on June 04, 2022
I am using CountVectorizer() in scikit-learn to vectorize the feature sequence. I am receiving the error below:
ValueError: np.nan is an invalid document, expected byte or unicode string.
I am using an example CSV dataset with two columns, CONTENT and sentiment. Here's my code:
df = pd.read_csv("train.csv",encoding = "ISO-8859-1")
X, y = df.CONTENT, df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print X_train, y_train
vect = CountVectorizer(ngram_range=(1,3), analyzer='word', encoding = "ISO-8859-1")
print vect
X=vect.fit_transform(X_train, y_train)
y=vect.fit(X_test)
print vect.get_feature_names()
Here is the error message in full:
  File "C:/Users/HP/cntVect.py", line 28, in <module>
    X=vect.fit_transform(X_train, y_train)
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
    for feature in analyze(doc):
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 121, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.