Insert result of sklearn CountVectorizer in a pandas dataframe
Learn the vocabulary dictionary from the raw documents and return the document-term matrix.
X = vect.fit_transform(docs)
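Convert the sparse CSR matrix to a dense array and use the feature names (the mapping from feature integer indices to feature names) as the column labels. In scikit-learn 1.0 and later this mapping comes from get_feature_names_out(); the older get_feature_names() was removed in 1.2.
count_vect_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())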
Concatenate the original df and the count_vect_df column-wise.
pd.concat([df, count_vect_df], axis=1)
Saurabh Sood
Updated on September 23, 2022
Comments
-
Saurabh Sood over 1 year
I have a set of 14784 text documents that I am trying to vectorize so I can run some analysis. I used the CountVectorizer in sklearn to convert the documents to feature vectors. I did this by calling:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)
where examples is an array of all the text documents.
Now I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (without the text features inserted) has the shape (14784, 5). The shape of my feature vector is (14784, 21343).
What would be a good way to insert the vectorized features into the pandas dataframe?
-
Saurabh Sood over 7 years
So, in your second line, are you creating a dataframe of the vectorized features? If so, that does not work for me. I get the following error:
PandasError: DataFrame constructor not properly called!
I used:
features_df = pd.DataFrame(res)
where res is the result of CountVectorizer's fit_transform method.
-
Tchotchke over 7 years
I like those additions in your second line - I'll be incorporating them in some of the projects I've been working on!