Insert result of sklearn CountVectorizer in a pandas dataframe

Learn the vocabulary dictionary from the raw documents and return the term-document matrix.

X = vect.fit_transform(docs) 

Convert the sparse CSR matrix to dense format and use the vectorizer's feature names (the mapping from feature integer indices to terms) as the column labels.

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())

Concatenate the original df and the count_vect_df columnwise.

pd.concat([df, count_vect_df], axis=1)

Author: Saurabh Sood

Updated on September 23, 2022

Comments

  • Saurabh Sood
    Saurabh Sood over 1 year

    I have a bunch of 14784 text documents, which I am trying to vectorize, so I can run some analysis. I used the CountVectorizer in sklearn, to convert the documents to feature vectors. I did this by calling:

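    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()  # instantiate the class; the parentheses are required
    features = vectorizer.fit_transform(examples)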
    

    where examples is an array of all the text documents

    Now, I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (without the text features inserted) has the shape (14784, 5). The shape of my feature matrix is (14784, 21343).

    What would be a good way to insert the vectorized features into the pandas dataframe?

  • Saurabh Sood
    Saurabh Sood over 7 years
    So, in your second line, are you creating a dataframe of the vectorized features? If so, that does not work for me. I get the following error: PandasError: DataFrame constructor not properly called! I used features_df = pd.DataFrame(res), where res is the result of the CountVectorizer fit_transform method.
  • Tchotchke
    Tchotchke over 7 years
    I like those additions in your second line - I'll be incorporating them in some of the projects I've been working on!