CountVectorizer method get_feature_names() produces codes but not words
You are using min_df=1, which will include every word found in at least one document, i.e. all the words. min_df could be considered a hyperparameter itself, used to remove the rarest tokens (those that appear in too few documents). I would also recommend using spaCy to tokenize the text and join the tokens back into strings before giving them as input to the CountVectorizer.
Note: the feature names that you see are actually part of your vocabulary; they are just noise. If you want to remove them, set min_df > 1.
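To make the effect of min_df concrete, here is a minimal sketch with a toy corpus (the corpus and threshold are illustrative, not taken from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a zx9q glitch token appears once",
]

# min_df=1 keeps every token that appears in at least one document
v1 = CountVectorizer(min_df=1, stop_words='english')
v1.fit(corpus)

# min_df=2 keeps only tokens that appear in at least two documents,
# which drops one-off noise such as 'zx9q'
v2 = CountVectorizer(min_df=2, stop_words='english')
v2.fit(corpus)

print(sorted(v1.vocabulary_))  # includes 'zx9q'
print(sorted(v2.vocabulary_))  # only tokens shared by at least 2 documents
```

Raising min_df trades vocabulary coverage for noise removal, which is why it behaves like a hyperparameter worth tuning.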
Dmitrij Burlaj
Updated on June 05, 2022
Comments
-
Dmitrij Burlaj almost 2 years
I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer generated. But instead I got a list of codes, not words. What does this mean, and how do I deal with the problem? Here is my code:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()
And I got the following output:
[u'00', u'000', u'0000', u'00000', u'000000000000000000', u'00001', u'000017', u'00001_copy_1', u'00002', u'000044392000001', u'0001', u'00012', u'0004', u'0005', u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me please?
UPDATE: I managed to deal with this problem, but now when I look at my words I see many that are not actually words, just senseless sets of letters (see screenshot attached). Does anybody know how to filter these words out before I use CountVectorizer?
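One way to filter such tokens, not suggested in the thread but a common sklearn-only option, is to tighten CountVectorizer's token_pattern so that only purely alphabetic tokens of a minimum length survive. A sketch with an illustrative corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["order 00001 confirmed ok", "invoice 00077d3 shipped to warehouse"]

# The default token_pattern keeps digit-only tokens such as '00001'.
default_vec = CountVectorizer(stop_words='english')
default_vec.fit(docs)

# Require purely alphabetic tokens of length >= 3; this drops '00001'
# and '00077d3' but keeps ordinary words. The length threshold of 3 is
# a judgement call, not a rule from the thread.
strict_vec = CountVectorizer(stop_words='english',
                             token_pattern=r'(?u)\b[a-zA-Z]{3,}\b')
strict_vec.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(strict_vec.vocabulary_))
```

This removes digit strings and alphanumeric codes, though genuinely random letter jumbles of length 3 or more would still pass and need a dictionary or frequency filter instead.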
-
BENY over 6 years: show your data.head()
-
Dmitrij Burlaj over 6 years: Entire dataframe?
-
Dmitrij Burlaj over 6 years: In fact, many of my words (features) end up concatenated (without whitespace). I think this decreases the prediction accuracy of my model. What can cause this problem?
-
Boubacar Traoré almost 5 years: This question deserves more attention from the scikit-learn community. This problem still exists, and I wasted a whole day trying to figure out what was wrong in my code! This should have been fixed!
-
Dmitrij Burlaj over 6 years: I already changed it to min_df=2, but this did not solve the problem. I am afraid that if I increase this hyperparameter further, I may lose some important words that could be very useful in the later classification of the text. I think I need to preprocess the text in some way instead, but I don't know how to do this.
-
dhanush-ai1990 over 6 years: Can you try using spaCy to preprocess the text? spaCy is an NLP toolkit that is very good at named entity recognition, dependency parsing, etc. You could tokenize the text with spaCy, select only the noun phrases/words, and use that as your vocabulary.
-
Dmitrij Burlaj over 6 years: Ok, how can I select only the noun words using dependency parsing in spaCy? Do you know any good tutorials about this?
-
Dmitrij Burlaj over 6 years: I have 32-bit Windows 7; is it possible to install spaCy on it? On the spaCy website, they say it is compatible with 64-bit Python...
-
Dmitrij Burlaj over 6 years: Or can I do something similar with NLTK?
-
dhanush-ai1990 over 6 years: I think you can do something similar with NLTK. The StackOverflow post below can help: stackoverflow.com/questions/7443330/…
-
Dmitrij Burlaj over 6 years: Thanks, I have already parsed nouns with NLTK, but unfortunately this didn't help to get rid of the senseless combinations of letters that CountVectorizer considers words... And this makes my models perform poorly. What else can be done?
-
Herc01 almost 5 years: I know I come late, but I just stumbled onto this. Use the vocabulary_ attribute to get what you want.
-
Herc01 almost 5 years: @Dmitrij Burlaj don't forget to give me a little upvote if that works.
-
Boubacar Traoré almost 5 years: Don't forget to use sorted(vectorizer.vocabulary_.keys()) in order to preserve the same order, because vectorizer.get_feature_names() returns the feature names already sorted!