CountVectorizer method get_feature_names() produces codes but not words
You are using min_df=1, which will include every word found in at least one document, i.e. all the words. min_df could be considered a hyperparameter itself, used to remove the rarest tokens (those that appear in too few documents). I would also recommend using spaCy to tokenize the text and join the tokens back into strings before giving them as input to the CountVectorizer.
Note: the feature names that you see are actually part of your vocabulary; they are just noise. If you want to remove them, set min_df > 1.
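To make the effect of min_df concrete, here is a minimal sketch with a toy corpus (the corpus and threshold are illustrative, not taken from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a zx9q glitch token appears once",
]

# min_df=1 keeps every token that appears in at least one document
v1 = CountVectorizer(min_df=1, stop_words='english')
v1.fit(corpus)

# min_df=2 keeps only tokens that appear in at least two documents,
# which drops one-off noise such as 'zx9q'
v2 = CountVectorizer(min_df=2, stop_words='english')
v2.fit(corpus)

print(sorted(v1.vocabulary_))  # includes 'zx9q'
print(sorted(v2.vocabulary_))  # only tokens shared by at least 2 documents
```

Raising min_df trades vocabulary coverage for noise removal, which is why it behaves like a hyperparameter worth tuning.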
Dmitrij Burlaj
Updated on June 05, 2022
Comments
-
Dmitrij Burlaj almost 2 years
I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer generated. But instead I got a list of codes, not words. What does this mean, and how do I deal with the problem? Here is my code:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()
And I got the following output:
[u'00', u'000', u'0000', u'00000', u'000000000000000000', u'00001', u'000017', u'00001_copy_1', u'00002', u'000044392000001', u'0001', u'00012', u'0004', u'0005', u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me please?
UPDATE: I managed to deal with this problem, but now when I look at my words I see many that are not actually words, just senseless sets of letters (see screenshot attached). Does anybody know how to filter these words out before I use CountVectorizer?
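One way to filter such tokens, not suggested in the thread but a common sklearn-only option, is to tighten CountVectorizer's token_pattern so that only purely alphabetic tokens of a minimum length survive. A sketch with an illustrative corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["order 00001 confirmed ok", "invoice 00077d3 shipped to warehouse"]

# The default token_pattern keeps digit-only tokens such as '00001'.
default_vec = CountVectorizer(stop_words='english')
default_vec.fit(docs)

# Require purely alphabetic tokens of length >= 3; this drops '00001'
# and '00077d3' but keeps ordinary words. The length threshold of 3 is
# a judgement call, not a rule from the thread.
strict_vec = CountVectorizer(stop_words='english',
                             token_pattern=r'(?u)\b[a-zA-Z]{3,}\b')
strict_vec.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(strict_vec.vocabulary_))
```

This removes digit strings and alphanumeric codes, though genuinely random letter jumbles of length 3 or more would still pass and need a dictionary or frequency filter instead.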
-
BENY over 6 years: show your data.head()
-
Dmitrij Burlaj over 6 years: Entire dataframe?
-
Dmitrij Burlaj over 6 years: In fact, many of my words (features) end up concatenated (without whitespace). I think this decreases the prediction accuracy of my model. What can cause this problem?
-
Boubacar Traoré almost 5 years: This question deserves more attention from the scikit-learn community. This problem still exists, and I wasted a whole day trying to figure out what was wrong in my code! This should have been fixed!
-
Dmitrij Burlaj over 6 years: I already changed it to min_df=2, but this did not solve the problem. I am afraid that if I increase this hyperparameter further, I may lose some important words that could be very useful in the later classification of the text. I think I need to preprocess the text in some way instead, but I don't know how to do this.
-
dhanush-ai1990 over 6 years: Can you try using spaCy to preprocess the text? spaCy is an NLP toolkit that is very good at named entity recognition, dependency parsing, etc. You could tokenize the text with spaCy, select only the noun phrases/words, and use that as your vocabulary.
-
Dmitrij Burlaj over 6 years: Ok, how can I select only the noun words using dependency parsing in spaCy? Do you know any good tutorials about this?
-
Dmitrij Burlaj over 6 years: I have 32-bit Windows 7; is it possible to install spaCy on it? On the spaCy website, they say it is compatible with 64-bit Python...
-
Dmitrij Burlaj over 6 years: Or can I do something similar with NLTK?
-
dhanush-ai1990 over 6 years: I think you can do something similar with NLTK. The StackOverflow post below can help: stackoverflow.com/questions/7443330/…
-
Dmitrij Burlaj over 6 years: Thanks, I have already parsed nouns with NLTK, but unfortunately this didn't help to get rid of the senseless combinations of letters that CountVectorizer considers words... And this makes my models perform poorly. What else can be done?
-
Herc01 almost 5 years: I know I come late, but I just stumbled onto this. Use the vocabulary_ attribute to get what you want.
-
Herc01 almost 5 years: @Dmitrij Burlaj don't forget to give me a little upvote if that works.
-
Boubacar Traoré almost 5 years: Don't forget to use sorted(vectorizer.vocabulary_.keys()) in order to preserve the same order, because vectorizer.get_feature_names() returns the feature names already sorted!