CountVectorizer method get_feature_names() produces codes but not words


You are using min_df=1, which keeps every word that appears in at least one document, i.e. all of them. min_df can be treated as a hyperparameter in its own right: raising it removes tokens that occur in only a handful of documents, which is where most of these numeric codes come from. I would also recommend using spaCy to tokenize the text and join the tokens back into strings before giving them to CountVectorizer.

Note: the feature names you see are genuinely part of your vocabulary; they are just noise. If you want to remove them, set min_df > 1.
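
As a rough illustration of the suggestion above (not part of the original answer), here is one way to run each message through spaCy, keep only alphabetic tokens, and rejoin them before vectorizing. The column name message_encoding comes from the question; the en_core_web_sm model is an assumption and needs to be installed first with python -m spacy download en_core_web_sm:

    import spacy
    from sklearn.feature_extraction.text import CountVectorizer

    # load only the tokenizer/tagger pipeline; parser and NER are not needed here
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def clean_text(text):
        # keep alphabetic, non-stop-word tokens, which drops codes like '00001_copy_1'
        doc = nlp(str(text))
        return " ".join(tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop)

    df["cleaned"] = df["message_encoding"].apply(clean_text)

    # min_df=2 additionally drops any token that survives cleaning but
    # appears in only a single document
    vectorizer = CountVectorizer(min_df=2, stop_words="english")
    X = vectorizer.fit_transform(df["cleaned"])
    print(vectorizer.get_feature_names()[:20])

The is_alpha check alone already removes purely numeric tokens, so min_df mostly serves to prune rare typos and leftover junk.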

Author: Dmitrij Burlaj

Updated on June 05, 2022

Comments

  • Dmitrij Burlaj
    Dmitrij Burlaj almost 2 years

    I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer generates, but instead of words I get a list of codes. What does this mean and how do I deal with the problem? Here is my code:

    vectorizer = CountVectorizer(min_df=1, stop_words='english')
    X = vectorizer.fit_transform(df['message_encoding'])
    vectorizer.get_feature_names()
    

    And I got the following output:

    [u'00',
    u'000',
    u'0000',
    u'00000',
    u'000000000000000000',
    u'00001',
    u'000017',
    u'00001_copy_1',
    u'00002',
    u'000044392000001',
    u'0001',
    u'00012',
    u'0004',
    u'0005',
    u'00077d3',
    

    and so on.

    I need real feature names (words), not these codes. Can anybody help me please?

    UPDATE: I managed to deal with this problem, but now when I look at my words I see many that are not actually words but senseless strings of letters (see screenshot attached). Does anybody know how to filter these out before I use CountVectorizer?

    [Screenshot of data.head()]

    • BENY
      BENY over 6 years
      show your data.head()
    • Dmitrij Burlaj
      Dmitrij Burlaj over 6 years
      Entire dataframe?
    • Dmitrij Burlaj
      Dmitrij Burlaj over 6 years
      In fact, many of my words (features) become connected (without whitespace). I think it decreases the prediction accuracy of my model. What can cause this problem?
    • Boubacar Traoré
      Boubacar Traoré almost 5 years
      This question deserves more attention from the scikit-learn community. The problem still exists and I wasted a whole day trying to figure out what was wrong in my code! This should have been fixed!
  • Dmitrij Burlaj
    Dmitrij Burlaj over 6 years
    I already changed it to min_df=2, but that did not solve the problem. I am afraid that if I increase this hyperparameter further I might lose important words that could be very useful for classifying the text later on. I think I need to preprocess the text in some other way instead, but I don't know how to do this.
  • dhanush-ai1990
    dhanush-ai1990 over 6 years
    Can you try using spaCy to preprocess the text? spaCy is an NLP toolkit that is very good at named entity recognition, dependency parsing, etc. You could tokenize the text with spaCy, select only the noun phrases/words using dependency parsing, and use that as your vocabulary (see the sketch after these comments).
  • Dmitrij Burlaj
    Dmitrij Burlaj over 6 years
    Ok, how can I select only the noun words using dependency parsing in spaCy? Do you know of any good tutorials on this?
  • Dmitrij Burlaj
    Dmitrij Burlaj over 6 years
    I have 32-bit Windows 7; is it possible to install spaCy on it? On the spaCy website they say that it is compatible with 64-bit Python...
  • Dmitrij Burlaj
    Dmitrij Burlaj over 6 years
    Or can I do something similar with NLTK?
  • dhanush-ai1990
    dhanush-ai1990 over 6 years
    I think you can do something similar with NLTK. The StackOverflow post below can help you: stackoverflow.com/questions/7443330/…
  • Dmitrij Burlaj
    Dmitrij Burlaj over 6 years
    Thanks, I have already parsed nouns with NLTK, but unfortunately this didn't help to get rid of the senseless combinations of letters that CountVectorizer considers words... and that makes my models perform poorly. What else can be done?
  • Herc01
    Herc01 almost 5 years
    I know I'm coming in late, but I just stumbled onto this. Use the vocabulary_ attribute to get what you want (there is a short sketch after these comments).
  • Herc01
    Herc01 almost 5 years
    @Dmitrij Burlaj don't forget to give me a little credit if that ever works
  • Boubacar Traoré
    Boubacar Traoré almost 5 years
    Don't forget to use sorted(vectorizer.vocabulary_.keys()) in order to preserve the same order, since vectorizer.get_feature_names() returns the feature names already sorted!
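
As referenced in dhanush-ai1990's comment above, here is a hedged sketch of restricting the vocabulary to nouns with spaCy. It uses part-of-speech tags rather than a full dependency parse for simplicity; the helper name extract_nouns and the en_core_web_sm model are illustrative assumptions, not part of the original discussion.

    import spacy
    from sklearn.feature_extraction.text import CountVectorizer

    nlp = spacy.load("en_core_web_sm")

    def extract_nouns(text):
        # keep only common and proper nouns, as suggested in the comments
        doc = nlp(str(text))
        return " ".join(tok.text.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN"))

    df["nouns_only"] = df["message_encoding"].apply(extract_nouns)
    vectorizer = CountVectorizer(min_df=2)
    X = vectorizer.fit_transform(df["nouns_only"])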
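
And to illustrate the last two comments about vocabulary_: it maps each term to its column index in X, and sorting its keys reproduces the order used by get_feature_names(). A minimal check, assuming vectorizer has already been fitted:

    # vocabulary_ is a dict {term: column index}; get_feature_names()
    # returns the terms sorted alphabetically, so the two line up like this
    features_from_vocab = sorted(vectorizer.vocabulary_.keys())
    assert features_from_vocab == list(vectorizer.get_feature_names())

    # look up the column of a particular term (the term 'message' is just an example)
    # col = vectorizer.vocabulary_["message"]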