Understanding min_df and max_df in scikit CountVectorizer
Solution 1
`max_df` is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:

- `max_df = 0.50` means "ignore terms that appear in more than 50% of the documents".
- `max_df = 25` means "ignore terms that appear in more than 25 documents".

The default `max_df` is `1.0`, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
`min_df` is used for removing terms that appear too infrequently. For example:

- `min_df = 0.01` means "ignore terms that appear in less than 1% of the documents".
- `min_df = 5` means "ignore terms that appear in fewer than 5 documents".

The default `min_df` is `1`, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
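Putting both rules together on a made-up four-document corpus (a sketch, assuming scikit-learn is installed; the corpus and thresholds are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "apple banana common",
    "apple cherry common",
    "banana cherry common",
    "apple durian common",
]

# min_df=2 drops terms found in fewer than 2 documents ("durian"),
# max_df=0.9 drops terms found in more than 90% of documents ("common").
cv = CountVectorizer(min_df=2, max_df=0.9)
cv.fit(corpus)

print(sorted(cv.vocabulary_))   # ['apple', 'banana', 'cherry']
print(sorted(cv.stop_words_))   # ['common', 'durian'] -- the pruned terms
```

The pruned terms are collected in the fitted vectorizer's `stop_words_` attribute, which is a handy way to check what a given threshold actually removed.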
Solution 2
I would add this point also for understanding `min_df` and `max_df` in tf-idf better.

If you go with the default values, meaning considering all terms, you will generate considerably more tokens. So your clustering process (or any other thing you want to do with those terms later) will take longer.

BUT the quality of your clustering should NOT be reduced.

One might think that allowing all terms (e.g. too-frequent terms or stop words) to be present would lower the quality, but in tf-idf it doesn't, because the tf-idf measure naturally gives a low score to those terms, effectively making them not influential (as they appear in many documents).
So to sum it up, pruning the terms via `min_df` and `max_df` is to improve the performance, not the quality of the clusters (as an example).

And the crucial point is that if you set the `min` and `max` mistakenly, you would lose some important terms and thus lower the quality. So if you are unsure about the right threshold (it depends on your document set), or if you are confident in your machine's processing capabilities, leave the `min` and `max` parameters unchanged.
Solution 3
As per the `CountVectorizer` documentation here:

When using a float in the range `[0.0, 1.0]`, they refer to the document frequency, that is, the percentage of documents that contain the term. When using an int, it refers to the absolute number of documents that hold this term.

Consider the example where you have 5 text files (or documents). If you set `max_df = 0.6` then that would translate to `0.6 * 5 = 3` documents. If you set `max_df = 2` then that would simply translate to 2 documents.
The source code example below is copied from GitHub here and shows how `max_doc_count` is constructed from `max_df`. The code for `min_df` is similar and can be found on the GH page.

```python
max_doc_count = (max_df
                 if isinstance(max_df, numbers.Integral)
                 else max_df * n_doc)
```
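A quick sketch of that conversion for the five-document example above, using a hypothetical `doc_count` helper that mirrors the logic of the snippet:

```python
import numbers

def doc_count(df, n_doc):
    # ints are absolute document counts; floats are a fraction of the corpus
    return df if isinstance(df, numbers.Integral) else df * n_doc

print(doc_count(0.6, 5))  # 3.0 -> appear in at most 3 of the 5 documents
print(doc_count(2, 5))    # 2  -> the int is used as-is
```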
The defaults for `min_df` and `max_df` are 1 and 1.0, respectively. This basically says "If my term is found in only 1 document, then it's ignored. Similarly if it's found in all documents (100% or 1.0) then it's ignored."

`max_df` and `min_df` are both used internally to calculate `max_doc_count` and `min_doc_count`, the maximum and minimum number of documents that a term must be found in. This is then passed to `self._limit_features` as the keyword arguments `high` and `low` respectively. The docstring for `self._limit_features` is:

```python
"""Remove too rare or too common features.

Prune features that are non zero in more samples than high or less
documents than low, modifying the vocabulary, and restricting it to
at most the limit most frequent.

This does not prune samples with zero features.
"""
```
Solution 4
The defaults for `min_df` and `max_df` are 1 and 1.0, respectively. These defaults really don't do anything at all.

That being said, I believe the currently accepted answer by @Ffisegydd isn't quite correct.
For example, run this using the defaults, to see that when `min_df=1` and `max_df=1.0`, then

1) all tokens that appear in at least one document are used (e.g., all tokens!)

2) all tokens that appear in all documents are used (we'll test with one candidate: everywhere).

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=1, max_df=1.0, lowercase=True)
# here is just a simple list of 3 documents.
corpus = ['one two three everywhere', 'four five six everywhere', 'seven eight nine everywhere']
# below we call fit_transform on the corpus and get the feature names.
X = cv.fit_transform(corpus)
vocab = cv.get_feature_names_out()
print(vocab)
print(X.toarray())
print(cv.stop_words_)
```

We get:

```
['eight' 'everywhere' 'five' 'four' 'nine' 'one' 'seven' 'six' 'three' 'two']
[[0 1 0 0 0 1 0 0 1 1]
 [0 1 1 1 0 0 0 1 0 0]
 [1 1 0 0 1 0 1 0 0 0]]
set()
```
All tokens are kept. There are no stopwords.
Further messing around with the arguments will clarify other configurations.
For fun and insight, I'd also recommend playing around with `stop_words='english'` and seeing that, peculiarly, all the words except 'seven' are removed! Including 'everywhere'.
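A sketch of that experiment (which tokens survive depends on scikit-learn's frozen built-in English stop-word list; per the observation above, it contains the number words and 'everywhere' but, oddly, not 'seven'):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['one two three everywhere',
          'four five six everywhere',
          'seven eight nine everywhere']

# the built-in English stop-word list swallows 'one'..'nine'
# and 'everywhere', but 'seven' is not on it
cv = CountVectorizer(stop_words='english')
cv.fit(corpus)
print(sorted(cv.vocabulary_))
```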
Solution 5
The goal of `min_df` is to ignore words that have too few occurrences to be considered meaningful. For example, in your text you may have names of people that appear in only 1 or 2 documents. In some applications, this may qualify as noise and could be eliminated from further analysis. Similarly, you can ignore words that are too common with `max_df`.

Instead of using a minimum/maximum term frequency (total occurrences of a word) to eliminate words, `min_df` and `max_df` look at how many documents contained a term, better known as document frequency. The threshold values can be an absolute value (e.g. 1, 2, 3, 4) or a value representing a proportion of documents (e.g. 0.25, meaning: ignore words that have appeared in 25% of the documents).
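A tiny hand-rolled sketch of the difference between the two counts (the corpus is invented for the example):

```python
# a hypothetical three-document corpus
docs = ["alice met bob", "bob met carol", "bob bob bob"]
term = "bob"

# term frequency: total occurrences of the word across the corpus
tf = sum(doc.split().count(term) for doc in docs)
# document frequency: number of documents containing the word at least once
df = sum(term in doc.split() for doc in docs)

print(tf, df)  # 5 3
```

`min_df` and `max_df` threshold on the second number (3 here), never on the first.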
Updated on November 10, 2021

Comments
- moeabdol over 2 years: I have five text files that I input to a CountVectorizer. When specifying `min_df` and `max_df` to the CountVectorizer instance, what does the min/max document frequency exactly mean? Is it the frequency of a word in its particular text file or is it the frequency of the word in the entire overall corpus (five text files)? What are the differences when `min_df` and `max_df` are provided as integers or as floats? The documentation doesn't seem to provide a thorough explanation nor does it supply an example to demonstrate the use of these two parameters. Could someone provide an explanation or an example demonstrating `min_df` and `max_df`?
- Monica Heddneck over 8 years: This is confusing. The documentation for `min_df` says "ignore terms that have a document frequency strictly lower than the given threshold." So frequency strictly lower than the default of 1 would mean to ignore terms that never appear (!) but keep terms that appear once.
- Kevin Markham about 8 years: @MonicaHeddneck is correct. This answer misinterprets the precise meanings of `min_df` and `max_df`. I added an answer that explains exactly how these parameters are interpreted by CountVectorizer.
- Isopycnal Oscillation over 4 years: Thank you - this is the same conclusion I have reached independently.
- jeremy_rutman about 4 years: Yeah, this answer is somewhat wrong, as @MonicaHeddneck and Kevin pointed out, both for `min_df` and `max_df`.
- lgc_ustc almost 3 years: Wish this appeared in the official doc to avoid much unclarity and confusion.