ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df

python scikit-learn feature-extraction tf-idf

10,019

From the scikit-learn documentation for the TF-IDF vectorizer:

max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
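To make the two thresholds concrete, here is a minimal pure-Python sketch of the pruning rule as the quoted documentation describes it. The helper name `surviving_terms` and the toy corpus are assumptions made for illustration; this is not scikit-learn's own code, only an approximation of the documented behaviour.

```python
from collections import Counter


def surviving_terms(docs, min_df=1, max_df=1.0):
    """Return the terms kept by the min_df/max_df rule quoted above.

    Hypothetical helper: `docs` is a list of token lists. Float thresholds
    are read as proportions of the corpus, integers as absolute counts.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    # Terms strictly below `low` or strictly above `high` are dropped.
    return sorted(t for t, n in df.items() if low <= n <= high)


# Toy corpus (an assumption): "red" is in 3 documents, "sky" in 2, the rest in 1.
docs = [["red", "apple"], ["red", "car"], ["red", "sky"], ["blue", "sky"]]

kept_default = surviving_terms(docs)                        # all five terms
kept_half = surviving_terms(docs, min_df=0.5)               # needs >= 2 documents
kept_band = surviving_terms(docs, min_df=0.5, max_df=0.6)   # also needs <= 2.4 documents
kept_none = surviving_terms(docs, min_df=0.9)               # needs >= 3.6 documents
```

An empty result such as `kept_none` corresponds to the situation in which the real vectorizer raises "After pruning, no terms remain".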

Please check the data type of the variable totalvocab_stemmed_body. If it is a list, each element of the list is treated as a separate document.
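The effect of passing a flat list of tokens can be sketched as follows. The token list and the regrouped documents are made-up data, and the guess that totalvocab_stemmed_body is such a flat list is only an assumption based on its name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: a flat list of stemmed tokens, not a list of documents.
tokens = ["run", "fast", "dog", "jump", "high", "cat"]

try:
    # Each token is treated as its own one-word document, so every term has a
    # document frequency of 1, which is below min_df * 6 = 3.
    TfidfVectorizer(min_df=0.5).fit(tokens)
    flat_error = None
except ValueError as exc:
    flat_error = str(exc)

# Passing whole documents instead lets terms recur across documents.
documents = ["run fast dog", "dog jump high", "cat run high"]
vectorizer = TfidfVectorizer(min_df=0.5).fit(documents)
kept = sorted(vectorizer.vocabulary_)
```

With the regrouped documents, only the terms that occur in at least half of the three documents are kept.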

Case 1: Number of documents = 2,000,000, min_df=0.5.

If you have a large number of files (say 2 million), each containing only a few words and drawn from very different domains, there is very little chance that any term appears in at least 1,000,000 (2,000,000 * 0.5) documents.

Case 2: Number of documents = 200, max_df=0.95.

If you have a set of repeated files (say 200), most terms will be present in most of the documents. With max_df=0.95, you are telling the vectorizer to ignore terms that are present in more than 190 files. In this case, nearly all terms are repeated across the files, so the vectorizer will not find any terms for the matrix.
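Case 2 can be reproduced with a small sketch. The corpus of 200 identical documents is an assumption chosen to exaggerate the repetition; real data would be less uniform.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus for Case 2: 200 copies of the same short document.
repeated = ["the quick brown fox"] * 200

try:
    # Every term occurs in all 200 documents, which is more than
    # 0.95 * 200 = 190, so every term is pruned.
    TfidfVectorizer(max_df=0.95).fit(repeated)
    case2_error = None
except ValueError as exc:
    case2_error = str(exc)

# Raising max_df back to its default of 1.0 keeps terms found in every document.
relaxed = TfidfVectorizer(max_df=1.0).fit(repeated)
kept = sorted(relaxed.vocabulary_)
```

Whether to relax max_df, lower min_df, or deduplicate the corpus depends on the data; the sketch only shows that the error disappears once the threshold no longer excludes every term.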

This is my thought on this topic.


Author by

Jeet Dadhich

Updated on June 27, 2022

Comments

Jeet Dadhich almost 2 years

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95, min_df=0.5, max_features=200000,
    stop_words='english', use_idf=True, sublinear_tf=True,
    tokenizer=tokenize_and_stem_body, ngram_range=(1, 3),
)
tfidf_matrix_body = tfidf_vectorizer.fit_transform(totalvocab_stemmed_body)

The above code gives me the error

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.

Can anyone help me with this? I have changed all the values from 80 to 100, but the issue remains the same.
