How to split data (raw text) into test/train sets with scikit-learn's cross-validation module?


Suppose your data is a list of strings, i.e.

data = ["....", "...", ]

Then you can split it into training (80%) and test (20%) sets using train_test_split e.g. by doing:

from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
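
If each opinion also carries a label (e.g. a sentiment class), the same call can split the texts and labels in parallel while keeping the rows aligned. A minimal sketch, with hypothetical example data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical example data: 10 short opinions with binary labels
texts = ["good product", "bad service", "great value", "terrible",
         "works fine", "loved it", "hated it", "decent enough",
         "excellent buy", "poor quality"]
labels = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Splits texts and labels together; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```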

Before you rush into doing this, though, read the docs through. 2,500 opinions is not a "large corpus", and you probably want something like k-fold cross-validation rather than a single holdout split.
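
For the cross-validation route with the tf-idf representation the question mentions, the usual pattern is to wrap the vectorizer and a classifier in a `Pipeline` so the tf-idf vocabulary is refit on each training fold (avoiding leakage), then score it with `cross_val_score`. A sketch under assumptions: the data is labeled for classification, and `LogisticRegression` is just a stand-in classifier.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy corpus; replace with your 2,500 raw-text opinions
texts = [
    "great product, would buy again", "terrible, broke after a day",
    "really happy with this purchase", "awful customer service",
    "works exactly as described", "complete waste of money",
    "excellent quality for the price", "very disappointed overall",
    "fast shipping and well packaged", "arrived damaged and unusable",
    "five stars, highly recommend", "do not buy this item",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# tf-idf is fit inside each fold, so test folds never leak into the vocabulary
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# One accuracy score per fold; use a larger cv (e.g. 5 or 10) on real data
scores = cross_val_score(model, texts, labels, cv=3)
print(scores.mean())
```

With 2,500 documents, 5- or 10-fold cross-validation gives a much more stable estimate than a single 80/20 split.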

Author: anon

Updated on June 15, 2022

Comments

  • anon, almost 2 years ago

    I have a large corpus of opinions (2,500) in raw text. I would like to use the scikit-learn library to split them into test/train sets. What would be the best approach to solve this task with scikit-learn? Could anybody provide me an example of splitting raw text into test/train sets (I'll probably use a tf-idf representation)?