How to split data (raw text) into test/train sets with scikit-learn's cross-validation module?


Suppose your data is a list of strings, i.e.

data = ["....", "...", ]

Then you can split it into training (80%) and test (20%) sets using train_test_split e.g. by doing:

from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
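
If each opinion also carries a label (e.g. a sentiment class), the same call can split the texts and labels in parallel while keeping the rows aligned. A minimal sketch, with hypothetical example data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical example data: 10 short opinions with binary labels
texts = ["good product", "bad service", "great value", "terrible",
         "works fine", "loved it", "hated it", "decent enough",
         "excellent buy", "poor quality"]
labels = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Splits texts and labels together; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```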

Before you rush into doing this, though, read the docs through. 2,500 opinions is not a "large corpus", and you probably want something like k-fold cross-validation rather than a single holdout split.
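
For the cross-validation route with the tf-idf representation the question mentions, the usual pattern is to wrap the vectorizer and a classifier in a `Pipeline` so the tf-idf vocabulary is refit on each training fold (avoiding leakage), then score it with `cross_val_score`. A sketch under assumptions: the data is labeled for classification, and `LogisticRegression` is just a stand-in classifier.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy corpus; replace with your 2,500 raw-text opinions
texts = [
    "great product, would buy again", "terrible, broke after a day",
    "really happy with this purchase", "awful customer service",
    "works exactly as described", "complete waste of money",
    "excellent quality for the price", "very disappointed overall",
    "fast shipping and well packaged", "arrived damaged and unusable",
    "five stars, highly recommend", "do not buy this item",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# tf-idf is fit inside each fold, so test folds never leak into the vocabulary
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# One accuracy score per fold; use a larger cv (e.g. 5 or 10) on real data
scores = cross_val_score(model, texts, labels, cv=3)
print(scores.mean())
```

With 2,500 documents, 5- or 10-fold cross-validation gives a much more stable estimate than a single 80/20 split.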

Author: anon

Updated on June 15, 2022

Comments

  • anon, almost 2 years ago

    I have a large corpus of opinions (2,500) in raw text. I would like to use the scikit-learn library to split them into test/train sets. What would be the best approach to solve this task with scikit-learn? Could anybody provide me an example of splitting raw text into test/train sets (I'll probably use a tf-idf representation)?