Create data partition into training, testing and validation - split in R
One method uses the sample() function in base R:
splitSample <- sample(1:3, size=nrow(data.hex), prob=c(0.7,0.15,0.15), replace = TRUE)
train.hex <- data.hex[splitSample==1,]
valid.hex <- data.hex[splitSample==2,]
test.hex <- data.hex[splitSample==3,]
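For readers who want to stay with caret, as the question asks, one possible approach is to call createDataPartition() twice: once to carve out the training set, then again on the remaining rows with p = 0.5 to divide them into validation and test halves. This is a minimal sketch that assumes the caret package is installed. The data frame `df` and its contents are made up for illustration; only the `age` column name is taken from the question.

```r
library(caret)

set.seed(42)  # arbitrary seed, only so the split is repeatable

# Hypothetical stand-in for the data read from Train.csv
df <- data.frame(age = sample(18:80, 1000, replace = TRUE),
                 x   = rnorm(1000))

# Stage 1: about 70% of rows go to training, stratified on age
trainIndex <- createDataPartition(df$age, p = 0.70, list = FALSE)
data_train <- df[trainIndex, ]
rest       <- df[-trainIndex, ]

# Stage 2: split the remaining ~30% in half (about 15% + 15% overall)
validIndex <- createDataPartition(rest$age, p = 0.50, list = FALSE)
data_valid <- rest[validIndex, ]
data_test  <- rest[-validIndex, ]

# Row counts of the three pieces; together they cover every row once
c(train = nrow(data_train), valid = nrow(data_valid), test = nrow(data_test))
```

Because createDataPartition() samples within groups of the outcome, the resulting sizes are close to 70/15/15 but need not match them exactly.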
Authored by Mahsolid
Updated on June 04, 2022
Comments
-
Mahsolid about 2 years
I want to split my training data into 70% training, 15% testing and 15% validation. I am using the createDataPartition() function of the caret package, and I am splitting the data as follows:
train <- read.csv("Train.csv")
test <- read.csv("Test.csv")
split=0.70
trainIndex <- createDataPartition(train$age, p=split, list=FALSE)
data_train <- train[ trainIndex,]
data_test <- train[-trainIndex,]
Is there any way of splitting into training, testing and validation sets using createDataPartition(), like the following H2O approach?
data.hex <- h2o.importFile("Train.csv")
splits <- h2o.splitFrame(data.hex, c(0.7,0.15), destination_frames = c("train","valid","test"))
train.hex <- splits[[1]]
valid.hex <- splits[[2]]
test.hex <- splits[[3]]
-
Mahsolid about 8 years
> nrow(data.hex)
[1] 25192
> nrow(train.hex)
[1] 8398
> valid.hex <- data.hex[splitSample==2,]
> nrow(valid.hex)
[1] 8397
> test.hex <- data.hex[splitSample==3,]
> nrow(test.hex)
[1] 8397
But the difference between them is only 1. Is this correct?
-
lmo about 8 years
Oops. Forgot the size argument.
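The near-equal row counts reported above are consistent with what happens on an ordinary data frame when size is left out, although the H2O frame in the thread may behave differently, so treat this as an illustration only. Without size, sample() draws only as many values as there are elements in its first argument (three here), and a logical index of length three is recycled down the rows, so the group sizes no longer follow prob. The data frame `df` below is hypothetical.

```r
# Hypothetical 10-row data frame for illustration
df <- data.frame(id = 1:10)

# Without size, sample() returns length(1:3) = 3 values, not one per row
short <- sample(1:3, prob = c(0.7, 0.15, 0.15), replace = TRUE)
length(short)

# A length-3 logical index is recycled across all 10 rows
mask   <- c(TRUE, FALSE, FALSE)
picked <- df[mask, , drop = FALSE]
picked$id   # rows 1, 4, 7, 10

# With size = nrow(df) there is one group label per row
full <- sample(1:3, size = nrow(df), prob = c(0.7, 0.15, 0.15), replace = TRUE)
length(full)
```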
-
lmo about 8 years
Note that the assignment is (quasi) random, so the three groups will contain approximately 70%, 15% and 15% of the rows, but not exactly. To make the split reproducible, set the seed above the first line:
set.seed(some integer)
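Putting these comments together, a reproducible version of the split might look like the sketch below. The seed value 123 and the data frame `df` are arbitrary placeholders, not values from the thread. The second half shows one possible variation, not part of the original answer, that shuffles the row numbers and cuts them at fixed positions when exact 70/15/15 sizes are wanted.

```r
set.seed(123)  # arbitrary integer; any fixed value makes the split repeatable

# Hypothetical data frame standing in for the real data
df <- data.frame(id = 1:1000, x = rnorm(1000))
n  <- nrow(df)

# Random assignment: group proportions are only approximately 0.7/0.15/0.15
splitSample <- sample(1:3, size = n, prob = c(0.7, 0.15, 0.15), replace = TRUE)
table(splitSample) / n

# Variation with exact sizes: shuffle the row numbers, then cut at fixed points
shuffled <- sample(n)
nTrain   <- round(0.70 * n)
nValid   <- round(0.15 * n)

train <- df[shuffled[1:nTrain], ]
valid <- df[shuffled[(nTrain + 1):(nTrain + nValid)], ]
test  <- df[shuffled[(nTrain + nValid + 1):n], ]

# Sizes of the three exact-size pieces
c(train = nrow(train), valid = nrow(valid), test = nrow(test))
```

The random-assignment form is shorter and fine for large data; the shuffle-and-cut form is worth considering when the set sizes have to be reported exactly.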