Create data partition into training, testing and validation - split in R

r machine-learning classification r-caret

17,016

One method uses the sample() function in base R to assign each row to one of three groups:

splitSample <- sample(1:3, size=nrow(data.hex), prob=c(0.7,0.15,0.15), replace = TRUE)
train.hex <- data.hex[splitSample==1,]
valid.hex <- data.hex[splitSample==2,]
test.hex <- data.hex[splitSample==3,]


Author by

Mahsolid

Updated on June 04, 2022

Comments

Mahsolid about 2 years

I want to split my training data into 70% training, 15% testing and 15% validation. I am using the createDataPartition() function of the caret package, and I am currently splitting the data as follows:

library(caret)

train <- read.csv("Train.csv")
test <- read.csv("Test.csv")

# Stratified 70/30 split on age
split <- 0.70
trainIndex <- createDataPartition(train$age, p = split, list = FALSE)
data_train <- train[trainIndex, ]
data_test <- train[-trainIndex, ]

Is there a way to split the data into training, testing and validation sets using createDataPartition(), similar to the following H2O approach?

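library(h2o)
h2o.init()  # start or connect to a local H2O cluster

data.hex <- h2o.importFile("Train.csv")
# ratios covers the first two frames; the remaining rows go to the third
splits <- h2o.splitFrame(data.hex, ratios = c(0.7, 0.15), destination_frames = c("train", "valid", "test"))
train.hex <- splits[[1]]
valid.hex <- splits[[2]]
test.hex  <- splits[[3]]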
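As far as I know, createDataPartition() only produces a two-way partition per call, so one possible approach is to call it twice: first hold out 70% for training, then split the remaining 30% in half. The sketch below uses a made-up data frame in place of Train.csv; the seed, the stand-in data and the names remainder, validIndex and data_valid are assumptions for illustration, not part of the original question.

```r
library(caret)

set.seed(42)  # arbitrary seed, used only so the example is reproducible

# Stand-in data (assumption); in the question this would be read.csv("Train.csv")
train <- data.frame(age = sample(18:80, 1000, replace = TRUE),
                    x   = rnorm(1000))

# Step 1: take roughly 70% of the rows for training
trainIndex <- createDataPartition(train$age, p = 0.70, list = FALSE)
data_train <- train[trainIndex, ]
remainder  <- train[-trainIndex, ]

# Step 2: split the remaining ~30% in half, giving ~15% validation and ~15% test
validIndex <- createDataPartition(remainder$age, p = 0.50, list = FALSE)
data_valid <- remainder[validIndex, ]
data_test  <- remainder[-validIndex, ]

# Resulting proportions; these are approximate because the sampling is stratified
c(train = nrow(data_train), valid = nrow(data_valid), test = nrow(data_test)) / nrow(train)
```

Because createDataPartition() samples within groups of the outcome, the three pieces should have similar age distributions, but their sizes will usually be close to 70/15/15 rather than exact.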

Mahsolid about 8 years

> nrow(data.hex)
[1] 25192
> nrow(train.hex)
[1] 8398
> valid.hex <- data.hex[splitSample==2,]
> nrow(valid.hex)
[1] 8397
> test.hex<- data.hex[splitSample==3,]
> nrow(test.hex)
[1] 8397

But the difference between them is only 1. Is this correct?

lmo about 8 years

Oops. Forgot the size argument.

lmo about 8 years

Note that this is (quasi) random, so the proportions of the three sets will be approximately 0.7, 0.15 and 0.15, but not exactly. For reproducibility, set the seed before the first line: set.seed(some integer)
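If exact set sizes matter, one alternative (not the method from the answer above) is to shuffle the row indices once and cut the permutation at fixed positions. This is a minimal sketch in base R; the data frame df, the seed and the variable names are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed so the split can be reproduced

# Stand-in data (assumption); replace with your own data frame
df <- data.frame(id = 1:1000, x = rnorm(1000))

n        <- nrow(df)
shuffled <- sample(n)          # random permutation of the row indices 1..n
n_train  <- floor(0.70 * n)    # exact number of training rows
n_valid  <- floor(0.15 * n)    # exact number of validation rows

# Cut the permutation into three consecutive, non-overlapping pieces
train_df <- df[shuffled[seq_len(n_train)], ]
valid_df <- df[shuffled[n_train + seq_len(n_valid)], ]
test_df  <- df[shuffled[(n_train + n_valid + 1):n], ]   # whatever is left over
```

With 1000 rows this gives 700, 150 and 150 rows. When n is not divisible evenly, the rows left over from rounding end up in the test set.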