Create data partition into training, testing and validation - split in R

One approach uses the sample() function in base R:

splitSample <- sample(1:3, size=nrow(data.hex), prob=c(0.7,0.15,0.15), replace = TRUE)
train.hex <- data.hex[splitSample==1,]
valid.hex <- data.hex[splitSample==2,]
test.hex <- data.hex[splitSample==3,]
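
To see how the sample() approach behaves in practice, here is a minimal sketch on a made-up data frame. The name `df`, its columns, and the seed value are assumptions for illustration only; they stand in for `data.hex` and are not part of the original answer.

```r
set.seed(42)  # arbitrary seed, used only so the split is repeatable

# Hypothetical stand-in for the real data
df <- data.frame(id = seq_len(10000), x = rnorm(10000))

# Draw a label 1, 2 or 3 for every row with probabilities 0.7 / 0.15 / 0.15
splitSample <- sample(1:3, size = nrow(df), prob = c(0.7, 0.15, 0.15), replace = TRUE)

train <- df[splitSample == 1, ]
valid <- df[splitSample == 2, ]
test  <- df[splitSample == 3, ]

# Realised proportions: close to the targets, but generally not exact
print(round(prop.table(table(splitSample)), 3))
```

Because each row gets its label independently, every row lands in exactly one subset, but the subset sizes vary from run to run unless the seed is fixed.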
Author by

Mahsolid

Updated on June 04, 2022

Comments

  • Mahsolid
    Mahsolid about 2 years

    I want to split my training data into 70% training, 15% testing and 15% validation. I am using the createDataPartition() function of the caret package, splitting it as follows:

    train <- read.csv("Train.csv")
    test <- read.csv("Test.csv")
    
    split=0.70
    trainIndex <- createDataPartition(train$age, p=split, list=FALSE)
    data_train <- train[ trainIndex,]
    data_test <- train[-trainIndex,]
    

    Is there any way to split into training, testing and validation sets using createDataPartition(), like the following H2O approach?

    data.hex <- h2o.importFile("Train.csv")
    splits <- h2o.splitFrame(data.hex, c(0.7,0.15), destination_frames = c("train","valid","test"))
    train.hex <- splits[[1]]
    valid.hex <- splits[[2]]
    test.hex  <- splits[[3]]
    
  • Mahsolid
    Mahsolid about 8 years
    > nrow(data.hex)
    [1] 25192
    > nrow(train.hex)
    [1] 8398
    > valid.hex <- data.hex[splitSample==2,]
    > nrow(valid.hex)
    [1] 8397
    > test.hex <- data.hex[splitSample==3,]
    > nrow(test.hex)
    [1] 8397

    But the difference between them is only 1. Is this correct?
  • lmo
    lmo about 8 years
    Oops. Forgot the size argument.
  • lmo
    lmo about 8 years
    Note that this is (quasi) random, so the proportions of the three groups will be approximately 0.7, 0.15 and 0.15, but not exactly. For reproducibility, set the seed before the first line: set.seed(some integer)
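
If exact group sizes are needed rather than approximate ones, one option is to shuffle the row indices once and cut the shuffled vector at fixed positions. The sketch below is a base R variation on the sample() idea, not the method from the answer; `df`, `n_train`, `n_valid` and the seed are hypothetical names and values chosen for illustration.

```r
set.seed(123)  # arbitrary seed for a repeatable shuffle

# Hypothetical stand-in for the real data
df <- data.frame(id = seq_len(1000), x = rnorm(1000))
n  <- nrow(df)

# Random permutation of the row indices (sampling without replacement)
idx <- sample(n)

# Fixed group sizes; the test set takes whatever rows remain
n_train <- round(0.70 * n)
n_valid <- round(0.15 * n)

train <- df[idx[seq_len(n_train)], ]
valid <- df[idx[n_train + seq_len(n_valid)], ]
test  <- df[idx[-seq_len(n_train + n_valid)], ]

print(c(train = nrow(train), valid = nrow(valid), test = nrow(test)))
```

Here the sizes are fixed by construction and only the assignment of rows to groups is random, so fixing the seed still matters if the same rows should end up in each subset on every run.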