How can I speed up the training of my random forest?
Solution 1
While I'm a fan of brute force techniques, such as parallelization or running code for an extremely long time, I am an even bigger fan of improving an algorithm to avoid having to use a brute force technique.
While training your random forest using 2000 trees was starting to get prohibitively expensive, training with a smaller number of trees took a more reasonable time. For starters, you can train with, say, 4, 8, 16, 32, ..., 256, 512 trees and carefully observe metrics which let you know how robust the model is. These metrics include a comparison against the best constant model (how well your forest performs on the data set versus a model which predicts the median for all inputs), as well as the out-of-bag error. In addition, you can observe the top predictors and their importance, and whether you start to see a convergence there as you add more trees.
Ideally, you should not have to use thousands of trees to build a model. Once your model begins to converge, adding more trees won't necessarily worsen the model, but at the same time it won't add any new information. By avoiding using too many trees you may be able to cut down a calculation which would have taken on the order of a week to less than a day. If, on top of this, you leverage a dozen CPU cores, then you might be looking at something on the order of hours.
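As a rough sketch of that convergence check: for a regression forest, the fitted randomForest object stores the out-of-bag MSE after each tree in its mse component, so one run with the largest tree count is enough to read off the error at the smaller counts. The built-in mtcars data and the checkpoints vector are stand-ins chosen for illustration, not the data from the question.

```r
library(randomForest)

set.seed(42)  # reproducible illustration

# Stand-in data (assumption): mtcars ships with R; mpg plays the numeric target.
x <- mtcars[, -1]
y <- mtcars$mpg

# One fit with the largest tree count of interest.
fit <- randomForest(x = x, y = y, ntree = 512)

# For regression, fit$mse[k] is the out-of-bag MSE using the first k trees.
checkpoints <- c(4, 8, 16, 32, 64, 128, 256, 512)
oob_curve <- data.frame(ntree = checkpoints, oob_mse = fit$mse[checkpoints])
print(oob_curve)
```

If the oob_mse column has flattened out well before the last row, the extra trees are probably not buying much. Calling plot(fit) draws the same curve for every tree count.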
To look at variable importance after each random forest run, you can try something along the lines of the following:
fit <- randomForest(...)
round(importance(fit), 2)
It is my understanding that the first, say, 5-10 predictors have the greatest impact on the model. If you notice that, as you increase the number of trees, these top predictors don't really change position relative to each other and the importance metrics seem to stay the same, then you might want to consider not using so many trees.
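One way to make that comparison concrete is to refit at several tree counts and line up the top-ranked predictors side by side. This is only a sketch: mtcars stands in for the real data, the tree_counts vector and the choice of the top 5 are arbitrary, and IncNodePurity is the importance column randomForest reports for regression when importance = TRUE is not requested.

```r
library(randomForest)

set.seed(42)  # reproducible illustration

# Tree counts to compare (assumption: any increasing sequence works).
tree_counts <- c(4, 8, 16, 32, 64, 128, 256, 512)

# For each count, fit a forest and keep the names of the 5 most important predictors.
rankings <- sapply(tree_counts, function(n) {
  fit <- randomForest(x = mtcars[, -1], y = mtcars$mpg, ntree = n)
  imp <- importance(fit)[, "IncNodePurity"]
  names(sort(imp, decreasing = TRUE))[1:5]
})
colnames(rankings) <- tree_counts

# One column per tree count, rows ordered from most to least important.
print(rankings)
```

When neighbouring columns list the same predictors in the same order, the ranking has likely stabilised.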
Solution 2
The randomForest() function can accept data using either the "formula interface" or the "matrix interface". The matrix interface is known to deliver much better performance figures.
Formula interface:
rf.formula = randomForest(Species ~ ., data = iris)
Matrix interface:
rf.matrix = randomForest(y = iris[, 5], x = iris[, 1:4])
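To see how much the interface matters on a given machine, both calls can be timed on the same data. The sketch below uses simulated data (21 numeric predictors, as in the question, but far fewer rows) and a small ntree so it finishes quickly; the sizes and variable names are assumptions made for illustration, and the measured gap will vary with the data.

```r
library(randomForest)

set.seed(1)  # reproducible illustration

# Simulated stand-in data (assumption): 2000 rows, 21 numeric predictors.
n <- 2000
p <- 21
x <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
y <- x[[1]] + rnorm(n)
train <- cbind(y = y, x)

# Same model through the two interfaces; only the calling convention differs.
t_formula <- system.time(randomForest(y ~ ., data = train, ntree = 50))
t_matrix  <- system.time(randomForest(x = x, y = y, ntree = 50))

print(rbind(formula = t_formula, matrix = t_matrix)[, "elapsed", drop = FALSE])
```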
Solution 3
The other two answers are good. Another option is to use more recent packages that are purpose-built for high-dimensional / high-volume data sets. They run their code using lower-level languages (C, C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look into these three:
- ranger (implemented in C++)
- randomForestSRC (implemented in C)
- h2o (implemented in Java - needs Java version 8 or higher)
Also, some additional reading here to give you more to go off on which package to choose: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks of ranger against randomForest for growing data sizes. ranger is WAY faster because its runtime grows linearly, rather than non-linearly as for randomForest, as the number of trees, samples, splits, or features rises.
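For reference, a minimal ranger call might look like the sketch below. mtcars is a stand-in for the real data, and the values of num.trees and num.threads are arbitrary choices for illustration; num.threads would normally be set to however many cores you want to use.

```r
library(ranger)

# Stand-in data (assumption): mtcars ships with R; mpg is the numeric target.
fit <- ranger(
  mpg ~ .,
  data        = mtcars,
  num.trees   = 500,
  importance  = "impurity",  # keep impurity-based variable importance
  num.threads = 2,           # trees are grown in parallel across threads
  seed        = 42           # reproducible illustration
)

# Out-of-bag prediction error (MSE for regression).
print(fit$prediction.error)

# Predictors ordered from most to least important.
print(sort(fit$variable.importance, decreasing = TRUE))
```

Because ranger parallelizes internally, there is no need for a foreach loop or a manual combine step.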
Good Luck!
François M.
Updated on July 13, 2022
Comments
-
François M. almost 2 years
I'm trying to train several random forests (for regression) to have them compete and see which feature selection and which parameters give the best model.
However the trainings seem to take an insane amount of time, and I'm wondering if I'm doing something wrong.
The dataset I'm using for training (called train below) has 217k lines and 58 columns (of which only 21 serve as predictors in the random forest). They're all numeric or integer, with the exception of a boolean one, which is of class character. The output is numeric.
I ran the following code four times, giving the values 4, 100, 500, 2000 to nb_trees:
library("randomForest")
nb_trees <- 4 # this changes with each test, see above
ptm <- proc.time()
fit <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21, data = train, ntree = nb_trees, do.trace = TRUE)
proc.time() - ptm
Here is how long each of them took to train:
nb_trees | time
4 | 4mn
100 | 1h 41mn
500 | 8h 40mn
2000 | 34h 26mn
As my company's server has 12 cores and 125 GB of RAM, I figured I could try to parallelize the training, following this answer (however, I used the doParallel package because it seemed to be running forever with doSNOW, I don't know why. And I can't find where I saw that doParallel would work too, sorry).
library("randomForest")
library("foreach")
library("doParallel")
nb_trees <- 1 # trees per core; this changes with each test, see table below
nb_cores <- 4 # this changes with each test, see table below
cl <- makeCluster(nb_cores)
registerDoParallel(cl)
ptm <- proc.time()
fit <- foreach(ntree = rep(nb_trees, nb_cores), .combine = combine, .packages = "randomForest") %dopar% {
  randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21, data = train, ntree = ntree, do.trace = TRUE)
}
proc.time() - ptm
stopCluster(cl)
When I run it, it takes a shorter time than non-parallelized code:
nb_trees | nb_cores | total number of trees | time
1 | 4 | 4 | 2mn13s
10 | 10 | 100 | 52mn
9 | 12 | 108 (closest to 100 with 12 cores) | 59mn
42 | 12 | 504 (closest to 500 with 12 cores) | I won't be running this one
167 | 12 | 2004 (closest to 2000 with 12 cores) | I'll run it next week-end
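The totals above overshoot the target (108, 504, 2004) because every core gets the same number of trees. If an exact total matters, one possible approach is to hand out uneven per-core counts. The helper below, split_trees, is a hypothetical name introduced here for illustration; its result could be passed to foreach in place of rep(nb_trees, nb_cores).

```r
# Hypothetical helper: split `total` trees across `cores` workers as evenly
# as possible, so the per-core counts sum to exactly `total`.
split_trees <- function(total, cores) {
  base  <- total %/% cores   # trees every core gets
  extra <- total %% cores    # leftover trees, one each for the first cores
  rep(base, cores) + c(rep(1, extra), rep(0, cores - extra))
}

tree_counts <- split_trees(500, 12)
print(tree_counts)       # 42 42 42 42 42 42 42 42 41 41 41 41
print(sum(tree_counts))  # 500
```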
However, I think it's still taking a lot of time, isn't it? I'm aware it takes time to combine the trees into the final forest, so I didn't expect it to be 12 times faster with 12 cores, but it's only ~2 times faster...
- Is this normal?
- If it isn't, is there anything I can do with my data and/or my code to radically decrease the running time?
- If not, should I tell the guy in charge of the server that it should be much faster?
Thanks for your answers.
Notes:
- I'm the only one using this server
- for my next tests, I'll get rid of the columns that are not used in the random forest
- I realized quite late that I could improve the running time by calling randomForest(predictors, decision) instead of randomForest(decision ~ ., data = input), and I'll be doing it from now on, but I think my questions above still hold.
-
François M. almost 8 years
Thanks for your advice. I know I can see how the OOB error evolves as a function of the number of trees with do.trace = TRUE (with non-parallelized code only though, as far as I know). Is there a similar parameter to also see how the top predictors evolve? (So that I can run the training only once, with 512 trees.)
-
Tim Biegeleisen almost 8 years
@fmalaussena I updated my answer, please have a look.
-
François M. almost 8 years
Thanks. Do you know if this is specific to randomForest() or does it also work if I use method = 'rf' in caret? And what about method = 'ranger'?
-
user1808924 almost 8 years
IIRC, caret performs method invocations using the "Matrix interface".