Random forest on a big dataset

You're likely asking randomForest to compute the proximity matrix for the data, which, if you think about it, will be enormous: 1 million x 1 million, or 10^12 entries (about 8 TB if stored as doubles). A matrix of this size is required no matter how small you set sampsize, because proximities are computed between every pair of rows. Indeed, Googling the error message seems to confirm this: the package author states that the only place in the entire source code where "n,n)" appears is in the calculation of the proximity matrix.
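To make the size argument concrete, here is a minimal sketch in R. It first works out the memory an n x n matrix of doubles would need, then fits a small regression forest with proximity explicitly switched off. The data frame `d` and its columns are hypothetical stand-ins for the real data, and whether proximity was actually requested in the original call is an assumption, since the code was not posted.

```r
library(randomForest)

# Memory needed for an n x n matrix of doubles (8 bytes each).
n_full <- 1e6
bytes_full <- n_full^2 * 8     # 8e12 bytes, i.e. about 8 TB
n_sub <- 1e4
bytes_sub <- n_sub^2 * 8       # 8e8 bytes, i.e. about 800 MB even for 10,000 rows

# Hypothetical stand-in data: 5 numeric predictors plus a numeric response y.
set.seed(1)
d <- data.frame(matrix(rnorm(2000 * 5), ncol = 5))
d$y <- rowSums(d[, 1:5]) + rnorm(2000)

# Regression forest with proximity turned off, so no n x n matrix is allocated.
fit <- randomForest(x = d[, 1:5], y = d$y,
                    ntree = 50,
                    proximity = FALSE)

# No proximity matrix is stored on the fitted object.
is.null(fit$proximity)
```

If the response is accidentally left out of the call, randomForest runs in unsupervised mode, where proximity is computed by default, so it is worth checking that `y` (or the formula) is really being passed.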

But it's hard to help more, given that you've provided no details about the actual code you're using.
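If memory is still tight with proximity disabled, one possible workaround is to grow several small forests on random subsets of the rows and merge them with the package's combine() function. The sketch below is only an illustration under assumptions: the data frame `d`, the helper `fit_chunk`, the number of chunks, and the ntree and nodesize values are all made up, not taken from the question.

```r
library(randomForest)

set.seed(42)
# Hypothetical stand-in for the real data set.
n <- 5000
d <- data.frame(matrix(rnorm(n * 5), ncol = 5))
d$y <- d$X1 - 2 * d$X2 + rnorm(n)

# Fit one small forest on a given set of row indices.
# A larger nodesize gives shallower trees, which use less memory.
fit_chunk <- function(rows) {
  randomForest(x = d[rows, 1:5], y = d$y[rows],
               ntree = 25, nodesize = 20,
               proximity = FALSE)
}

# Randomly split the row indices into 4 disjoint chunks.
chunks <- split(sample(n), rep(1:4, length.out = n))
forests <- lapply(chunks, fit_chunk)

# Merge the four 25-tree forests into a single 100-tree forest.
big <- do.call(combine, unname(forests))

# The combined forest predicts like any other randomForest object.
pred <- predict(big, d[1:5, 1:5])
```

Note that a combined forest no longer carries the out-of-bag error summaries of its parts, so accuracy would need to be checked on held-out rows.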

Author: ktdrv

Updated on July 17, 2022

Comments

  • ktdrv

    I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data, even down to 10,000 or so observations.

    Since there is no chance I can add more RAM to my machine, and random forests are well suited to the type of process I am trying to model, I'd really like to make this work.

    Any suggestions or workaround ideas are much appreciated.