Is there a way to guess the size of data.frame based on rows, columns and variable types?


Solution 1

You can simulate an object and estimate the memory needed to store it as an R object using object.size():

# simulate a data.frame with the expected dimensions and column types
m <- matrix(1, nrow = 1e5, ncol = 150)
m <- as.data.frame(m)
m[, 1:20] <- sapply(m[, 1:20], as.character)
# note: sapply() simplifies its result to a matrix, so the next line is unlikely
# to leave true factor columns; lapply(m[, 29:30], as.factor) preserves the class
m[, 29:30] <- sapply(m[, 29:30], as.factor)
object.size(m)
120017224 bytes
print(object.size(m), units = "Gb")
0.1 Gb

Solution 2

You could create dummy variables that store examples of the data you will be storing in the data frame.

Then use object.size() to find their size and scale by the number of rows and columns accordingly.
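
For example, a minimal sketch of this approach for the example in the question (10000 rows; 120 numeric, 20 character and 10 factor columns). The dummy columns below are assumptions for illustration; real sizes depend on the data itself (string lengths, number of factor levels, and so on):

# hypothetical dummy columns: one example column per type, scaled up
n_rows <- 10000

num_col <- rnorm(n_rows)                               # dummy numeric column
chr_col <- sample(c("foo", "bar"), n_rows, TRUE)       # dummy character column
fac_col <- factor(sample(letters[1:5], n_rows, TRUE))  # dummy factor column

# 120 numeric, 20 character and 10 factor columns
est_bytes <- 120 * as.numeric(object.size(num_col)) +
              20 * as.numeric(object.size(chr_col)) +
              10 * as.numeric(object.size(fac_col))

est_bytes / 1024^2  # rough size in Mb, excluding data.frame overhead

The estimate ignores the small overhead of the data.frame itself (names, row.names, class), and character columns in particular can grow if the real strings are long and mostly unique.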

Solution 3

Check out the pryr package as well. It has object_size(), which may be slightly better for you. From Advanced R:

This function is better than the built-in object.size() because it accounts for shared elements within an object and includes the size of environments.

You also need to account for the size of attributes, as well as the column types, etc.:

object.size(attributes(m))
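
A minimal sketch of that comparison, assuming the pryr package is installed and that the dummy data frame m from Solution 1 is still in the workspace:

# install.packages("pryr")  # if not already installed
library(pryr)

object_size(m)              # pryr: accounts for shared elements and environments
object.size(m)              # base R estimate, for comparison
object.size(attributes(m))  # size of m's attributes (names, row.names, class)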
Author: Ajay Ohri

http://about.me/aohri I am a data scientist with a startup at http://decisionstats.org. I have written two books on R: R for Cloud Computing and R for Business Analytics. I blog at DecisionStats.com.

Updated on July 06, 2022

Comments

  • Ajay Ohri, almost 2 years

    I am expecting to generate a lot of data and then catch it in R. How can I estimate the size of the data.frame (and thus the memory needed) from the number of rows, the number of columns and the variable types?

    Example.

    If I have 10000 rows and 150 columns, of which 120 are numeric, 20 are character and 10 are factors, what size of data frame can I expect? Will the results change depending on the data stored in the columns (e.g. max(nchar(column)))?

    > m <- matrix(1,nrow=1e5,ncol=150)
    > m <- as.data.frame(m)
    > object.size(m)
    120009920 bytes
    > a=object.size(m)/(nrow(m)*ncol(m))
    > a
    8.00066133333333 bytes
    > m[,1:150] <- sapply(m[,1:150],as.character)
    > b=object.size(m)/(nrow(m)*ncol(m))
    > b
    4.00098133333333 bytes
    > m[,1:150] <- sapply(m[,1:150],as.factor)
    > c=object.size(m)/(nrow(m)*ncol(m))
    > c
    4.00098133333333 bytes
    > m <- matrix("ajayajay",nrow=1e5,ncol=150)
    > 
    > m <- as.data.frame(m)
    > object.size(m)
    60047120 bytes
    > d=object.size(m)/(nrow(m)*ncol(m))
    > d
    4.00314133333333 bytes
    
  • nicola, almost 9 years
    I guess that the point is to know the size before creating it.
  • agstudy, almost 9 years
    @nicola it is an estimation. You can use it as a reference, assuming that memory allocation is a linear function.
  • Pierre L, almost 9 years
    This solution works because after doing a few examples the user will be able to better estimate their output for other cases.
  • nicola, almost 9 years
    Yes, but what's the point of generating an object so similar to the one we want to create? One could just create the object and see. Perhaps one could instead see how much space a single column of a given type requires and do some math. This may work even for very large objects.
  • agstudy, almost 9 years
    @nicola I think you oversimplify how memory is allocated. I am not sure what you mean by "some math", but I don't think you can get the memory size with simple addition. Take a look at memory.profile().
  • Ajay Ohri, almost 9 years
    Because of the expected variance in object size, and the practicalities of working in a server environment with many processes and users running code, memory estimation is useful.
  • nicola, almost 9 years
    In your example, for instance, see how close object.size(m) and object.size(m[[1]])*ncol(m) are. There is no reason to generate all 150 columns (see the sketch below).
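
A minimal sketch of that single-column extrapolation, using the all-numeric dummy data frame from the question (the exact difference will vary by R version and platform):

# same dummy data.frame as in the question: 1e5 rows, 150 numeric columns
m <- as.data.frame(matrix(1, nrow = 1e5, ncol = 150))

full_bytes    <- as.numeric(object.size(m))                 # whole data.frame
per_col_bytes <- as.numeric(object.size(m[[1]])) * ncol(m)  # one column, scaled up

c(full = full_bytes, scaled = per_col_bytes) / 1024^2       # both in Mb

The two numbers should be close; the gap is the per-object overhead of the data.frame itself (names, row.names, class). For mixed column types, measure one column of each type and weight by the number of columns of that type, as in the sketch under Solution 2.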