Plotting of very large data sets in R


Solution 1

As a supplement to my comment on Dmitri's answer, here is a function to calculate quantiles using the ff big-data handling package:

ffquantile <- function(ffv, qs = c(0, 0.25, 0.5, 0.75, 1), ...){
  # requires the ff package (ffsort() may also need the ffbase extension)
  stopifnot(all(qs <= 1 & qs >= 0))
  ffvs <- ffsort(ffv, ...)            # sort the whole vector on disk
  j <- qs * (length(ffv) - 1) + 1     # fractional ranks of the quantiles
  jf <- floor(j); jc <- ceiling(j)
  # average the values at the floor and ceiling ranks
  rowSums(matrix(ffvs[c(jf, jc)], length(qs), 2)) / 2
}

This is an exact algorithm, so it relies on sorting the full vector -- and thus may take a lot of time.
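For instance, a hypothetical usage sketch (the vector x is made up for illustration; depending on your setup, ffsort() comes from ff itself or from the ffbase extension package):

library(ff)

x <- ff(rnorm(1e6))   # a million values, stored on disk rather than in RAM
ffquantile(x)         # min, quartiles, and max of x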

Solution 2

The problem is that you can't load all the data into memory. So you could sample the data, as indicated earlier by @Marek. On such huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but sampling should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.

If you can't subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame in text format when it's getting too big. If you play a bit with the connection, you could easily convert this to a socketConnection or another connection type to read the data from a server, a database, whatever. Just make sure you open the connection in the correct mode.

So, given a simple .csv file, the following function samples a fraction p of the data:

sample.df <- function(f, n = 10000, split = ",", p = 0.1){
    con <- file(f, open = "rt")
    on.exit(close(con))
    y <- data.frame()
    # read the header, skipping any leading empty lines
    x <- character(0)
    while(length(x) == 0){
      x <- strsplit(readLines(con, n = 1), split)[[1]]
    }
    Names <- x
    # read n lines at a time and keep a random fraction p of each chunk
    repeat{
      x <- tryCatch(read.table(con, nrows = n, sep = split),
                    error = function(e) NULL)
      if(is.null(x)) break
      names(x) <- Names
      nn <- nrow(x)
      id <- sample(nn, round(nn * p))
      y <- rbind(y, x[id, ])
    }
    rownames(y) <- NULL
    y
}

An example of its usage:

# Make a file
Df <- data.frame(
  X1 = 1:10000,
  X2 = 1:10000,
  X3 = rep(letters[1:10], 1000)
)
write.csv(Df, file = "test.txt", row.names = FALSE, quote = FALSE)

# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)

#clean up
unlink("test.txt")

Solution 3

You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large datasets it can be useful to store the data in a database and pull only pieces into R. The databases can also do the sorting for you, and computing quantiles on sorted data is much simpler (then just use the quantiles to do the plots).
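For instance, a minimal sketch assuming RSQLite is installed; the database file big.sqlite, the table big, and its numeric column value are hypothetical names:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "big.sqlite")
# let the database do the sorting, then fetch the sorted column in chunks
res <- dbSendQuery(con, "SELECT value FROM big ORDER BY value")
while(!dbHasCompleted(res)){
  chunk <- dbFetch(res, n = 10000)  # process 10,000 sorted rows at a time
  # ...accumulate whatever summaries you need from chunk$value...
}
dbClearResult(res)
dbDisconnect(con)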

There is also the hexbin package (Bioconductor) for doing scatterplot equivalents with very large datasets (you probably still want to use a sample of the data, but it works with a large sample).
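A minimal sketch, assuming hexbin is installed (x and y stand in for a large sample of your two variables):

library(hexbin)

x <- rnorm(1e6)
y <- x + rnorm(1e6)
plot(hexbin(x, y, xbins = 50))  # plots counts per hexagonal cell, not 10^6 points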

Solution 4

You could put the data into a database and calculate the quantiles using SQL. See: http://forge.mysql.com/tools/tool.php?id=149
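In the same spirit, a minimal sketch using SQLite from R, reusing the hypothetical big table from the previous solution; SQLite has no built-in quantile function, so ORDER BY with OFFSET picks out the row at the median position:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "big.sqlite")
N <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM big")$n
med <- dbGetQuery(con, sprintf(
  "SELECT value FROM big ORDER BY value LIMIT 1 OFFSET %d",
  floor((N - 1) / 2)))$value
dbDisconnect(con)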

Solution 5

All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), all of which are easily precomputed. Take a look at the boxplot.stats function.
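A minimal sketch: compute the statistics once (here from a stand-in vector, but they could equally come from a sample or a database) and hand them to bxp(), which draws a boxplot from precomputed summaries rather than raw data:

x <- rnorm(1e5)  # stand-in for your (sampled) data
st <- boxplot.stats(x)
bxp(list(stats = matrix(st$stats, ncol = 1),
         n = st$n,
         conf = matrix(st$conf, ncol = 1),
         out = st$out,
         group = rep(1, length(st$out)),
         names = "x"))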


Comments

  • Daniel Arndt:

    How can I plot a very large data set in R?

    I'd like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and calculate the summaries needed to make these plots? If so how?