Load a small random sample from a large CSV file into an R data frame

Solution 1

You can also just do it in the terminal with perl.

perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt

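This won't necessarily get you exactly 20,000 lines: each line is kept independently with probability 0.01, so you get roughly 1% of the total lines. To land near 20,000, set the threshold to 20000 divided by the number of lines in the file. It is, however, very fast, it reads the file one line at a time so the whole file never has to fit in memory, and you end up with both files in your directory. You can then load the smaller file into R however you want.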
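If the first line is a CSV header, plain random filtering will drop it about 99% of the time. A minimal sketch of one way to keep it, using awk in place of perl so the header and the sampling are handled in one pass; the demo data written to biglist.txt is made up purely so the commands run on their own:

```shell
# Build a small demo file so the sketch runs on its own (hypothetical data).
{ echo "id,value"; seq 1 10000 | awk '{ print $1 "," $1 * 2 }'; } > biglist.txt

# Always keep line 1 (the header); keep every other line with probability 0.01.
# srand() seeds the generator from the clock so each run draws a different sample.
awk 'BEGIN { srand() } NR == 1 || rand() < 0.01' biglist.txt > subset.txt

head -n 1 subset.txt   # first line is the header
wc -l < subset.txt     # header plus roughly 1% of the rows; varies per run
```

As with the perl version, the number of sampled rows is only approximately 1% of the input.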

Solution 2

Try this, based on examples 6e and 6f on the sqldf GitHub home page:

library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")

See ?read.csv.sql and use other arguments as needed, depending on the particulars of your file.

Solution 3

This should work:

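RowsInCSV <- 10000000  # Or however many data rows there are

# Each call reads one row at a random position. skip counts lines from the top,
# so skip >= 1 always skips the header. Rows are drawn independently, so duplicates are possible.
List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1), header = FALSE))
DF <- do.call(rbind, List)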
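The same idea (pick the row numbers first, then pull out only those rows) can also be done in the terminal in a single pass over the file, which avoids reopening it 20,000 times. This is only a sketch under assumptions: it relies on shuf from GNU coreutils being installed, the demo contents of YourFile.csv are made up, and picked.txt and sample.csv are hypothetical file names.

```shell
# Demo file so the sketch runs on its own (hypothetical data): header + 100000 rows.
{ echo "id,value"; seq 1 100000 | awk '{ print $1 "," $1 * 2 }'; } > YourFile.csv

rows=$(( $(wc -l < YourFile.csv) - 1 ))   # data rows, header excluded
k=20000

# Draw k distinct line numbers from 2..rows+1 (line 1 is the header).
shuf -i 2-$(( rows + 1 )) -n "$k" > picked.txt

# First file: remember the chosen line numbers. Second file: print the header
# and every line whose number was chosen.
awk 'NR == FNR { want[$1]; next } FNR == 1 || (FNR in want)' picked.txt YourFile.csv > sample.csv

wc -l < sample.csv   # k data rows plus the header
```

Because shuf draws the line numbers without repetition, no row appears twice, and awk only has to hold the k line numbers in memory. The resulting sample.csv can then be loaded with read.csv("sample.csv").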
Author by

P.Escondido

Updated on June 15, 2022

Comments

  • P.Escondido, almost 2 years ago:
    The csv file to be processed does not fit into the memory. How can one read ~20K random lines of it to do basic statistics on the selected data frame?
  • P.Escondido, about 10 years ago:
    is it as fast as via Perl?
  • Señor O, about 10 years ago:
    Doubt it. Takes about 6 seconds on my machine, so it doesn't really make a difference unless you have to do it all the time.
  • pomber, over 9 years ago:
    nice, any way to keep the csv header?
  • geotheory, about 9 years ago:
    @pomber you could first copy the header line (e.g. head -1 file.txt > sample.txt) and then run the perl operation with >> instead to append
  • Gregor Thomas, about 9 years ago:
    Not at all helpful if, as OP says, "The csv file to be processed does not fit into the memory".
  • Doon_Bogan, almost 9 years ago:
    Is there a way to do this with Python?
  • Hack-R, about 8 years ago:
    For Windows you'd need to change the ' to "
  • Conner M., almost 6 years ago:
    Tried this using a csv as the bigFile, but it copied the whole file.
  • Maxwell Chandler, over 5 years ago:
    I tried with windows and csv and it worked fine. Thanks!
  • pascal, over 3 years ago:
    could it be that the arguments in the sample function are inverted? sample(RowsInCSV, 1)? Furthermore, I think a bracket in the end of the lapply command is missing.