Load a small random sample from a large csv file into R data frame
Solution 1
You can also just do it in the terminal with perl.
perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt
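This won't necessarily get you exactly 20,000 lines: each line is kept independently with probability .01, so you get roughly 1% of the total. To land near 20,000, set the threshold to about 20000 divided by the number of lines in the file. It is very fast, though, because perl streams the file line by line, and you end up with both files in your directory. You can then load the smaller file into R however you want.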
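The one-liner above usually drops the header row and returns a varying number of lines. As a rough sketch of two variations: the first keeps line 1 unconditionally using Perl's `$.` line counter, and the second draws an exact number of data rows with `shuf`, which is assumed to be installed (it ships with GNU coreutils, not with every system). The demo file contents, the row counts, and the name `exact.txt` are made up for illustration.

```shell
# Hypothetical demo input: one header line plus 10,000 data rows.
{ echo "id,value"; seq 1 10000 | sed 's/$/,x/'; } > biglist.txt

# Variation 1: always keep line 1 (the header) and keep each other line
# with probability 0.01. The number of sampled rows still varies per run.
perl -ne 'print if ($. == 1 || rand() < .01)' biglist.txt > subset.txt

# Variation 2: exactly 200 data rows (assumes GNU shuf is available).
# Copy the header first, then append a random draw from the remaining lines.
head -n 1 biglist.txt > exact.txt
tail -n +2 biglist.txt | shuf -n 200 >> exact.txt

wc -l < exact.txt   # 201: the header plus 200 sampled rows
```

For the real file, replace 200 with 20000. Note that the rows in `exact.txt` come out in shuffled order rather than in file order.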
Solution 2
Try this, based on examples 6e and 6f on the sqldf GitHub home page:
library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")
See ?read.csv.sql and use other arguments as needed, based on the particulars of your file.
Solution 3
This should work:
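RowsInCSV <- 10000000  # number of data rows, assuming the file also has one header line
# Each call skips a random number of lines (1..RowsInCSV) and reads the single row that follows
List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1), header = FALSE))
DF <- do.call(rbind, List)  # stack the one-row data frames; rows can repeat across draws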
Author: P.Escondido
Updated on June 15, 2022
Comments
-
P.Escondido almost 2 years
The CSV file to be processed does not fit into memory. How can one read ~20K random lines of it to do basic statistics on the selected data frame?
-
P.Escondido about 10 years
Is it as fast as via Perl?
-
Señor O about 10 years
Doubt it. Takes about 6 seconds on my machine, so it doesn't really make a difference unless you have to do it all the time.
-
pomber over 9 years
Nice, any way to keep the csv header?
-
geotheory about 9 years
@pomber you could first copy the header line (e.g. head -1 file.txt > sample.txt) and then run the perl operation with >> instead, to append.
-
Gregor Thomas about 9 years
Not at all helpful if, as OP says, "The csv file to be processed does not fit into the memory".
-
Doon_Bogan almost 9 years
Is there a way to do this with Python?
-
Hack-R about 8 years
For Windows you'd need to change the ' to ".
-
Conner M. almost 6 years
Tried this using a csv as the bigFile, but it copied the whole file.
-
Maxwell Chandler over 5 years
I tried with Windows and csv and it worked fine. Thanks!
-
pascal over 3 years
Could it be that the arguments in the sample function are inverted, i.e. it should be sample(RowsInCSV, 1)? Furthermore, I think a closing parenthesis is missing at the end of the lapply command.