How can I load specific columns of the .csv file to R?
Solution 1
If your table is very large, consider using the data.table
package:
# create an example: 10,000 rows by 100 columns
df <- data.frame(matrix(rnorm(1e6),ncol=100))
write.csv(df,"sample.csv",row.names=F)
library(data.table)
dt <- fread("sample.csv",select=c(3,8,20))
head(dt)
# X3 X8 X20
# 1: 0.5537762 1.0271272 -0.14437400
# 2: -0.4111327 -0.2297311 -1.04998490
# 3: -1.2540440 0.6977565 -0.21514021
# 4: -1.1500974 -0.3181102 -0.07910133
# 5: -0.6549245 1.8385510 0.73741980
# 6: 0.8049360 0.4722533 -0.65750679
This reads in just columns 3, 8, and 20 and is extremely fast.
Solution 2
Can you pre-process using awk before loading into R? If so, say you want columns 2,3 and 5, you can do:
awk '{print $2,$3,$5}' yourfile.csv > cols23and5.csv
If your CSV file is quoted (e.g. because some fields contain commas), and looks like this:
"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"
"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"
you can remove the double quotes and change the field separator from commas into, say colons, like this:
sed -e 's/","/:/g' -e 's/"//g' yourfile.csv > ColonSeparated.csv
so that your file becomes:
Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,
Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,
then you can process it with awk
using the colon as a separator and without the embedded commas worrying you:
awk -F: '{print $2,FS,$3,FS,$4}' ColonSeparated.csv > SmallFileForR.csv
P.Escondido
Updated on June 07, 2022Comments
-
P.Escondido almost 2 years
I have a huge dataset stored as .csv file, which does not fit into memory. However, I only need 3 columns from the data set which would fit into memory. How can I load them?
UPD: Ho can I choose columns by names rather than by their column indices? I do not know their indices a priori
-
Rahul Premraj about 10 yearsYou could take a look at the link: stackoverflow.com/questions/5788117/… where I have posted another solution.
-
-
P.Escondido about 10 yearsI have quoted strings in my file, therefore awk screws up the job :(
-
Mark Setchell about 10 yearsI have updated my answer to remove double and single quotes - please have another look.
-
P.Escondido about 10 yearssorry, I meant that one of the columns is the string of the form "hello, world, hello". The separator is "," but it can also be a part of that string. Perhaps one can change the separator to something that never shows up - e.g. #$?
-
Mark Setchell about 10 yearsI don't know if you can edit your question now it has been marked as a duplicate. If you can, it would help a lot if you could show 3-5 lines of your data and say which fields you want. If you can't, you are welcome to click on my profile and find my Email address there and send it to me - although it is just about bedtime in the UK, so I may only answer tomorrow.
-
Mark Setchell about 10 yearsI have updated my answer to show how you can handle double-quoted fields in your CSV file.