How can I load specific columns of the .csv file to R?

r csv bigdata

17,714

Solution 1

If your table is very large, consider using the data.table package:

# create an example: 10,000 rows by 100 columns
df <- data.frame(matrix(rnorm(1e6),ncol=100))
write.csv(df,"sample.csv",row.names=F)

library(data.table)
dt <- fread("sample.csv",select=c(3,8,20))
head(dt)
#            X3         X8         X20
# 1:  0.5537762  1.0271272 -0.14437400
# 2: -0.4111327 -0.2297311 -1.04998490
# 3: -1.2540440  0.6977565 -0.21514021
# 4: -1.1500974 -0.3181102 -0.07910133
# 5: -0.6549245  1.8385510  0.73741980
# 6:  0.8049360  0.4722533 -0.65750679

This reads in just columns 3, 8, and 20 and is extremely fast.

Solution 2

Can you pre-process using awk before loading into R? If so, say you want columns 2,3 and 5, you can do:

awk '{print $2,$3,$5}' yourfile.csv > cols23and5.csv

If your CSV file is quoted (e.g. because some fields contain commas), and looks like this:

"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"
"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"

you can remove the double quotes and change the field separator from commas into, say colons, like this:

sed -e 's/","/:/g' -e 's/"//g' yourfile.csv > ColonSeparated.csv

so that your file becomes:

Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,
Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,

then you can process it with awk using the colon as a separator and without the embedded commas worrying you:

awk -F: '{print $2,FS,$3,FS,$4}' ColonSeparated.csv   > SmallFileForR.csv

17,714

Author by

P.Escondido

Updated on June 07, 2022

Comments

P.Escondido almost 2 years

I have a huge dataset stored as .csv file, which does not fit into memory. However, I only need 3 columns from the data set which would fit into memory. How can I load them?

UPD: Ho can I choose columns by names rather than by their column indices? I do not know their indices a priori
- Rahul Premraj about 10 years
  
  You could take a look at the link: stackoverflow.com/questions/5788117/… where I have posted another solution.
P.Escondido about 10 years

I have quoted strings in my file, therefore awk screws up the job :(
Mark Setchell about 10 years

I have updated my answer to remove double and single quotes - please have another look.
P.Escondido about 10 years

sorry, I meant that one of the columns is the string of the form "hello, world, hello". The separator is "," but it can also be a part of that string. Perhaps one can change the separator to something that never shows up - e.g. #$?
Mark Setchell about 10 years

I don't know if you can edit your question now it has been marked as a duplicate. If you can, it would help a lot if you could show 3-5 lines of your data and say which fields you want. If you can't, you are welcome to click on my profile and find my Email address there and send it to me - although it is just about bedtime in the UK, so I may only answer tomorrow.
Mark Setchell about 10 years

I have updated my answer to show how you can handle double-quoted fields in your CSV file.