skip some rows in read.csv in R


Solution 1

This is possible with the sqldf package, using read.csv.sql.

Let's say the contents of sample.csv look like this:

id,name,age
1,"a",23
2,"b",24
3,"c",23

Now to read only rows where age=23:

require(sqldf)

df <- read.csv.sql("sample.csv", "select * from file where age=23")

df
  id name age
1  1  "a"  23
2  3  "c"  23

You can also select only the columns you need:

df <- read.csv.sql("sample.csv", "select id, name from file where age=23")
df
  id name
1  1  "a"
2  3  "c"
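Applied to the file from the question, the same approach can drop the unwanted tickers at read time. This is only a sketch: it assumes pf.csv has no header row, so SQLite sees the default column names V1..V9, with the ticker in the fourth column (V4):

require(sqldf)

# no header row, so columns are auto-named V1..V9; the ticker is V4
df <- read.csv.sql("pf.csv", header = FALSE,
                   sql = "select * from file where V4 not in ('ADCT', 'ABT')")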

Solution 2

It is better to read the whole file and subset afterwards, as suggested in the comments:

csvData[!csvData$ticker %in% c('ADCT', 'ABT'), ]
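Combined with the read.csv call from the question, a minimal sketch (header = FALSE is an assumption here, since the sample data has no header row):

# read everything, keeping only columns 1, 2, 4 and 6 via colClasses
csvData <- read.csv(file = "pf.csv", header = FALSE,
                    colClasses = c(NA, NA, "NULL", NA, "NULL", NA,
                                   "NULL", "NULL", "NULL"))
dimnames(csvData)[[2]] <- c("portfolio", "date", "ticker", "quantity")

# then drop the rows whose ticker is ADCT or ABT
csvData <- csvData[!csvData$ticker %in% c('ADCT', 'ABT'), ]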

EDIT

You can use fread from the data.table package as a more efficient way to read your file.

library(data.table)
fread(file="pf.csv")
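For example, a sketch that reads quickly and then filters (fread auto-names headerless columns V1..V9, so the ticker lands in V4):

library(data.table)

# fast read, then drop unwanted tickers with data.table's bracket syntax
DT <- fread(file = "pf.csv", header = FALSE)
DT <- DT[!V4 %in% c('ADCT', 'ABT')]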

Comments

  • Dinoop Nair over 1 year

    I have a csv file which I read using the following function:

    csvData <- read.csv(file="pf.csv", colClasses=c(NA, NA,"NULL",NA,"NULL",NA,"NULL","NULL","NULL"))
    dimnames(csvData)[[2]]<- c("portfolio", "date", "ticker", "quantity")
    

    It reads all lines from that file. But I want to skip some rows: a row should not be read if the value of the ticker column is ABT or ADCT. Is that possible?

    A sample of my CSV file is as follows:

    RUS1000,01/29/1999,21st Centy Ins Group,TW.Z,90130N10,72096,1527.534,0.01,21.188
    RUS1000,01/29/1999,3com Corp,COMS,88553510,358764,16861.908,0.16,47.000
    RUS1000,01/29/1999,3m Co,MMM,88579Y10,401346,31154.482,0.29,77.625
    RUS1000,01/29/1999,A D C Telecommunicat,ADCT,00088630,135114,5379.226,0.05,39.813
    RUS1000,01/29/1999,Abbott Labs,ABT,00282410,1517621,70474.523,0.66,46.438
    RUS1000,02/26/1999,21st Centy Ins Group,TW.Z,90130N10,72096,1378.836,0.01,19.125
    RUS1000,02/26/1999,3com Corp,COMS,88553510,358764,11278.644,0.11,31.438
    RUS1000,02/26/1999,3m Co,MMM,88579Y10,402146,29783.938,0.29,74.063 
    
    • Roman Luštrik about 11 years
      Use readLines and regular expressions to filter out undesirable rows; a sketch of this approach appears after these comments.
    • topchef about 11 years
      before using R, run: sed -E '/,(ADCT|ABT),/d' pf.csv > newfile
    • A5C1D2H2I1M1N2O1R2T1
      Why not read in the whole file and subset later?
  • Roman Luštrik about 11 years
    Assuming the file was too big to read into memory, what would be the OP's choices then?
  • Dinoop Nair about 11 years
    Actually the file is 200 MB+ and the majority of the data contains these values, so I think that is not efficient.
  • CHP about 11 years
    Can you elaborate on the answer? That will make it more likely to gain attention.
  • Nishanth about 11 years
    I have updated my answer with a simple, self-contained example.
  • G. Grothendieck about 8 years
    read.csv.sql is based on SQLite facilities, not on read.table, so one can't really expect it to work exactly the same; however, the filter= and sql= arguments of read.csv.sql can be used for arbitrary preprocessing, so depending on what your data looks like, arbitrary missing-value situations can usually be handled. For example, filter = "sed -e s/NULL//g" would remove all occurrences of the string NULL. (On Windows it will assume you have Rtools installed and will get sed from there.)
  • Chris about 8 years
    I tried swapping NULLs with empty strings, but they weren't interpreted as NA. Your trick, AFAIK, would not make it treat the values as NA.
  • G. Grothendieck about 8 years
    If it's a numeric field, say, then they will be treated as 0 and you can replace them on the R end. Or replace NULL with -99, say, and replace them on the R end.
  • Chris about 8 years
    So it's a bit of a hack: I'd have to choose a magic number and hope it doesn't exist, and 0, -1, and -99 are far too common. My function is speedy, especially with larger values of n.
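For reference, a minimal sketch of the readLines-plus-regular-expressions approach suggested in the comments above, assuming the pf.csv layout shown in the question:

# read raw lines, drop those matching the unwanted tickers, then parse the rest
allLines <- readLines("pf.csv")
keep <- !grepl(",(ADCT|ABT),", allLines)
csvData <- read.csv(text = allLines[keep], header = FALSE)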