Skip all leading empty lines in read.csv

Solution 1

read.csv automatically skips blank lines (unless you set blank.lines.skip=FALSE). See ?read.csv
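
As a quick check of that default, here is a minimal sketch that reads from an inline string rather than a file. The sample text is made up for illustration.

```r
# Two truly empty lines precede the header; the sample text is illustrative only.
txt <- "\n\nw,x,y,z\na,b,5,c\na,b,4,c\n"

# With the default blank.lines.skip = TRUE the empty lines are ignored,
# so the first non-empty line supplies the column names.
df <- read.csv(text = txt)
print(names(df))
print(nrow(df))
```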

After I wrote the above, the poster explained that the leading lines are not actually empty: they contain commas with nothing between them. In that case, use fread from the data.table package, which handles this. Its skip= argument can be set to a character string, in which case reading starts at the first line containing that string, so any string found in the header will do:

library(data.table)
DT <- fread("myfile.csv", skip = "w") # assuming w is in the header
DF <- as.data.frame(DT)

The last line can be omitted if a data.table is ok as the returned value.

Solution 2

Depending on your file size, this may not be the best solution, but it will do the job.

The strategy here is to read the file as lines rather than as delimited fields, count the characters on each line, and store the counts in temp. A while loop then finds the first line with a non-zero character count, and the file is read from that line and stored as data_filename.

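flist = list.files()
for (onefile in flist) {
  # number of characters on each line; 0 means the line is truly empty
  temp = nchar(readLines(onefile))
  i = 1
  # advance to the first line that has any characters (stop at end of file)
  while (i <= length(temp) && temp[i] == 0) {
    i = i + 1
  }
  # skip the i - 1 empty lines that precede that line
  temp = read.table(onefile, sep = ",", skip = (i - 1))
  # store the result in a variable named data_<filename>
  assign(paste0("data_", onefile), temp)
}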

If the file contains headers, you can start i from 2.
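
The loop above only skips lines of length zero, so a row such as ,,, would not count as blank. One possible variation, sketched here on the assumption that a "blank" line contains nothing but commas and whitespace, is to test each line against a regular expression instead. The helper name read_after_blank is hypothetical, and the sample data is made up.

```r
# Hypothetical helper: skip leading lines that hold only commas and/or whitespace,
# then read the rest as CSV with the first remaining line as the header.
read_after_blank <- function(path) {
  lines <- readLines(path)
  # TRUE for lines containing at least one character other than a comma or whitespace
  has_content <- !grepl("^[,[:space:]]*$", lines)
  if (!any(has_content)) stop("no non-blank line found in ", path)
  first <- which(has_content)[1]
  read.csv(path, skip = first - 1)
}

# Usage on a temporary file shaped like the sample in the question
tmp <- tempfile(fileext = ".csv")
writeLines(c(",,,", "", ",,,", "w,x,y,z", "a,b,5,c", "a,b,4,c"), tmp)
df <- read_after_blank(tmp)
print(df)
```

This still reads the file twice (once as lines, once as a table), so the same caveat about file size applies.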

Solution 3

If the leading empty lines are truly empty, then read.csv should automatically skip to the first non-empty line. If they contain commas but no values, then you can use:

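# First pass: read the whole file, comma-only line(s) included
df = read.csv(file = 'd.csv')
# Second pass: skip up to the first row whose first column is non-empty
df = read.csv(file = 'd.csv', skip = as.numeric(rownames(df[which(df[,1] != ''),])[1]))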

It's not efficient if you have large files (since you have to import twice), but it works.
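
To see why the row index works as the skip count, here is a small sketch on a temporary file. The file contents are made up for illustration. It uses which() directly rather than going through rownames(), which should be equivalent here because the default row names are the row indices.

```r
# Temporary file with two comma-only lines before the header (illustrative data)
tmp <- tempfile(fileext = ".csv")
writeLines(c(",,,", ",,,", "w,x,y,z", "a,b,5,c", "a,b,4,c"), tmp)

# First pass: the first comma-only line is taken as the header,
# so the real header ends up as a data row.
first_pass <- read.csv(tmp)
# Index of the first row whose first column is not empty
n_skip <- which(first_pass[, 1] != "")[1]

# Second pass: skipping n_skip lines makes the real header the first line read.
df <- read.csv(tmp, skip = n_skip)
print(n_skip)
print(names(df))
```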

If you want to import a tab-delimited file with the same problem (variable blank lines) then use:

df = read.table(file = 'd.txt', sep = '\t', header = TRUE)
df = read.table(file = 'd.txt', sep = '\t', header = TRUE, skip = as.numeric(rownames(df[which(df[,1] != ''),])[1]))

Author: Alex

Updated on June 18, 2022

Comments

  • Alex (almost 2 years ago)

    I want to import CSV files into R, with the first non-empty line supplying the names of the data frame columns. I know that you can supply the skip argument to specify how many lines to skip before reading starts. However, the row number of the first non-empty line can change between files.

    How do I work out how many lines are empty, and dynamically skip them for each file?

    As pointed out in the comments, I need to clarify what "blank" means. My csv files look like:

    ,,,
    w,x,y,z
    a,b,5,c
    a,b,5,c
    a,b,5,c
    a,b,4,c
    a,b,4,c
    a,b,4,c
    

    which means there are rows of commas at the start.