How to loop through a folder of CSV files in R


Solution 1

My favourite way to do this is ldply from the plyr package. It returns a data frame directly, so you don't need a separate rbind step afterwards:

library( plyr )
babynames <- ldply( .data = list.files(pattern="\\.txt$"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count") )

As an added benefit, you can parallelise the import very easily, which can make importing large multi-file datasets quite a bit faster (doMC relies on forking, so this works on Linux and macOS but not on Windows):

library( plyr )
library( doMC )
registerDoMC( cores = 4 )
babynames <- ldply( .data = list.files(pattern="\\.txt$"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count"),
                    .parallel = TRUE )

To include a Year column in the resulting data frame, write a small reader function first, then pass that function to ldply in the same way you would pass read.csv:

readFun <- function( filename ) {

    # read in the data
    data <- read.csv( filename, 
                      header = FALSE, 
                      col.names = c( "Name", "Gender", "Count" ) )

    # add a "Year" column by removing both "yob" and ".txt" from file name
    data$Year <- gsub( "yob|\\.txt", "", filename )

    return( data )
}

# execute that function across all files, outputting a data frame
doMC::registerDoMC( cores = 4 )
babynames <- plyr::ldply( .data = list.files(pattern="\\.txt$"),
                          .fun = readFun,
                          .parallel = TRUE )

This gives you the data in a concise, tidy (long) format, which is how I'd recommend moving forward from here. While it is possible to then separate each year's data into its own column, that's likely not the best way to go.
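
For completeness, here is a minimal sketch of what the wide layout would look like, using base R's reshape() on a small made-up data frame. The names and counts below are purely illustrative and are not taken from the real files:

```r
# Toy long-format data (made-up values), shaped like the ldply() result
long <- data.frame(
    Name   = c("Mary", "John", "Mary", "John"),
    Gender = c("F", "M", "F", "M"),
    Count  = c(10L, 8L, 12L, 7L),
    Year   = c("1980", "1980", "1981", "1981"),
    stringsAsFactors = FALSE
)

# One row per Name/Gender pair, one Count column per year
wide <- reshape( long,
                 idvar     = c("Name", "Gender"),
                 timevar   = "Year",
                 direction = "wide" )

names( wide )
```

The wide form grows a new column for every year and fills in NA wherever a name is missing in a given year, which is part of why the long format tends to be easier to work with.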

Note: gsub returns character strings, so depending on your preference it may be a good idea to convert the Year column to, say, integer class. But that's up to you.
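
A minimal sketch of that conversion, using a made-up stand-in for the combined data frame (the values are illustrative only):

```r
# Toy stand-in for the combined data frame (made-up values)
babynames <- data.frame(
    Name  = c("Mary", "John"),
    Count = c(10L, 8L),
    Year  = c("1980", "1981"),
    stringsAsFactors = FALSE
)

# Year starts out as character because it was built from the file name
class( babynames$Year )

# Convert to integer so numeric comparisons and sorting behave as expected
babynames$Year <- as.integer( babynames$Year )
class( babynames$Year )
```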

Solution 2

Using purrr

library(tidyverse)

files <- list.files(path = "./data/", pattern = "\\.csv$")

df <- files %>% 
    map(function(x) {
        read.csv(paste0("./data/", x))
    }) %>%
    reduce(rbind)

Solution 3

Consider an anonymous function within an lapply():

files <- list.files(pattern="\\.txt$")

dfList <- lapply(files, function(i) {
     df <- read.csv(i, header=FALSE, col.names=c("Name", "Gender", "Count"))
     df$Year <- gsub("yob|\\.txt", "", i)   # strip both the prefix and the extension
     return(df)
})

finaldf <- do.call(rbind, dfList)

Solution 4

A for loop might be more appropriate than lapply in this case.

file_list <- list.files(pattern="\\.txt$")
data_list <- vector("list", length = length(file_list))

for (i in seq_along(file_list)) {
    filename = file_list[[i]]

    # Read data in
    df <- read.csv(filename, header = FALSE, col.names = c("Name", "Gender", "Count"))

    # Extract year from filename
    year <- gsub("yob|\\.txt", "", filename)
    df[["Year"]] <- year

    # Add the data frame to data_list
    data_list[[i]] <- df
}

babynames <- do.call(rbind, data_list)
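
To try this kind of loop without the real data, here is a self-contained sketch that first writes two small files to a temporary directory. The file contents and the yob_demo directory name are made up; only the yobYYYY.txt naming follows the question. It also assumes you want Year stored as an integer.

```r
# Create a temporary directory holding two made-up input files
tmp <- file.path(tempdir(), "yob_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("Mary,F,10", "John,M,8"), file.path(tmp, "yob1980.txt"))
writeLines(c("Mary,F,12", "John,M,7"), file.path(tmp, "yob1981.txt"))

# full.names = TRUE returns full paths, so the working directory doesn't matter
file_list <- list.files(tmp, pattern = "^yob[0-9]+\\.txt$", full.names = TRUE)
data_list <- vector("list", length = length(file_list))

for (i in seq_along(file_list)) {
    df <- read.csv(file_list[[i]], header = FALSE,
                   col.names = c("Name", "Gender", "Count"))
    # basename() drops the directory before the year is extracted
    df[["Year"]] <- as.integer(gsub("yob|\\.txt", "", basename(file_list[[i]])))
    data_list[[i]] <- df
}

babynames <- do.call(rbind, data_list)
str(babynames)
```

Preallocating data_list and calling rbind once at the end avoids growing a data frame inside the loop, which gets slow as the number of files increases.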
Author by

krypticlol

Updated on June 22, 2022

Comments

  • krypticlol, almost 2 years ago

    I have a folder containing a bunch of CSV files that are titled "yob1980", "yob1981", "yob1982" etc.

    I have to use a for loop to go through each file and put its contents into a data frame - the columns in the data frame should be "1980", "1981", "1982" etc

    Here is what I have:

    file_list <- list.files()
    
    temp = list.files(pattern="*.txt")
    babynames <- do.call(rbind,lapply(temp,read.csv, FALSE))
    
    names(babynames) <- c("Name", "Gender", "Count")
    

    I feel like I need a for loop, but I'm not sure how to loop through the files. Anyone point me in the right direction?