Loop function and add columns to dataframe in R

Solution 1

Try this. To clear up any confusion, here is a complete example:

dat1=as.data.frame(matrix(rnorm(25),ncol=5))
dat5=as.data.frame(matrix(rnorm(25),ncol=5))
dat7=as.data.frame(matrix(rnorm(25),ncol=5))

my_fun <-  function(dataframe){
rowMeans( dataframe[ , c("V1","V2")],na.rm=TRUE) 
}

dfList<-list(dat1,dat5,dat7)

Vars <- grep("^dat[0-9]+$", ls(), value=TRUE) # anchored pattern, so unrelated objects with "dat" in their name are not matched
Vars
 #[1] "dat1" "dat5" "dat7"

res <- lapply(dfList, function(x) transform(x,V6=my_fun(x)))
for(i in seq_along(Vars)){
  assign(Vars[i], res[[i]],envir=.GlobalEnv)
}
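
The same write-back can be sketched without the explicit loop by keeping the data frames in a named list. This is only a sketch, not part of the original answer: it assumes the data frames are called dat1, dat5 and dat7 as above, the seed is arbitrary, and `env` is a helper variable introduced here for illustration. `mget()` and `list2env()` are base R.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
dat1 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat5 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat7 <- as.data.frame(matrix(rnorm(25), ncol = 5))

my_fun <- function(dataframe) {
  rowMeans(dataframe[, c("V1", "V2")], na.rm = TRUE)
}

env  <- environment()              # the global environment when run at top level
Vars <- c("dat1", "dat5", "dat7")

# mget() returns a named list, so each name travels with its data frame
dfList <- mget(Vars, envir = env)
res    <- lapply(dfList, function(x) transform(x, V6 = my_fun(x)))

# Overwrite dat1, dat5 and dat7 with the updated copies
invisible(list2env(res, envir = env))

ncol(dat1)
```

Because the list is named, the order of `Vars` and the order of the list cannot drift apart, which is the main risk of the `assign()` loop.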

A Second function:

my_funSD <-  function(dataframe){
apply( dataframe[ , c("V1","V2")],1,sd) 
}

dfList<-list(dat1,dat5,dat7)
res1 <- lapply(dfList, function(x) transform(x,V7=my_funSD(x)))
ncol(dat1)
 #[1] 6

for(i in seq_along(Vars)){
  assign(Vars[i], res1[[i]],envir=.GlobalEnv)
}
ncol(dat1)
#[1] 7
colnames(dat1)
#[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7"
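
Note that the second step only yields seven columns because dfList was rebuilt after the first write-back. As a sketch of an alternative (not from the original answer, and with an arbitrary seed), the second lapply can be chained directly onto the result of the first, so nothing has to be written back in between:

```r
set.seed(2)  # arbitrary seed, for reproducibility only
dfList <- list(
  as.data.frame(matrix(rnorm(25), ncol = 5)),
  as.data.frame(matrix(rnorm(25), ncol = 5)),
  as.data.frame(matrix(rnorm(25), ncol = 5))
)

my_fun <- function(dataframe) {
  rowMeans(dataframe[, c("V1", "V2")], na.rm = TRUE)
}

my_funSD <- function(dataframe) {
  apply(dataframe[, c("V1", "V2")], 1, sd)  # row-wise standard deviation
}

# Step 1 adds V6; step 2 works on the output of step 1 and adds V7
res  <- lapply(dfList, function(x) transform(x, V6 = my_fun(x)))
res1 <- lapply(res,    function(x) transform(x, V7 = my_funSD(x)))

colnames(res1[[1]])
```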

Solution 2

It's much easier, and typically much faster on large data, if you use data.table, because := adds the column by reference instead of copying the whole table:

require(data.table)
set.seed(612)
dat1=as.data.table(matrix(rnorm(25),ncol=5))
dat2=as.data.table(matrix(rnorm(25),ncol=5))
dat3=as.data.table(matrix(rnorm(25),ncol=5))

dtList <- list(dat1, dat2, dat3)

for (dat in dtList) {
     dat[,V6:=(V1 + V2)/2]
}

This gives the following output:

> dtList

[[1]]
           V1         V2         V3         V4          V5          V6
1:  0.3903228 -1.1581608  1.0171311  0.3866628  0.02756137 -0.38391897
2: -0.6030124  0.4713771 -2.4204376 -0.2843527  0.53463600 -0.06581764
3: -0.9850333  0.3343518 -1.2329712 -1.1767533  0.56714483 -0.32534080
4: -0.1591335 -0.6729444  0.5062648 -0.3001857 -0.84896068 -0.41603897
5:  1.7127203  0.3149884  1.7633945  1.7824786 -0.90316850  1.01385434

[[2]]
            V1         V2         V3          V4         V5         V6
1: -1.22790810  0.8429506  0.4921844 -0.29686607 -0.9501956 -0.1924788
2:  0.09405923 -1.6970403  0.1280003  1.22284944  0.8667643 -0.8014905
3:  0.55298783 -0.1081849  0.4120268 -0.56411756  1.9135802  0.2224015
4: -0.82621808  0.4753731  0.4755664 -0.05885804  0.9658787 -0.1754225
5:  0.44262554  0.3036363 -1.7404580  0.88870595  1.4826431  0.3731309

[[3]]
            V1          V2         V3        V4          V5          V6
1:  0.82085834  0.07221027  1.8835042 0.2563714  0.27891033  0.44653430
2:  0.00445113  1.89450534  0.3878858 1.8385587 -1.86381524  0.94947824
3:  0.66458950 -1.31023362 -0.9403257 1.2128128  0.74922668 -0.32282206
4: -1.40169143 -1.52925147  0.8232823 0.3391147  0.33463875 -1.46547145
5:  1.10566340 -1.16512217  0.3859652 0.8123110  0.04712086 -0.02972939

Solution 3

You could adjust your function so that it assembles the data frame before returning the result. Then you can use lapply with the other data frames.

> add.column <-  function(dataframe){
      dataframe$Mean <- rowMeans(dataframe[, c("V1","V2")], na.rm = TRUE) 
      dataframe
  }
> lapply(list(dat1, dat2, dat3), add.column)
[[1]]
          V1          V2          V3          V4         V5       Mean
1  0.3353508  0.01120530  1.24966428 -0.11351381 -0.1004569  0.1732780
2 -0.4815519 -0.06247284 -0.02295877  0.87322960 -0.5307487 -0.2720124
3  0.7600038  1.85289546  0.36312147 -0.02831645 -1.6644752  1.3064496
4  0.1679056  1.27509579  0.80371659 -0.85136078 -0.1014557  0.7215007
5 -1.7393888  0.72481368  0.06583188 -0.61422707  1.7497541 -0.5072876

[[2]]
          V1         V2         V3          V4         V5        Mean
1 -1.5753298 -0.6997616 -0.4343483  0.04134617 -2.3060171 -1.13754567
2  0.1871898 -0.2132065 -0.1839468 -0.29277828  1.6798379 -0.01300837
3 -0.1530399 -1.0674358 -0.2897129 -0.49723256 -1.0571164 -0.61023786
4 -1.5099567  0.4665531 -0.7433692  1.71985447 -0.3236015 -0.52170179
5 -0.6545995 -0.5638415 -0.5925143 -0.10060696  0.1332452 -0.60922051

[[3]]
          V1          V2          V3         V4         V5        Mean
1  0.1716287 -0.09367833  0.12334527 -0.0061906  1.4477591  0.03897519
2 -1.1550710  0.26045021  1.80286581  0.4236090 -0.0848587 -0.44731037
3  0.3735215 -0.70953979 -0.08194104  0.7470116  0.6457781 -0.16800913
4  2.0178667 -0.90537252  1.51190214 -1.0803367  0.7308533  0.55624709
5  0.4143222  0.89980628 -1.87445683 -1.8482057  0.1797795  0.65706422

Also, you might want to add a subset argument in case you want to take the mean of other subsets later. Something like this:

> add.column <-  function(data, subset)
  {
      data$Mean <- rowMeans(data[, subset], na.rm = TRUE) 
      data
  }
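
A sketch of how the subset version might be called with lapply. The list name `dfs` and the seed are assumptions made for illustration, and `drop = FALSE` is an addition not in the function above: without it, a single-column subset collapses to a plain vector and rowMeans fails.

```r
set.seed(3)  # arbitrary seed, for reproducibility only
dfs <- list(
  dat1 = as.data.frame(matrix(rnorm(25), ncol = 5)),
  dat2 = as.data.frame(matrix(rnorm(25), ncol = 5)),
  dat3 = as.data.frame(matrix(rnorm(25), ncol = 5))
)

add_column <- function(data, subset) {
  # drop = FALSE keeps a data frame even when `subset` names one column
  data$Mean <- rowMeans(data[, subset, drop = FALSE], na.rm = TRUE)
  data
}

# Extra arguments to lapply are passed on to add_column
res_a <- lapply(dfs, add_column, subset = c("V1", "V2"))
res_b <- lapply(dfs, add_column, subset = c("V3", "V4", "V5"))
res_c <- lapply(dfs, add_column, subset = "V1")  # single column still works

head(res_b$dat2)
```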

Solution 4

You could use Map to pair each element of dfList with the corresponding result of lapply (instead of sapply) and bind the new column on with cbind:

temp <- Map(cbind, dfList, V6 = lapply(dfList, my_fun)) # Add V6 column to `dfList`

datasnames <- c("dat1","dat2","dat3") # write here the names of your data sets (in the same order they were specified in `dfList`)

names(temp) <- datasnames
list2env(temp, envir = .GlobalEnv)

Edit (per the OP's additional request in the comments)

There are many ways to do this; one is to modify your function so that it returns both columns:

my_fun <-  function(dataframe){
  V6 <- rowMeans( dataframe[ , c("V1","V2")],na.rm=TRUE)
  V7 <- sd(dataframe[, "V1"],na.rm=TRUE) # one value per data frame; cbind recycles it down the rows
  return(cbind(V6, V7))
}

temp <- Map(cbind, dfList, lapply(dfList, my_fun)) # Add V6 and V7 columns to `dfList`
names(temp) <- datasnames
list2env(temp, envir = .GlobalEnv)
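
To see what the modified function contributes, the sketch below inspects the intermediate results instead of writing them to the global environment. The seed and the name `extra` are assumptions for illustration. It shows that my_fun returns a two-column matrix, and that V7 is the same value in every row of a given data frame, because sd() of one column is a single number.

```r
set.seed(4)  # arbitrary seed, for reproducibility only
dfList <- list(
  as.data.frame(matrix(rnorm(25), ncol = 5)),
  as.data.frame(matrix(rnorm(25), ncol = 5)),
  as.data.frame(matrix(rnorm(25), ncol = 5))
)
datasnames <- c("dat1", "dat2", "dat3")

my_fun <- function(dataframe) {
  V6 <- rowMeans(dataframe[, c("V1", "V2")], na.rm = TRUE)
  V7 <- sd(dataframe[, "V1"], na.rm = TRUE)  # a single number
  cbind(V6, V7)                              # V7 is recycled to every row
}

extra <- lapply(dfList, my_fun)
dim(extra[[1]])                  # 5 rows, 2 columns

# Map calls cbind(dfList[[i]], extra[[i]]) for each i
temp <- Map(cbind, dfList, extra)
names(temp) <- datasnames

names(temp$dat1)
length(unique(temp$dat1$V7))     # 1: V7 is constant within a data frame
```

If a row-wise standard deviation is wanted instead, apply(..., 1, sd) over the chosen columns would be the counterpart of rowMeans.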
Author by

user2351480

Updated on August 22, 2022

Comments

  • user2351480, over 1 year ago

    I want to loop a function over dataframes and add information from the function to each dataframe. I have read many relevant posts on looping and applying functions to dataframes, which have got me close to what I need but not quite there, so I'm hoping people can help me.

    I have a number of dataframes that look like this

    dat1=as.data.frame(matrix(rnorm(25),ncol=5))
    dat2=as.data.frame(matrix(rnorm(25),ncol=5))
    dat3=as.data.frame(matrix(rnorm(25),ncol=5))
    

    I want to calculate the mean of some of these columns and add it to the end of the dataframe. So, I wrote a function to calculate row means from a dataframe:

    my_fun <-  function(dataframe){
    rowMeans( dataframe[ , c("V1","V2")],na.rm=TRUE) 
    }
    

    To apply this function to one dataframe, I do this:

     dat1$V6<- my_fun(dat1)
    

    But I want to be able to loop through all dataframes and add this mean column on to the end of each df.

    After reading some helpful posts, I created a list and used sapply:

    dfList<-list(dat1,dat2,dat3)#create list
    sapply(dfList, my_fun) #apply function to list
    

    This gives me the values that I want, but I don't want them in a separate output. I'd like them simply added as a column to the original dataframe, as happens when I apply the function to an individual dataframe. Can anyone tell me how to do this, or point me to a post that describes how? (I have searched high and low, but may be typing the wrong keywords.) I'm sure it's very straightforward if you know how!