Merge 2 dataframes by matching dates
Solution 1
Instead of using merge with data.table, you can also simply join as follows:
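If you want to run the examples yourself, the two data frames from the question can be reconstructed like this (a sketch built from the data shown in the question):

```r
# Reconstruction of the question's example data (a sketch)
df1 <- data.frame(
  id    = c(rep("MUM-1", 3), rep("MUM-2", 4)),
  dates = as.Date(c("2015-07-10", "2015-07-11", "2015-07-12",
                    "2014-01-14", "2014-01-15", "2014-01-16", "2014-01-17"))
)
df2 <- data.frame(
  id     = c("MUM-1", "MUM-1", "MUM-2", "MUM-2"),
  dates  = as.Date(c("2015-07-10", "2015-07-12",
                     "2014-01-14", "2014-01-17")),
  field1 = c(1, 2, 4, 0),
  field2 = c(0, 1, 3, 1)
)
```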
setDT(df1)
setDT(df2)
df2[df1, on = c('id','dates')]
This gives:
> df2[df1, on = c('id','dates')]
id dates field1 field2
1: MUM-1 2015-07-10 1 0
2: MUM-1 2015-07-11 NA NA
3: MUM-1 2015-07-12 2 1
4: MUM-2 2014-01-14 4 3
5: MUM-2 2014-01-15 NA NA
6: MUM-2 2014-01-16 NA NA
7: MUM-2 2014-01-17 0 1
Doing this with dplyr:
library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))
As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:
dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]
As can be seen, @Arun's approach is slightly faster:
library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
jaap = dt2[dt1, on = c('id','dates')],
pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
dplr = left_join(dt1, dt2, by=c("id","dates")),
arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on=c("id", "dates")])
test elapsed relative
4 arun 0.015 1.000
1 jaap 0.016 1.067
3 dplr 0.037 2.467
2 pavo 1.033 68.867
For a comparison on a large dataset, see the answer of @Arun.
Solution 2
I'd update df1 directly by reference as follows:
require(data.table) # v1.9.5+
setDT(df1)[df2, c("field1", "field2") :=
.(field1, field2), on=c("id", "dates")]
> df1
# id dates field1 field2
# 1: MUM-1 2015-07-10 1 0
# 2: MUM-1 2015-07-11 NA NA
# 3: MUM-1 2015-07-12 2 1
# 4: MUM-2 2014-01-14 4 3
# 5: MUM-2 2014-01-15 NA NA
# 6: MUM-2 2014-01-16 NA NA
# 7: MUM-2 2014-01-17 0 1
This would be very memory efficient (and faster), as it doesn't copy the entire object just to add two columns; it updates it in place instead.
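As a small sketch of what "updating in place" means, data.table's address() helper shows that the object's memory address is unchanged after a column is added by reference with :=

```r
library(data.table)

dt <- data.table(id = 1:3)
before <- address(dt)       # memory address of the object before the update
dt[, twice := id * 2L]      # := adds a column by reference, without copying dt
identical(address(dt), before)  # same address: the object was updated in place
```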
Updated with a slightly bigger dataset than @Jaap's updated benchmark:
set.seed(1L)
dt1 = CJ(id1 = paste("MUM", 1:1e4, sep = "-"), id2 = sample(1e3L))
dt2 = dt1[sample(nrow(dt1), 1e5L)][, c("field1", "field2") := lapply(c(1e3L, 1e4L), sample, 1e5L, TRUE)][]
# @Jaap's answers
system.time(ans1 <- setDT(dt2)[dt1, on = c('id1','id2')])
# user system elapsed
# 0.209 0.067 0.277
system.time(ans2 <- left_join(setDF(dt1), setDF(dt2), by = c("id1", "id2")))
# user system elapsed
# 119.911 0.530 120.749
# this answer
system.time(ans3 <- setDT(dt1)[dt2, c("field1", "field2") := list(field1, field2), on = c("id1", "id2")])
# user system elapsed
# 0.087 0.013 0.100
sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.4 (Yosemite)
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] data.table_1.9.5 dplyr_0.4.2
# loaded via a namespace (and not attached):
# [1] magrittr_1.5 R6_2.1.0 assertthat_0.1 parallel_3.2.1 DBI_0.3.1
# [6] tools_3.2.1 Rcpp_0.12.0 chron_2.3-45
Didn't expect dplyr to be ~1200x slower though.
Solution 3
You can convert both data frames to data tables, and then perform a merge:
library(data.table)
setDT(df1); setDT(df2)
merge(df1, df2, by = "id", allow.cartesian = TRUE)
The allow.cartesian argument permits the merge when there are repeated values in the key of either of the merged tables, allowing the result to have more rows than the larger of the original tables; see ?data.table.
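A minimal sketch of why allow.cartesian matters: with duplicated key values on both sides, every matching pair of rows is produced, so the result can grow beyond nrow(x) + nrow(i), at which point data.table refuses the join unless you opt in:

```r
library(data.table)

# Hypothetical toy tables with duplicated key values
a <- data.table(id = rep(1L, 3), x = 1:3)
b <- data.table(id = rep(1L, 3), y = 4:6)

# Each of the three id = 1 rows in 'a' matches all three rows in 'b',
# giving 3 x 3 = 9 result rows, more than nrow(a) + nrow(b) = 6.
# data.table errors on such joins unless allow.cartesian = TRUE.
merge(a, b, by = "id", allow.cartesian = TRUE)
```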
Solution 4
I think the quickest solution at the moment for these cases (huge datasets) is a data.table merge after setting the keys first.
You can also use dplyr's left_join with data.frames, as mentioned before, but it is worth comparing the same command after converting your data.frames to data.tables; in other words, using dplyr with a data.table structure in the background.
As an example, I'll create two datasets, save each one as a data.frame, as a keyed data.table, and as an unkeyed data.table. Then I'll perform various merges and time them:
library(data.table)
library(dplyr)
# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)
dataT11 = data.table(list, random_number, key="list") # data.table with a key
dataT12 = data.table(list, random_number) # data.table without key
dataF1 = data.frame(list, random_number)
# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)
dataT21 = data.table(list, random_number, key="list")
dataT22 = data.table(list, random_number)
dataF2 = data.frame(list, random_number)
# check your current data tables (note some have keys)
tables()
# merge the datasets as data.frames and count time
ptm <- proc.time()
dataF3 = merge(dataF1, dataF2, all.x=T)
proc.time() - ptm
# merge the datasets as data.tables by setting the key now and count time
ptm <- proc.time()
dataT3 = merge(dataT12, dataT22, all.x=T, by = "list")
proc.time() - ptm
# merge the datasets as data.tables on the key they have already and count time
ptm <- proc.time()
dataT3 = merge(dataT11, dataT21, all.x=T)
proc.time() - ptm
# merge the datasets as data.tables on the key they have already and count time (alternative)
ptm <- proc.time()
dataT3 = dataT11[dataT21]
proc.time() - ptm
# merge the datasets as data.frames using dplyr and count time
ptm <- proc.time()
dataT3 = dataF1 %>% left_join(dataF2, by="list")
proc.time() - ptm
# merge the datasets as data.tables using dplyr and count time
ptm <- proc.time()
dataT3 = dataT11 %>% left_join(dataT21, by="list")
proc.time() - ptm
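The proc.time() bracketing used above can also be written more compactly with system.time(); an equivalent sketch for one of the merges:

```r
# More compact timing of the keyed data.table merge from above;
# system.time() returns user/system/elapsed times directly.
system.time(dataT3 <- merge(dataT11, dataT21, all.x = TRUE))
```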
Dheeraj Singh
Updated on November 18, 2020

Comments
-
Dheeraj Singh over 3 years
I have two dataframes:
id     dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17
and:
id     dates       field1  field2
MUM-1  2015-07-10  1       0
MUM-1  2015-07-12  2       1
MUM-2  2014-01-14  4       3
MUM-2  2014-01-17  0       1
merged data:
id     dates       field1  field2
MUM-1  2015-07-10  1       0
MUM-1  2015-07-11  NA      NA
MUM-1  2015-07-12  2       1
MUM-2  2014-01-14  4       3
MUM-2  2014-01-15  NA      NA
MUM-2  2014-01-16  NA      NA
MUM-2  2014-01-17  0       1
code:
merge(x= df1, y= df2, by= 'id', all.x= T)
I am using merge, but since both dataframes are very large, it is taking too long to process. Is there any alternative to the merge function, maybe in dplyr, that would be faster by comparison? Both dataframes have more than 900K rows.
-
PavoDive over 8 years: It's always good to post code showing what your approach is. Visitors (and OPs) to SO come looking for code! Could you please elaborate (in code) what you meant by "data.table merging" and "setting keys"? That will get the OP out of his problem.
-
AntoniosK over 8 years: I think that the type of data and the number of rows and columns are some of the factors that affect speed. So, if that merging process has to be automated (maybe within a bigger process), we should try various merging alternatives and decide which is best for our case. My example has only 2 columns with numeric values. I don't think I can generalise.
-
AntoniosK over 8 years: Indeed it's hard to find out, because I didn't show any results, and I apologise for that. I prefer showing code that someone can re-run on his own machine with his own data, without my results (my data, my machine). I've seen that, in my industry, it makes people believe that what they see in an example is the (universal) optimal case, and they don't bother running their own code (or variations of the example code).
-
Jaap over 8 years: That's indeed (slightly) faster. Added it to the benchmarks in my answer.
-
Arun over 8 years: @VeerendraGadekar, dt is an entirely new object, allocated separately in memory. It's unnecessary in most cases. The concept of joins in data.table is easy to understand as an extension of subsets. For example, X[a %in% 1:5, b := val] and X[Y, b := val] aren't that different, except that we are used to calling one a subset and the other a join.
-
Arun over 8 years: @VeerendraGadekar there's a merge.data.table method. People who're used to base R syntax might find it easy to follow.
-
Veerendra Gadekar over 8 years: @Arun nice! PavoDive that is not my down vote anyway :)