Left join using data.table
Solution 1
You can try this:
# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')
B[A]
Solution 2
If you want to add the b
values of B
to A
, then it's best to join A
with B
and update A
by reference as follows:
A[B, on = 'a', bb := i.b]
which gives:
> A a b bb 1: 1 12 NA 2: 2 13 13 3: 3 14 14 4: 4 15 NA
This is a better approach than using B[A, on='a']
because the latter just prints the result to the console. When you want to get the results back into A
, you need to use A <- B[A, on='a']
which will give you the same result.
The reason why A[B, on = 'a', bb := i.b]
is better than A <- B[A, on = 'a']
is memory efficiency. With A[B, on = 'a', bb := i.b]
the location of A
in memory stays the same:
> address(A) [1] "0x102afa5d0" > A[B, on = 'a', bb := i.b] > address(A) [1] "0x102afa5d0"
While on the other hand with A <- B[A, on = 'a']
, a new object is created and saved in memory as A
and hence has another location in memory:
> address(A) [1] "0x102abae50" > A <- B[A, on = 'a'] > address(A) [1] "0x102aa7e30"
Using merge
(merge.data.table
) results in a similar change in memory location:
> address(A) [1] "0x111897e00" > A <- merge(A, B, by = 'a', all.x = TRUE) > address(A) [1] "0x1118ab000"
For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:
A[B, on = 'a', bb := i.b]
Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table
was designed.
Probably also worth mentioning is that the order of A
stays the same.
To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):
library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
AA[BB, on = .(aa), cc := cc],
iterations = 1)
which gives (only relevant measurements shown):
> bm[,c(1,3,5)] # A tibble: 2 x 3 expression median mem_alloc <bch:expr> <bch:tm> <bch:byt> 1 AA <- BB[AA, on = .(aa)] 4.98s 4.1GB 2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms 384.6MB
So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.
NOTE: Gains in speed and memory use might differ in different setups.
Used data:
# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)
# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE))
BB <- data.table(aa = sample(AA$a, 2e5), cc = sample(2:8, 2e5, TRUE))
lord12
Updated on July 05, 2022Comments
-
lord12 almost 2 years
Suppose I have two data.table's:
A:
A B 1: 1 12 2: 2 13 3: 3 14 4: 4 15
B:
A B 1: 2 13 2: 3 14
and I have the following code:
merge_test = merge(dataA, dataB, by="A", all.data=TRUE)
I get:
A B.x B.y 1: 2 13 13 2: 3 14 14
However, I want all the rows in dataA in the final merged table. Is there a way to do this?
-
cbailiss about 7 yearsGreat answer. Just to confirm, I assume the "i" in "A[B, bb:=i.b, on='a']" refers to the"i" in the general data.table "DT[i, j, by]" syntax?
-
Jaap about 7 years@cbailiss Yes,
i.b
mean that in updatingA
with the join it should look t theb
-column ofB
. In a similar way, with thex.
prefix you can refer to columns ofA
. -
Prevost over 5 years@Jaap How would you manage the join by reference when there are multiple new columns created? Here the new column
bb := i.b
is created, which as you stated looks up the correspondingb
column value in theB
data.table
corresponding toi
. But what happens when you have many new columns that would potentially be created from merging (by reference) largerdata.tables
? -
Jaap over 5 years@Prevost see here for an example, I hope that answers your question
-
Prevost over 5 years@Jaap It does answer it, thank you. I was hoping for a less typing intensive way, but I guess the trade off is typing for a by reference update. If I have been piping
data.table
operations to create columns I typically pipe in amerge()
where available, which avoids assigning a newdata.table
. But with smallerdata.tables
or where a newdata.table
must be created from a join I may use your answer above or stick withmerge()
. Thanks for the quick comment! -
Jaap over 5 years@Prevost the tric is in using
mget
, see also the last part of my answer under the link from my previous comment -
Lodyk Vovchak almost 4 yearsThis answer works fine if one data table key is subset of the other. Is there possibility to join if they intersect partially? For example if
A, B
are like:A <- data.table(a = 1:4, b = 12:15) B <- data.table(a = 2:5, c = 13:16)
-
Matthew Son over 3 yearsI've never seen this
i.
andx.
syntax fromdata.table
documentation before. Great answer -
nate-m about 2 yearsthis is great! thanks - I had not heard of this package.
-
gernophil almost 2 yearsCan you explan how this works? Within tidyverse I could do
left_join(table1, table2, by = c("colname" = "colname")
. This doesn't work with table.express. I need anon =
argument. However, I can't find any in the documentation. -
ToWii almost 2 years@gernophil please have a look at the join documentation for
table.express
. The second argument inleft_join
, which isby="A"
above, is theon =
argument you refer to. Optionally, you can use the variable name without quotes.