Left join using data.table

54,191

Solution 1

You can try this:

# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a') 

B[A]

Solution 2

If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows:

A[B, on = 'a', bb := i.b]

which gives:

> A
   a  b bb
1: 1 12 NA
2: 2 13 13
3: 3 14 14
4: 4 15 NA

This is a better approach than using B[A, on='a'] because the latter just prints the result to the console. When you want to get the results back into A, you need to use A <- B[A, on='a'] which will give you the same result.

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A)
[1] "0x102afa5d0"
> A[B, on = 'a', bb := i.b]
> address(A)
[1] "0x102afa5d0"

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A)
[1] "0x102abae50"
> A <- B[A, on = 'a']
> address(A)
[1] "0x102aa7e30"

Using merge (merge.data.table) results in a similar change in memory location:

> address(A)
[1] "0x111897e00"
> A <- merge(A, B, by = 'a', all.x = TRUE)
> address(A)
[1] "0x1118ab000"

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b] 

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

Probably also worth mentioning is that the order of A stays the same.


To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):

library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
           AA[BB, on = .(aa), cc := cc],
           iterations = 1)

which gives (only relevant measurements shown):

> bm[,c(1,3,5)]
# A tibble: 2 x 3
  expression                         median mem_alloc
  <bch:expr>                       <bch:tm> <bch:byt>
1 AA <- BB[AA, on = .(aa)]            4.98s     4.1GB
2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms   384.6MB

So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.

NOTE: Gains in speed and memory use might differ in different setups.


Used data:

# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE))
BB <- data.table(aa = sample(AA$a, 2e5), cc = sample(2:8, 2e5, TRUE))
Share:
54,191
lord12
Author by

lord12

Updated on July 05, 2022

Comments

  • lord12
    lord12 almost 2 years

    Suppose I have two data.table's:

    A:

      A  B
    1: 1 12
    2: 2 13
    3: 3 14
    4: 4 15
    

    B:

       A  B
    1: 2 13
    2: 3 14
    

    and I have the following code:

    merge_test = merge(dataA, dataB, by="A", all.data=TRUE)
    

    I get:

       A B.x B.y
    1: 2  13  13
    2: 3  14  14
    

    However, I want all the rows in dataA in the final merged table. Is there a way to do this?

  • cbailiss
    cbailiss about 7 years
    Great answer. Just to confirm, I assume the "i" in "A[B, bb:=i.b, on='a']" refers to the"i" in the general data.table "DT[i, j, by]" syntax?
  • Jaap
    Jaap about 7 years
    @cbailiss Yes, i.b mean that in updating A with the join it should look t the b-column of B. In a similar way, with the x. prefix you can refer to columns of A.
  • Prevost
    Prevost over 5 years
    @Jaap How would you manage the join by reference when there are multiple new columns created? Here the new column bb := i.b is created, which as you stated looks up the corresponding b column value in the B data.table corresponding to i. But what happens when you have many new columns that would potentially be created from merging (by reference) larger data.tables?
  • Jaap
    Jaap over 5 years
    @Prevost see here for an example, I hope that answers your question
  • Prevost
    Prevost over 5 years
    @Jaap It does answer it, thank you. I was hoping for a less typing intensive way, but I guess the trade off is typing for a by reference update. If I have been piping data.table operations to create columns I typically pipe in a merge() where available, which avoids assigning a new data.table. But with smaller data.tables or where a new data.table must be created from a join I may use your answer above or stick with merge(). Thanks for the quick comment!
  • Jaap
    Jaap over 5 years
    @Prevost the tric is in using mget, see also the last part of my answer under the link from my previous comment
  • Lodyk Vovchak
    Lodyk Vovchak almost 4 years
    This answer works fine if one data table key is subset of the other. Is there possibility to join if they intersect partially? For example if A, B are like: A <- data.table(a = 1:4, b = 12:15) B <- data.table(a = 2:5, c = 13:16)
  • Matthew Son
    Matthew Son over 3 years
    I've never seen this i. and x. syntax from data.table documentation before. Great answer
  • nate-m
    nate-m about 2 years
    this is great! thanks - I had not heard of this package.
  • gernophil
    gernophil almost 2 years
    Can you explan how this works? Within tidyverse I could do left_join(table1, table2, by = c("colname" = "colname"). This doesn't work with table.express. I need an on = argument. However, I can't find any in the documentation.
  • ToWii
    ToWii almost 2 years
    @gernophil please have a look at the join documentation for table.express. The second argument in left_join, which is by="A" above, is the on = argument you refer to. Optionally, you can use the variable name without quotes.