Merging data.tables based on column names
Solution 1
Update: Since data.table v1.9.6 (released September 19, 2015), merge.data.table() accepts and handles the arguments by.x= and by.y=. Here's an updated link to the FR (now closed) referenced below.
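As a minimal sketch of what the update describes, assuming data.table 1.9.6 or later is installed: the tables below are modelled on the question's data (with the recycling of y written out explicitly), and `res` is just an illustrative name.

```r
library(data.table)

# Example tables modelled on the question's data
DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"), y1 = c(1, 3, 6), v1 = 1:3)

# Join DT$y against DT1$y1 without renaming any column
res <- merge(DT, DT1, by.x = "y", by.y = "y1")
print(res)

# Neither input is modified by the call
print(names(DT1))
```

The join column in the result takes its name from by.x, so no renaming step is needed before or after the merge.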
Yes, this is a feature request not yet implemented:
FR#2033 Add by.x and by.y to merge.data.table
There isn't anything preventing it; it just hasn't been done. I very rarely need merge and was slow to realise its usefulness more generally. We've made good progress in bringing merge performance up to the speed of X[Y], and this feature request is at the highest priority. If you'd like it more quickly, you are more than welcome to add those arguments to merge.data.table and commit the change yourself. We try to keep source code short and together in one function/file, so by looking at the merge.data.table source hopefully you can follow it and see what needs to be done.
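For readers who prefer the X[Y] form mentioned in this answer, here is a hedged sketch of a join on differently named columns. It assumes a data.table version that supports the on= argument of `[` (to my knowledge this arrived in 1.9.6, so it postdates the answer), and the example tables and the name `res` are illustrative only.

```r
library(data.table)

DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"), y1 = c(1, 3, 6), v1 = 1:3)

# X[Y]-style join: for each row of DT, look up the matching row of DT1.
# on = c(y1 = "y") pairs DT1's column y1 with DT's column y.
# nomatch = 0L drops rows of DT that have no match in DT1 (inner join).
res <- DT1[DT, on = c(y1 = "y"), nomatch = 0L]
print(res)
```

Neither table needs a key or a rename for this to work, and nomatch is available here in a way it is not through merge.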
Solution 2
The arguments by.x and by.y are now available in the development version of data.table. See here. Use devtools::install_github("Rdatatable/data.table", build_vignettes = FALSE) to install the development version of data.table.
Solution 3
You can't, because the by columns must be in the intersection of colnames(DT) and colnames(DT1):
if (!all(by %in% intersect(colnames(x), colnames(y)))) {
  stop("Elements listed in `by` must be valid column names in x and y")
}
Here is a workaround using setnames, which does not copy and is very fast:
setnames(DT1,'y1','y')
> merge(DT,DT1)
y x v x1 v1
1: 1 a 1 aa 1
2: 1 b 4 aa 1
3: 1 c 7 aa 1
4: 3 a 2 bb 2
5: 3 b 5 bb 2
6: 3 c 8 bb 2
7: 6 a 3 cc 3
8: 6 b 6 cc 3
9: 6 c 9 cc 3
EDIT: with data.table version 1.9.4 you should set the by parameter, otherwise you get an error:
Error in merge.data.table(DT, as.data.table(DT1)) :
Elements listed in `by` must be valid column names in x and y
You should do something like :
merge(DT,DT1,by="y")
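Because setnames changes DT1 in place, a variant that leaves the original table untouched is to rename a copy instead. The following is only a sketch under that assumption; `DT1_renamed` and `res` are hypothetical names, and copy() does pay the cost of duplicating DT1.

```r
library(data.table)

DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"), y1 = c(1, 3, 6), v1 = 1:3)

# Rename on a deep copy so the caller's DT1 keeps its column names
DT1_renamed <- setnames(copy(DT1), "y1", "y")

# Name the join column explicitly rather than relying on the default
res <- merge(DT, DT1_renamed, by = "y")
print(res)
```

This trades the speed of renaming by reference for not having to rename the column back afterwards.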
statquant
Updated on June 04, 2022
Comments
-
statquant almost 2 years
I am trying to do some left-join merges with data.tables. The package description states that
In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order
I understand that I can use .data.table[ and data.table:::merge.data.table
What I would like is: merge X and Y specifying the keys (like by.x and by.y in base merge; why take this away?)
Let's suppose I have
DT = data.table(x=rep(c("a","b","c"),each=3),y=c(1,3,6),v=1:9,key="x,y,v")
DT1 = data.frame(x1=c("aa","bb","cc"),y1=c(1,3,6),v1=1:3,key="x1,y1,v1")
and I would like this output:
# data.table:::merge is masking; I don't know how to call the base version of merge anymore
R) {base::merge}(DT,DT1,by.x="y",by.y="y1")
  y x v x1 v1
1 1 a 1 aa  1
2 1 c 7 aa  1
3 1 b 4 aa  1
4 3 a 2 bb  2
5 3 b 5 bb  2
6 3 c 8 bb  2
7 6 b 6 cc  3
8 6 a 3 cc  3
9 6 c 9 cc  3
I am very happy to use [ or data.table:::merge, but I would like an option that does not modify DT or DT1 (like changing the column names, calling merge, and changing them back).
-
Matt Dowle over 11 years
merge.data.table is a method for the S3 generic base function merge. To call the base merge, merge.data.frame(DT,DT1,by.x="y",by.y="y1") should work. But see my answer too.
-
-
statquant over 11 years
Yes, but then I have to set the column names back... Is there a way of doing it with [, as I might need the nomatch option?
-
agstudy over 11 years
@statquant I need to investigate the '[' solution. I am not yet a data.table user. Do you want the '[' because it is more elegant?
-
statquant over 11 years
Actually, [ is faster than merge, as merge looks up both X and Y. data.table is not very clear about merging; it lacks a good merge FAQ.
-
Matt Dowle over 11 years
@statquant Agreed, data.table is missing many things: 104 feature requests outstanding, for example. Although many of those are really TODO items rather than features per se.
-
statquant over 11 years
@matthew I am looking at merges, and I think I found a bug (maybe a feature), as merge.data.table and merge.data.frame do not output the same results for left and right outer joins.
-
statquant over 11 years
I can try to have a look (it is probably above my level though). I would need to set up svn/git though, I guess...
-
Matt Dowle over 11 years
@statquant Good, will take a look. An email to datatable-help asking if it's a bug is the best course if possible. Questions on S.O. are not supposed to be 'specific to a point in time' - that's one of the reasons to close a question. IIUC SO etiquette. But personally I don't mind any method, just grateful for the bug report is the main thing.
-
Matt Dowle over 11 years
I doubt it's much beyond your level; the source of merge is just R and X[Y] calls, which you are getting to know already. It might be a good exercise to pick off, actually. If data.table was on github would it be easier for you?
-
statquant over 11 years
Can you read the message below? I fear a bug in Y[X] itself (or a feature): if you look at the left outer join below, Y[X] shows lines it should not :( (I hope I am wrong)
-
Matt Dowle over 11 years
@statquant Oops, ignore previous long comment. I didn't look closely enough. It seems to be matching incorrectly, doesn't it? Worse than I thought. Will take a look...
-
statquant over 11 years
Yes, I think there is a problem: in the left outer join the matching seems to be done incorrectly, as on line 1 depID=NA gets depName=Eng and on line 2 name=Raf loses its depName (=NA instead of Sal).
-
statquant over 11 years
Should I repost this though, or not?
-
Matt Dowle over 11 years
@statquant Yes please. Then it can be linked to the other NA in key questions, linked to bug report etc. It's quite distinct from this question.
-
David Arenburg over 8 years
Thanks for the useful PR.
-
Ben about 8 years
This is now version 1.9.6