Select only the first row when merging data frames with multiple matches

r join

17,473

Solution 1

Using data.table along with mult = "first" and nomatch = 0L:

require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference

scores[data, mult = "first", on = "id", nomatch=0L]
#    id score state
# 1:  1    66    KS
# 2:  2    86    MN
# 3:  3    76    AL

For each row on data's id column, the matching rows in scores' id column are found, and the first one alone is retained (because mult = "first"). If there are no matches, they're removed (because of nomatch = 0L).

Solution 2

Here is a base R method using aggregate and head:

merge(data, aggregate(score ~ id, data=scores, head, 1), by="id")

The aggregate function breaks up the scores dataframe by id, then head is applied to get the first observation from each id. Since aggregate returns a data.frame, this is directly merged onto the data.frame data.

Probably more efficient is to subset the scores data.frame using duplicated which will achieve the same result as aggregate, but will reduce the computational overhead.

merge(data, scores[!duplicated(scores$id),], by="id")

Solution 3

Here is another method using dplyr::distinct. It is useful if you want to keep all rows from 'data' even if there is no match.

data = data.frame(id=c(1,2,3,4,5),
                  state=c("KS","MN","AL","FL","CA"))
scores = data.frame(id=c(1,1,1,2,2,3,3,3),
                    score=c(66,75,78,86,85,76,75,90))
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL    NA
# 5  5    CA    NA

Moreover, if you want to replace the NAs in the new data.frame, try the tidyr::replace_na() function. Example:

data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T)) %>% tidyr::replace_na(replace = list("score"=0L))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL     0
# 5  5    CA     0

Solution 4

In base you can use match to Select only the first row when merging data frames with multiple matches.

#Return also those which found no match
(tt <- cbind(data, score=scores[match(data$id, scores$id),"score"]))
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76
#4  4    FL    NA
#5  5    CA    NA

#Return only those which found a match
tt[!is.na(tt$score),]
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76

View more solutions

17,473

AGUY

Updated on September 15, 2022

Comments

AGUY over 1 year
I have two data frames, "data" and "scores", and want to merge them on the "id" column:
```
data = data.frame(id = c(1,2,3,4,5),
                  state = c("KS","MN","AL","FL","CA"))
scores = data.frame(id = c(1,1,1,2,2,3,3,3),
                    score = c(66,75,78,86,85,76,75,90))
merge(data, scores, by = "id")                  
semi_join(data, scores, by = "id")                  
```
In the "scores" data, there are "id" with multiple observations, where each match gets a row following the join. See ?merge:

If there is more than one match, all possible matches contribute one row each.

However, I want keep only the row corresponding to the first match from the scores table.

A semi join would have been nice, but I'm not able to select the score from the right table.

Any suggestions?
mtoto almost 8 years

How about using a keyed scores data.table: setDT(data); scores <- data.table(scores, key = "id"); unique(scores)[data, nomatch = 0L]
Arun almost 8 years

It'd work fine, but is just very inefficient. data.table() results in a deep copy. The key argument would reorder the entire data.table (inefficient unless case can be made for reuse and probably undesirable). unique() results in unnecessary intermediate data. And on= also allows for a clean syntax to just look at it and understand what the join column is (although it can be used on keyed data.tables as well). I recommend reading the secondary indices vignette.
Huanfa Chen about 7 years

This solution is good, but it is inefficient. I would use dplyr::distinct to replace aggregate.
lmo about 7 years

Thanks @huanfa-chen. Your suggestion gave me the idea to use duplicated in place of aggregate + head which will certainly be more efficient.
Orhan Yazar over 4 years

Hi @Arun is it possible to tell mult to take the highest value according to another column instead of the first ?
Arun over 4 years

@OrhanYazar not with rolling joins, but possibly with non-equi joins. Post a Q and tag it here maybe? I (or other so many able people here on SO) could take a look.
skan almost 3 years

The mult = "first" option doesn't work if you change the order of the datatables data[scores, mult = "first", on = "id", nomatch=0L]