Select only the first row when merging data frames with multiple matches

17,473

Solution 1

Using data.table along with mult = "first" and nomatch = 0L:

require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference

scores[data, mult = "first", on = "id", nomatch=0L]
#    id score state
# 1:  1    66    KS
# 2:  2    86    MN
# 3:  3    76    AL

For each row on data's id column, the matching rows in scores' id column are found, and the first one alone is retained (because mult = "first"). If there are no matches, they're removed (because of nomatch = 0L).

Solution 2

Here is a base R method using aggregate and head:

merge(data, aggregate(score ~ id, data=scores, head, 1), by="id") 

The aggregate function breaks up the scores dataframe by id, then head is applied to get the first observation from each id. Since aggregate returns a data.frame, this is directly merged onto the data.frame data.


Probably more efficient is to subset the scores data.frame using duplicated which will achieve the same result as aggregate, but will reduce the computational overhead.

merge(data, scores[!duplicated(scores$id),], by="id")

Solution 3

Here is another method using dplyr::distinct. It is useful if you want to keep all rows from 'data' even if there is no match.

data = data.frame(id=c(1,2,3,4,5),
                  state=c("KS","MN","AL","FL","CA"))
scores = data.frame(id=c(1,1,1,2,2,3,3,3),
                    score=c(66,75,78,86,85,76,75,90))
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL    NA
# 5  5    CA    NA

Moreover, if you want to replace the NAs in the new data.frame, try the tidyr::replace_na() function. Example:

data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T)) %>% tidyr::replace_na(replace = list("score"=0L))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL     0
# 5  5    CA     0

Solution 4

In base you can use match to Select only the first row when merging data frames with multiple matches.

#Return also those which found no match
(tt <- cbind(data, score=scores[match(data$id, scores$id),"score"]))
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76
#4  4    FL    NA
#5  5    CA    NA

#Return only those which found a match
tt[!is.na(tt$score),]
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76
Share:
17,473

Related videos on Youtube

AGUY
Author by

AGUY

Updated on September 15, 2022

Comments

  • AGUY
    AGUY over 1 year

    I have two data frames, "data" and "scores", and want to merge them on the "id" column:

    data = data.frame(id = c(1,2,3,4,5),
                      state = c("KS","MN","AL","FL","CA"))
    scores = data.frame(id = c(1,1,1,2,2,3,3,3),
                        score = c(66,75,78,86,85,76,75,90))
    merge(data, scores, by = "id")                  
    semi_join(data, scores, by = "id")                  
    

    In the "scores" data, there are "id" with multiple observations, where each match gets a row following the join. See ?merge:

    If there is more than one match, all possible matches contribute one row each.

    However, I want keep only the row corresponding to the first match from the scores table.

    A semi join would have been nice, but I'm not able to select the score from the right table.

    Any suggestions?

  • mtoto
    mtoto almost 8 years
    How about using a keyed scores data.table: setDT(data); scores <- data.table(scores, key = "id"); unique(scores)[data, nomatch = 0L]
  • Arun
    Arun almost 8 years
    It'd work fine, but is just very inefficient. data.table() results in a deep copy. The key argument would reorder the entire data.table (inefficient unless case can be made for reuse and probably undesirable). unique() results in unnecessary intermediate data. And on= also allows for a clean syntax to just look at it and understand what the join column is (although it can be used on keyed data.tables as well). I recommend reading the secondary indices vignette.
  • Huanfa Chen
    Huanfa Chen about 7 years
    This solution is good, but it is inefficient. I would use dplyr::distinct to replace aggregate.
  • lmo
    lmo about 7 years
    Thanks @huanfa-chen. Your suggestion gave me the idea to use duplicated in place of aggregate + head which will certainly be more efficient.
  • Orhan Yazar
    Orhan Yazar over 4 years
    Hi @Arun is it possible to tell mult to take the highest value according to another column instead of the first ?
  • Arun
    Arun over 4 years
    @OrhanYazar not with rolling joins, but possibly with non-equi joins. Post a Q and tag it here maybe? I (or other so many able people here on SO) could take a look.
  • skan
    skan almost 3 years
    The mult = "first" option doesn't work if you change the order of the datatables data[scores, mult = "first", on = "id", nomatch=0L]