Select only the first row when merging data frames with multiple matches
Solution 1
Using data.table
along with mult = "first"
and nomatch = 0L
:
require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference
scores[data, mult = "first", on = "id", nomatch=0L]
# id score state
# 1: 1 66 KS
# 2: 2 86 MN
# 3: 3 76 AL
For each row on data
's id
column, the matching rows in scores
' id
column are found, and the first one alone is retained (because mult = "first"
). If there are no matches, they're removed (because of nomatch = 0L
).
Solution 2
Here is a base R method using aggregate
and head
:
merge(data, aggregate(score ~ id, data=scores, head, 1), by="id")
The aggregate
function breaks up the scores dataframe by id, then head
is applied to get the first observation from each id. Since aggregate
returns a data.frame, this is directly merged onto the data.frame data.
Probably more efficient is to subset the scores data.frame using duplicated
which will achieve the same result as aggregate
, but will reduce the computational overhead.
merge(data, scores[!duplicated(scores$id),], by="id")
Solution 3
Here is another method using dplyr::distinct. It is useful if you want to keep all rows from 'data' even if there is no match.
data = data.frame(id=c(1,2,3,4,5),
state=c("KS","MN","AL","FL","CA"))
scores = data.frame(id=c(1,1,1,2,2,3,3,3),
score=c(66,75,78,86,85,76,75,90))
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T))
# Joining, by = "id"
# id state score
# 1 1 KS 66
# 2 2 MN 86
# 3 3 AL 76
# 4 4 FL NA
# 5 5 CA NA
Moreover, if you want to replace the NAs in the new data.frame, try the tidyr::replace_na() function. Example:
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T)) %>% tidyr::replace_na(replace = list("score"=0L))
# Joining, by = "id"
# id state score
# 1 1 KS 66
# 2 2 MN 86
# 3 3 AL 76
# 4 4 FL 0
# 5 5 CA 0
Solution 4
In base you can use match to Select only the first row when merging data frames with multiple matches.
#Return also those which found no match
(tt <- cbind(data, score=scores[match(data$id, scores$id),"score"]))
# id state score
#1 1 KS 66
#2 2 MN 86
#3 3 AL 76
#4 4 FL NA
#5 5 CA NA
#Return only those which found a match
tt[!is.na(tt$score),]
# id state score
#1 1 KS 66
#2 2 MN 86
#3 3 AL 76
Related videos on Youtube
AGUY
Updated on September 15, 2022Comments
-
AGUY over 1 year
I have two data frames, "data" and "scores", and want to merge them on the "id" column:
data = data.frame(id = c(1,2,3,4,5), state = c("KS","MN","AL","FL","CA")) scores = data.frame(id = c(1,1,1,2,2,3,3,3), score = c(66,75,78,86,85,76,75,90)) merge(data, scores, by = "id") semi_join(data, scores, by = "id")
In the "scores" data, there are "id" with multiple observations, where each match gets a row following the join. See
?merge
:If there is more than one match, all possible matches contribute one row each.
However, I want keep only the row corresponding to the first match from the
scores
table.A semi join would have been nice, but I'm not able to select the score from the right table.
Any suggestions?
-
mtoto almost 8 yearsHow about using a keyed
scores
data.table:setDT(data); scores <- data.table(scores, key = "id"); unique(scores)[data, nomatch = 0L]
-
Arun almost 8 yearsIt'd work fine, but is just very inefficient.
data.table()
results in a deep copy. The key argument would reorder the entire data.table (inefficient unless case can be made for reuse and probably undesirable).unique()
results in unnecessary intermediate data. Andon=
also allows for a clean syntax to just look at it and understand what the join column is (although it can be used on keyed data.tables as well). I recommend reading the secondary indices vignette. -
Huanfa Chen about 7 yearsThis solution is good, but it is inefficient. I would use dplyr::distinct to replace aggregate.
-
lmo about 7 yearsThanks @huanfa-chen. Your suggestion gave me the idea to use
duplicated
in place ofaggregate + head
which will certainly be more efficient. -
Orhan Yazar over 4 yearsHi @Arun is it possible to tell
mult
to take the highest value according to another column instead of the first ? -
Arun over 4 years@OrhanYazar not with rolling joins, but possibly with non-equi joins. Post a Q and tag it here maybe? I (or other so many able people here on SO) could take a look.
-
skan almost 3 yearsThe mult = "first" option doesn't work if you change the order of the datatables data[scores, mult = "first", on = "id", nomatch=0L]