Safely merge data frames by factor columns
The "safeguard" with merge is the by= parameter. You can set exactly which columns you think should match. If you match up two factor columns, R will use the labels for those values to match them up. So "a" will match with "a" regardless of how the hidden inner workings of the factors have coded those values. That's what a user sees, so that's how it will be merged. It's just like with numeric values: you can choose to merge on columns that have completely different ranges (the first column has 1:10, the second has 100:1000). When the by value is set, R will do what it's asked. And if you don't explicitly set the by parameter, then R will find all shared column names in the two data.frames and use those.
And when merging, you don't always expect matches. Sometimes you're using all.x or all.y to specifically get unmatched records. In this case, depending on how the different data.frames were created, one may not know about the levels it doesn't have. So it's not at all unreasonable to try to merge them.
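To illustrate the point about unmatched records, here is a minimal sketch. The data frames left and right and their columns are hypothetical, not from the original post; each key factor lacks a level that the other one has.

```r
# Hypothetical example data: the two key factors have different level sets
left  <- data.frame(key = factor(c("a", "b")), l = 1:2)
right <- data.frame(key = factor(c("b", "c")), r = 3:4)

# Each factor only knows the levels it was created with
levels(left$key)
levels(right$key)

# all = TRUE keeps unmatched rows from both sides; keys are matched by label
res <- merge(left, right, all = TRUE)
res

# Inspect which levels the merged key column ends up with
levels(res$key)
```

The merge goes through without complaint even though neither input knows all the labels, and the unmatched rows are filled with NA in the columns that come from the other side.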
So basically R is treating factors like characters during merging, because it assumes that you already know that the two columns belong together.
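If you do want a stricter check, a small wrapper around merge is one option. The following is only a sketch: safe_merge is a hypothetical helper, not part of base R, and it assumes the key columns are given by name through by (it does not inspect by.x or by.y). It stops when a key column is a factor in both inputs but the level sets differ, mirroring what == does for factors.

```r
# Hypothetical helper (not part of base R): refuse to merge when a key column
# is a factor in both inputs and the two level sets are not the same.
safe_merge <- function(x, y, by = intersect(names(x), names(y)), ...) {
  for (col in by) {
    fx <- x[[col]]
    fy <- y[[col]]
    if (is.factor(fx) && is.factor(fy) &&
        !setequal(levels(fx), levels(fy))) {
      stop("level sets of factor column '", col, "' differ", call. = FALSE)
    }
  }
  # Level sets agree (or the keys are not factors): fall back to base merge
  merge(x, y, by = by, ...)
}

# Usage with the data from the question
a  <- factor(letters[1:3])
b  <- factor(letters[1:3], levels = letters[4:1])
ad <- data.frame(x = a, a = as.numeric(a))
bd <- data.frame(x = b, b = as.numeric(b))

# try() lets the script continue after the expected error
outcome <- try(safe_merge(ad, bd), silent = TRUE)
inherits(outcome, "try-error")
```

The check compares level sets rather than level order, so two factors with the same labels in a different order still merge. Whether that is strict enough depends on the use case.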
krlmlr
Updated on July 26, 2022
Comments
-
krlmlr almost 2 years
Factors can help prevent some kinds of programming errors in R: you cannot perform an equality check on factors that use different levels, and you are warned when performing greater/less-than checks on unordered factors.
    a <- factor(letters[1:3])
    b <- factor(letters[1:3], levels=letters[4:1])
    a == b
    ## Error in Ops.factor(a, b) : level sets of factors are different
    a < a
    ## [1] NA NA NA
    ## Warning message:
    ## In Ops.factor(a, a) : < not meaningful for factors
However, contrary to my expectation, this check is not performed when merging data frames:
    ad <- data.frame(x=a, a=as.numeric(a))
    bd <- data.frame(x=b, b=as.numeric(b))
    merge(ad, bd)
    ##   x a b
    ## 1 a 1 4
    ## 2 b 2 3
    ## 3 c 3 2
Those factors simply seem to be coerced to characters.
Is a "safe merge" available somewhere that would do the check? Do you see specific reasons for not doing this check by default?
Example (real-life use case): Assume two spatial data sets with very similar but not identical subdivisions into, say, communes. The data sets refer to slightly different points in time, and some of the communes have merged during that time span. Each data set has a "commune ID" column, perhaps even named identically. While the semantics of this column are very similar, I wouldn't want to (accidentally) merge the data sets over this commune ID column. Instead, I construct a matching table between "old" and "new" commune IDs. If the commune IDs are encoded as factors, a "safe merge" would give a correctness check for the merge operation at no extra (implementation) cost and very little computational cost.
-
MrFlick almost 10 years
I don't see where data.frame is removing the factor attribute from the initial values. What is the evidence of that?
-
Carl Witthoft almost 10 years
@MrFlick try running attributes(a) and attributes(ad). Please don't dump on posts without investigating first.
-
MrFlick almost 10 years
I wasn't dumping; I did investigate first, which is why I was confused. I had compared attributes(a) and attributes(ad$x), which were identical. I was surprised to see you looking at attributes(ad). A data.frame is just a collection of vectors. The data.frame itself doesn't take on the attributes of any one of its columns, nor does attributes() recursively investigate the sub-elements. A data.frame can hold factors with different levels (stringsAsFactors=TRUE is needed from R 4.0.0 on, where strings are no longer converted to factors by default): dd <- data.frame(x=letters[1:2], y=letters[11:12], stringsAsFactors=TRUE); levels(dd$x); levels(dd$y)
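The distinction between the attributes of a data frame and those of its columns can be checked directly. A minimal sketch, rebuilding a and ad from the question so that the snippet runs on its own (the names same_levels and df_attrs are only illustrative):

```r
a  <- factor(letters[1:3])
ad <- data.frame(x = a, a = as.numeric(a))

# The column is still a factor with the same levels as the original vector
attributes(ad$x)
same_levels <- is.factor(ad$x) && identical(levels(ad$x), levels(a))
same_levels

# The data frame carries only its own attributes, not those of its columns
df_attrs <- names(attributes(ad))
df_attrs
```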
-
Carl Witthoft almost 10 years
@MrFlick My apologies -- hoist by my own petard. I'll rewrite my answer.
-
krlmlr almost 10 years
Thanks. I wouldn't have expected that. So, the behavior is to create a new factor that contains the levels found in both data frames (modulo all.x or all.y). Unfortunately, this doesn't seem to be documented (a coarse search in the docs for factor yields zero hits), but that's just another minor issue. The original question remains: Why??? (And is there an alternative that will do more strict checking?)