Safely merge data frames by factor columns

The "safeguard" with merge is the by= parameter. You can set exactly which columns you think should match. If you match up two factor columns, R will use the labels of those values to match them up. So "a" will match with "a" regardless of how the hidden inner workings of the factors have coded those values. That's what a user sees, so that's how it will be merged. It's just like with numeric values: you can choose to merge on columns that have completely different ranges (the first column has 1:10, the second has 100:1000). When the by value is set, R will do what it's asked. And if you don't explicitly set the by parameter, then R will find all shared column names in the two data.frames and use those.
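
To make the label matching concrete, here is a minimal sketch. The data frames and column names (left, right, key, left_code, right_code) are made up for illustration; the two factors carry the same labels but different internal integer codes.

```r
# Same labels, different internal integer codes
x1 <- factor(c("a", "b", "c"))                             # codes 1, 2, 3
x2 <- factor(c("a", "b", "c"), levels = c("c", "b", "a"))  # codes 3, 2, 1

left  <- data.frame(key = x1, left_code  = as.integer(x1))
right <- data.frame(key = x2, right_code = as.integer(x2))

# Rows are paired by label ("a" with "a"), not by integer code
merged <- merge(left, right, by = "key")
merged
```

Each row of merged pairs the codes that belong to the same label, so the row for "a" carries left_code 1 and right_code 3.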

And many times when merging, you don't expect every record to match. Sometimes you're using all.x or all.y specifically to get unmatched records. In this case, depending on how the different data.frames were created, one may not know about the levels it doesn't have. So it's not at all unreasonable to try to merge them.
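
As a rough illustration of that case, the sketch below merges two made-up data frames (left, right, id, v, w are illustrative names) whose factor keys only partly overlap, and uses all.x to keep the unmatched record.

```r
# Factor keys that only partly overlap: "a" exists only on the left,
# "d" only on the right
left  <- data.frame(id = factor(c("a", "b", "c")), v = 1:3)
right <- data.frame(id = factor(c("b", "c", "d")), w = c(20, 30, 40))

# all.x = TRUE keeps every row of left; unmatched rows get NA in w
m <- merge(left, right, by = "id", all.x = TRUE)
m

# The record with no partner in right
m[is.na(m$w), ]
```

The different level sets are not an error here; the unmatched "a" row is simply kept with NA in the column that comes from right.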

So basically R is treating factors like characters during merging, because it assumes that you already know that the two columns belong together.
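
If you would rather have the stricter behaviour, one option is a small wrapper around merge. The function below is a hypothetical sketch, not something base R provides: safe_merge is an invented name, it only handles key columns that share a name in both data frames (no by.x/by.y), and it assumes that "compatible" means the two factors have the same set of levels, mirroring the == check on factors.

```r
# Hypothetical helper (not part of base R): refuse to merge when a factor
# key column has different level sets in the two data frames.
safe_merge <- function(x, y, by = intersect(names(x), names(y)), ...) {
  for (col in by) {
    fx <- x[[col]]
    fy <- y[[col]]
    # Only check columns that are factors on both sides
    if (is.factor(fx) && is.factor(fy) &&
        !setequal(levels(fx), levels(fy))) {
      stop("level sets of factor column '", col, "' differ", call. = FALSE)
    }
  }
  merge(x, y, by = by, ...)
}

# Two factors with the same values but different level sets
a <- factor(letters[1:3])
b <- factor(letters[1:3], levels = letters[4:1])
ad <- data.frame(x = a, a = as.numeric(a))
bd <- data.frame(x = b, b = as.numeric(b))

# The wrapper stops instead of silently matching on labels
res <- tryCatch(safe_merge(ad, bd), error = function(e) conditionMessage(e))
res
```

Note that a check like this would also reject the all.x / all.y use case, where differing levels are expected, so it is a trade-off rather than a drop-in replacement.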

Author by

krlmlr

Updated on July 26, 2022

Comments

  • krlmlr
    krlmlr almost 2 years

    Factors can help prevent some kinds of programming errors in R: you cannot perform an equality check on factors that use different levels, and you are warned when performing greater/less-than checks on unordered factors.

    a <- factor(letters[1:3])
    b <- factor(letters[1:3], levels=letters[4:1])
    a == b
    ## Error in Ops.factor(a, b) : level sets of factors are different
    a < a
    ## [1] NA NA NA
    ## Warning message:
    ## In Ops.factor(a, a) : < not meaningful for factors
    

    However, contrary to my expectation, this check is not performed when merging data frames:

    ad <- data.frame(x=a, a=as.numeric(a))
    bd <- data.frame(x=b, b=as.numeric(b))
    merge(ad, bd)
    ##   x a b
    ## 1 a 1 4
    ## 2 b 2 3
    ## 3 c 3 2
    

    Those factors simply seem to be coerced to characters.

    Is a "safe merge" available somewhere that would do the check? Do you see specific reasons for not doing this check by default?

    Example (real-life use case): Assume two spatial data sets with very similar but not identical subdivision in, say, communes. The data sets refer to slightly different points in time, and some of the communes have merged during that time span. Each data set has a "commune ID" column, perhaps even named identically. While the semantics of this column are very similar, I wouldn't want to (accidentally) merge the data sets over this commune ID column. Instead, I construct a matching table between "old" and "new" commune IDs. If the commune IDs are encoded as factors, a "safe merge" would give a correctness check for the merge operation at no extra (implementation) cost and very little computational cost.

  • MrFlick
    MrFlick almost 10 years
    I don't see where data.frame is removing the factor attribute from the initial values. What is the evidence of that?
  • Carl Witthoft
    Carl Witthoft almost 10 years
    @MrFlick try running attributes(a) and attributes(ad) . Please don't dump on posts without investigating first.
  • MrFlick
    MrFlick almost 10 years
    I wasn't dumping; I did investigate first, which is why I was confused. I had compared attributes(a) and attributes(ad$x), which were identical. I was surprised to see you looking at attributes(ad). A data.frame is just a collection of vectors. The data.frame itself doesn't take on the attributes of any one of its columns, nor does attributes() recursively investigate the sub-elements. A data.frame can hold factors with different levels: dd <- data.frame(x=letters[1:2], y=letters[11:12], stringsAsFactors=TRUE); levels(dd$x); levels(dd$y)
  • Carl Witthoft
    Carl Witthoft almost 10 years
    @MrFlick My apologies -- hoist by my own petard. I'll rewrite my answer.
  • krlmlr
    krlmlr almost 10 years
    Thanks. I wouldn't have expected that. So, the behavior is to create a new factor that contains the levels found in both data frames (modulo all.x or all.y). Unfortunately, this doesn't seem to be documented (a coarse search in the docs for factor yields zero hits), but that's just another minor issue. The original question remains: Why??? (And is there an alternative that will do more strict checking?)