How to merge two data.frames together in R, referencing a lookup table

17,183

Solution 1

You can just do this:

merge(test_growth_series_LUT, test_growth_series)

It will automatically match the names. If you need to specify the column, you do it like this:

merge(test_growth_series_LUT, test_growth_series, by = "series_id")

Or this way if you need to specify on both sides (only needed if they have different names that you want to match on):

merge(test_growth_series_LUT, test_growth_series, by.x = "series_id", by.y = "series_id")

I recommend looking at the examples (and walking through them) by going to the help for merge (?merge) or by calling example("merge", "base") (less useful that actually walking through it yourself.

Two notes:

  1. You would never need to use the intersect function here. Use c() to specify multiple column names explicitly. Or use the all, all.x, and all.y parameters to specify what kind of join you want.
  2. You would use quotes to specify a column name in most cases unless you have attached the data. Otherwise it will complain about not being able to locate the name. In particular, the name needs to be in the search path when you aren't using quotes.

Solution 2

The error I'm getting is "Error in as.vector(y) : object 'series_id' not found"

A column in your data.frame can be referred to like this: test_growth_series$series_id, which returns the vector of series_id's. Doing the intersect is unnecessary, but would be correctly written like this:

intersect(test_growth_series$series_id, test_growth_series_LUT$series_id)

To be slightly more correct, you probably want to do a left join by using all.x=TRUE. This covers you in case a series_id from test_growth_series doesn't appear in your look up table. Without it, you might end up missing rows in your result.

merge(test_growth_series, test_growth_series_LUT, by = "series_id", all.x=TRUE)

This topic is also discussed in How to join data frames in R (inner, outer, left, right)?

Share:
17,183
John
Author by

John

Updated on July 03, 2022

Comments

  • John
    John almost 2 years

    I am trying to merge two data.frames together, based on a common column name in each of them called series_id. Here is my merge statement:

    merge(test_growth_series_LUT,  test_growth_series, by = intersect(series_id, series_id))
    

    The error I'm getting is

    Error in as.vector(y) : object 'series_id' not found

    The help gives this description, but I can't see why it can't find the series_id. Example data is below.

    ### S3 method for class 'data.frame':
       #merge(x, y, by = intersect(names(x), names(y)),
       #      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
       #      sort = TRUE, suffixes = c(".x",".y"), ...)
    
    
    
    # Create a long data.frame to store data...
    test_growth_series = data.frame ("read_day" = c(0, 3, 9, 0, 3, 9, 0, 2, 8), 
    "series_id" = c("p1s1", "p1s1", "p1s1", "p1s2", "p1s2", "p1s2", "p3s4", "p3s4", "p3s4"),
    "mean_od" = c(0.6, 0.9, 1.3, 0.3, 0.6, 1.0, 0.2, 0.5, 1.2),
    "sd_od" = c(0.1, 0.2, 0.2, 0.1, 0.1, 0.3, 0.04, 0.1, 0.3),
    "n_in_stat" = c(8, 8, 8, 8, 7, 5, 8, 7, 2)
    )
    
    # Create a name LUT
    test_growth_series_LUT = data.frame ("series_id" = c("p1s1", "p1s2", "p3s4", "p4s2", "p5s2", "p6s2", "p7s4", "p8s4", "p9s4"),"description" = c("blah1", "blah2", "blah3", "blah4", "blah5", "blah6", "blah7", "blah8", "blah9")
    )
    
    > test_growth_series
      read_day series_id mean_od sd_od n_in_stat
    1        0      p1s1     0.6  0.10         8
    2        3      p1s1     0.9  0.20         8
    3        9      p1s1     1.3  0.20         8
    4        0      p1s2     0.3  0.10         8
    5        3      p1s2     0.6  0.10         7
    6        9      p1s2     1.0  0.30         5
    7        0      p3s4     0.2  0.04         8
    8        2      p3s4     0.5  0.10         7
    9        8      p3s4     1.2  0.30         2
    > test_growth_series_LUT
      series_id description
    1      p1s1       blah1
    2      p1s2       blah2
    3      p3s4       blah3
    4      p4s2       blah4
    5      p5s2       blah5
    6      p6s2       blah6
    7      p7s4       blah7
    8      p8s4       blah8
    9      p9s4       blah9
    > 
    
    
    
    this is what I'm trying to achieve:  
    > new_test_growth_series
      read_day series_id mean_od sd_od n_in_stat        description
    1        0      p1s1     0.6  0.10         8        blah1
    2        3      p1s1     0.9  0.20         8        blah1
    3        9      p1s1     1.3  0.20         8        blah1
    4        0      p1s2     0.3  0.10         8        blah2
    5        3      p1s2     0.6  0.10         7        blah2
    6        9      p1s2     1.0  0.30         5        blah2
    7        0      p3s4     0.2  0.04         8        blah3
    8        2      p3s4     0.5  0.10         7        blah3
    9        8      p3s4     1.2  0.30         2        blah3