pandas three-way joining multiple dataframes on columns

Solution 1

Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, ..., dfN]

Assuming they have a common column, like name in your example, I'd do the following:

import functools as ft
import pandas as pd
# Fold the list pairwise: merge the running result with the next dataframe
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.
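As a rough illustration of the reduce pattern above, here is a small self-contained sketch. The three frames and their column names (x, y, z) are made up for the example; only the 'name' key comes from the question. It also shows how passing how='outer' changes which rows survive, since pd.merge defaults to an inner join.

```python
import functools as ft
import pandas as pd

# Made-up sample frames; only the shared 'name' column matters here.
df0 = pd.DataFrame({"name": ["a", "b", "c"], "x": [1, 2, 3]})
df1 = pd.DataFrame({"name": ["a", "b", "d"], "y": [4, 5, 6]})
df2 = pd.DataFrame({"name": ["a", "c", "d"], "z": [7, 8, 9]})
dfs = [df0, df1, df2]

# Default how='inner': only names present in every frame are kept.
inner = ft.reduce(lambda left, right: pd.merge(left, right, on="name"), dfs)

# how='outer': every name is kept, with NaN where a frame has no matching row.
outer = ft.reduce(
    lambda left, right: pd.merge(left, right, on="name", how="outer"), dfs
)

print(inner)
print(outer)
```

With an inner join a name missing from any one frame drops out of the result, so it is worth checking the row count after merging many frames.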

Solution 2

If you have three dataframes, you can nest or chain the merges:

import numpy as np
import pandas as pd

# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')

Alternatively, as mentioned by cwharland, chain the calls:

df1.merge(df2, on='name').merge(df3, on='name')
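One thing to watch with sample data built this way: a NumPy array holds a single dtype, so mixing strings and integers in np.array turns the numbers into strings. The sketch below shows the difference next to a dict-based constructor. The names as_array and as_dict are just for illustration, and this is a side note on the sample data, not part of the merge itself.

```python
import numpy as np
import pandas as pd

# Mixed strings and ints in one NumPy array are coerced to a string dtype.
as_array = pd.DataFrame(
    np.array([["a", 5, 9], ["b", 4, 61], ["c", 24, 9]]),
    columns=["name", "attr11", "attr12"],
)

# A dict of columns lets pandas infer a dtype per column instead.
as_dict = pd.DataFrame(
    {"name": ["a", "b", "c"], "attr11": [5, 4, 24], "attr12": [9, 61, 9]}
)

print(as_array["attr11"].tolist())  # the values are strings here
print(as_dict["attr11"].sum())      # numeric column, so sum() adds numbers

# If the data already arrived as strings, pd.to_numeric converts a column back.
attr11_numeric = pd.to_numeric(as_array["attr11"])
```

The merges work either way because the 'name' key is a string in both cases, but arithmetic on the attribute columns needs numeric dtypes.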

Solution 3

This is an ideal situation for the join method

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. When you pass a list of DataFrames, the calling DataFrame is joined to them index against index, so to work with multiple DataFrames you must first put the joining columns in the index.

The code would look something like this:

# index_col is a placeholder for the shared key column in every file
filenames = ['fn1', 'fn2', 'fn3', 'fn4', ...]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])

With @zero's data, you could do this:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name                                          
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9
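The example above has the same three names in every frame. As a hedged sketch of what happens when the keys only partly overlap, the frames below are made up with one attribute column each. join defaults to how='left', so rows follow the calling frame unless you ask for something else.

```python
import pandas as pd

# Made-up frames whose 'name' values only partly overlap.
df1 = pd.DataFrame({"name": ["a", "b", "c"], "attr11": [5, 4, 24]}).set_index("name")
df2 = pd.DataFrame({"name": ["a", "b", "d"], "attr21": [5, 14, 4]}).set_index("name")
df3 = pd.DataFrame({"name": ["a", "c", "d"], "attr31": [15, 14, 9]}).set_index("name")

# Default how='left': keeps exactly the index of the calling frame (a, b, c).
left = df1.join([df2, df3])

# how='outer': keeps the union of all indexes (a, b, c, d).
outer = df1.join([df2, df3], how="outer")

print(left)
print(outer)
```

Note that when a list is passed, join works on the index only; the on= argument is not accepted in that form.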

Solution 4

In Python 3.6.3 with pandas 0.22.0, you can also use concat, as long as you first set the columns you want to join on as the index:

pd.concat(
    (iDF.set_index('name') for iDF in [df1, df2, df3]),
    axis=1, join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt's answer:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
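The concat approach aligns rows by index label, so it is only a safe substitute for a merge when the key values are unique within each frame. Below is a minimal sketch of a wrapper that checks this first. concat_on_key is a hypothetical helper name, not a pandas function, and the sample frames are made up.

```python
import pandas as pd

def concat_on_key(frames, key, join="inner"):
    """Hypothetical helper: column-wise concat of frames aligned on `key`."""
    indexed = []
    for i, frame in enumerate(frames):
        # concat aligns on index labels, so duplicated keys are rejected up front.
        if frame[key].duplicated().any():
            raise ValueError(f"frame {i} has duplicate values in {key!r}; use merge instead")
        indexed.append(frame.set_index(key))
    combined = pd.concat(indexed, axis=1, join=join)
    # rename_axis makes sure the key column gets its name back after reset_index.
    return combined.rename_axis(key).reset_index()

df_a = pd.DataFrame({"name": ["a", "b", "c"], "attr1": [1, 2, 3]})
df_b = pd.DataFrame({"name": ["a", "b", "d"], "attr2": [4, 5, 6]})

inner = concat_on_key([df_a, df_b], "name")            # names in both frames
outer = concat_on_key([df_a, df_b], "name", "outer")   # names in either frame
```

As with concat itself, only 'inner' and 'outer' are available here; for left or right joins, merge or join is the tool to reach for.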

Solution 5

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

Or, if the dataframes come from a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
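The two loops above can be folded into one function that accepts either a list or a generator. This is only a sketch: merge_all and make_frames are hypothetical names, and the generator stands in for something like reading files lazily.

```python
import pandas as pd

def merge_all(frames, on, how="inner"):
    """Hypothetical helper: merge any iterable of DataFrames on column `on`."""
    frames = iter(frames)  # lists and generators are handled the same way
    try:
        merged = next(frames)
    except StopIteration:
        raise ValueError("no dataframes to merge") from None
    for frame in frames:
        merged = merged.merge(frame, on=on, how=how)
    return merged

def make_frames():
    # Stand-in generator; each frame is created only when the loop asks for it.
    for i in range(3):
        yield pd.DataFrame({"name": ["a", "b"], f"attr{i}": [i, i + 10]})

result = merge_all(make_frames(), on="name")
print(result)
```

Calling iter() on the input is what lets the same code consume a list without skipping or repeating its first element.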

Author: lollercoaster

Updated on April 30, 2022

Comments

  • lollercoaster
    lollercoaster about 2 years

    I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

    How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?

    The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

    • cwharland
      cwharland almost 10 years
      You don't need a multiindex. The join docs state that if you don't have a multiindex when passing multiple columns to join on, then it will handle that.
    • lollercoaster
      lollercoaster almost 10 years
      In my trials, df1.join([df2, df3], on=[df2_col1, df3_col1]) didn't work.
    • cwharland
      cwharland almost 10 years
      You need to chain them together like in the answer given. Merge df1 and df2 then merge the result with df3
  • cwharland
    cwharland almost 10 years
    For cleaner looks you can chain them df1.merge(df2,on='name').merge(df3,on='name')
  • MattR
    MattR almost 7 years
    I just tried using this and it failed because reduce was replaced with functools.reduce. So: import functools, then functools.reduce(.......)
  • ps0604
    ps0604 about 6 years
    How will this solution work if the names of the fields to join are different? For example, in three data frames I could have name1, name2 and name3 respectively.
  • Sylhare
    Sylhare almost 6 years
    It's semantic: someone may use the word "join" to mean putting two dataframes together (not necessarily as in the SQL join operation).
  • Michael H.
    Michael H. almost 6 years
    @ps0604 df1.merge(df2,left_on='name1', right_on='name2').merge(df3,left_on='name1', right_on='name3').drop(columns=['name2', 'name3']).rename(columns={'name1':'name'})
  • eapolinario
    eapolinario almost 6 years
    Doesn't this mean that we have n-1 calls to the merge function? I guess in this case where the number of dataframes is small it doesn't matter, but I wonder if there's a more scalable solution.
  • Adrian Torrie
    Adrian Torrie over 5 years
    This didn't quite work for my dfs with column multi indexes (it was injecting the 'on' as a column which worked for the first merge, but subsequent merges failed), instead I got it to work with: df = reduce(lambda left, right: left.join(right, how='outer', on='Date'), dfs)
  • Brian D
    Brian D about 5 years
    and further, how to do this using the index. Doesn't seem to work if 'name' is the index and not a column name.
  • Dominik
    Dominik almost 5 years
    Joining all of the dfs to an empty dataframe also works: pd.DataFrame().join(dfs, how="outer"). This can be cleaner in some situations.
  • cs95
    cs95 almost 5 years
    This is decent advice and has now been incorporated into pandas merging 101 (see the section on merging multiple dataframes). It's worth noting that if your join keys are unique, using pd.concat will result in simpler syntax: pd.concat([df.set_index('name') for df in dfs], axis=1, join='inner').reset_index(). concat is also more versatile when dealing with duplicate column names across multiple dfs (join isn't as good at this) although you can only perform inner or outer joins with it.
  • gies0r
    gies0r about 4 years
    dfs[0].join(dfs[1:]) should be edited to dfs[0].join(dfs[1:], sort=False) because otherwise a FutureWarning will pop up. Thanks for the nice example.
  • R. Zhu
    R. Zhu about 4 years
    This should be the accepted answer. It's the fastest.
  • steve
    steve almost 4 years
    +1 to ps0604. What if the join columns are different, does this work? Should we go with pd.merge in case the join columns are different? Thanks
  • SomJura
    SomJura almost 4 years
    I get an error on trying that: ValueError: Indexes have overlapping values, although, by inspection of the individual dataframes in the list, they don't seem to have overlapping values.
  • John Curry
    John Curry over 3 years
    Nice method. See correction below in MergeDfDict: keys = dfDict.keys(); i = 0; for key in keys:
  • haneulkim
    haneulkim about 3 years
    Thanks! What if df0,df1 have same columns to merge on and df0,df2 have same columns to merge on?
  • Marukox
    Marukox almost 3 years
    Tweaked approach is great; however, a small fix must be added to avoid ValueError: too many values to unpack (expected 2): the left suffix should be an empty string "". The final merge function could be as follows: merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=('', sfx)) # Left gets no suffix, right gets something identifiable
  • Abhilash Ramteke
    Abhilash Ramteke over 2 years
    What if dataframe shapes are different?
  • Dr Fabio Gori
    Dr Fabio Gori over 2 years
    @AbhilashRamteke If you mean that they have a different number of rows (so the name column is not the same in all data frames), then join='outer' should preserve them all, but you will have missing values. There are no issues with different column sets, as long as they all share the name column, which is used as the index.
  • Alex S.
    Alex S. about 2 years
    Thanks! This answer preserves the columns, other than the one mentioned in on, which is absolutely correct. concat wouldn't do it.
  • curiousguy
    curiousguy about 2 years
    What if they have no column in common, but only the same default column at the beginning of the dataframe? How should this be handled?
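Several comments above ask what to do when the key column has a different name in each frame. Besides the left_on/right_on chain already given by Michael H., one option is to rename the keys to a common name first and then merge as usual. The sketch below assumes made-up frames with key columns name1, name2 and name3, as in the comment.

```python
import pandas as pd

# Made-up frames whose key columns are named differently.
df1 = pd.DataFrame({"name1": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"name2": ["a", "b"], "y": [3, 4]})
df3 = pd.DataFrame({"name3": ["a", "b"], "z": [5, 6]})

# Map every variant of the key column onto one common name.
key_columns = {"name1": "name", "name2": "name", "name3": "name"}
renamed = [df.rename(columns=key_columns) for df in (df1, df2, df3)]

# After renaming, the ordinary chained merge applies.
merged = renamed[0].merge(renamed[1], on="name").merge(renamed[2], on="name")
print(merged)
```

rename ignores mapping entries that do not match a column, so one shared dictionary can be applied to every frame.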