Python: pandas merge multiple dataframes
Solution 1
Below, is the most clean, comprehensible way of merging multiple dataframe if complex queries aren't involved.
Just simply merge with DATE as the index and merge using OUTER method (to get all the data).
import pandas as pd
from functools import reduce
df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')
Now, basically load all the files you have as data frame into a list. And, then merge the files using merge
or reduce
function.
# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]
Note: you can add as many data-frames inside the above list. This is the good part about this method. No complex queries involved.
To keep the values that belong to the same date you need to merge it on the DATE
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames)
# if you want to fill the values that don't exist in the lines of merged dataframe simply fill with required strings as
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames).fillna('void')
- Now, the output will the values from the same date on the same lines.
- You can fill the non existing data from different frames for different columns using fillna().
Then write the merged data to the csv file if desired.
pd.DataFrame.to_csv(df_merged, 'merged.txt', sep=',', na_rep='.', index=False)
This should give you
DATE VALUE1 VALUE2 VALUE3 ....
Solution 2
Looks like the data has the same columns, so you can:
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged_df = pd.concat([df1, df2])
Solution 3
functools.reduce and pd.concat are good solutions but in term of execution time pd.concat is the best.
from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, ...]
nan_value = 0
# solution 1 (fast)
result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)
# solution 2
result_2 = reduce(lambda df_left,df_right: pd.merge(df_left, df_right,
left_index=True, right_index=True,
how='outer'),
dfs).fillna(nan_value)
Solution 4
There are 2 solutions for this, but it return all columns separately:
import functools
dfs = [df1, df2, df3]
df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='date'), dfs)
print (df_final)
date a_x b_x a_y b_y c_x a b c_y
0 May 15,2017 900.00 0.2% 1,900.00 1000000 0.2% 2,900.00 2000000 0.2%
k = np.arange(len(dfs)).astype(str)
df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
df.columns = df.columns.map('_'.join)
print (df)
0_a 0_b 1_a 1_b 1_c 2_a 2_b 2_c
date
May 15,2017 900.00 0.2% 1,900.00 1000000 0.2% 2,900.00 2000000 0.2%
Solution 5
Another way to combine: functools.reduce
From documentation:
For example,
reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])
calculates ((((1+2)+3)+4)+5). The left argument, x, is the accumulated value and the right argument, y, is the update value from the iterable.
So:
from functools import reduce
dfs = [df1, df2, df3, df4, df5, df6]
df_final = reduce(lambda left,right: pd.merge(left,right,on='some_common_column_name'), dfs)
Vasco Ferreira
I've recently finished my Masters Degree in Business Management. I am a newbie programmer, I just like to learn and it helps me to analyze a problem from a different perspective. The first program I have made, I've made it using Python. The program just gets data from an excel file and then prepares the data to be imported to the company's website. This program reduced to 20 minutes what would instead take 75 minutes of work. Now I am learning what AI is, be it for games, work or personal projects. What I have been learning so far: Using a Linux system - Basic Level; Using Spacemacs (Emacs) - Basic Level; Excel VBA; Python Programming Pandas library; Data Analysis Machine Learning: Keras library: ANN; CNN; LSTM. Pytorch Scraping websites with Scrapy; Download data from websites with Requests and BeautifulSoup; Creating a website with Django; Creating an API with flask; Clojure and Clojurescript Web-apps with Re-Frame API Functional programming mindset Statistics Monte Carlo Method
Updated on January 19, 2022Comments
-
Vasco Ferreira over 2 years
I have diferent dataframes and need to merge them together based on the date column. If I only had two dataframes, I could use
df1.merge(df2, on='date')
, to do it with three dataframes, I usedf1.merge(df2.merge(df3, on='date'), on='date')
, however it becomes really complex and unreadable to do it with multiple dataframes.All dataframes have one column in common -
date
, but they don't have the same number of rows nor columns and I only need those rows in which each date is common to every dataframe.So, I'm trying to write a recursion function that returns a dataframe with all data but it didn't work. How should I merge multiple dataframes then?
I tried diferent ways and got errors like
out of range
,keyerror 0/1/2/3
andcan not merge DataFrame with instance of type <class 'NoneType'>
.This is the script I wrote:
dfs = [df1, df2, df3] # list of dataframes def mergefiles(dfs, countfiles, i=0): if i == (countfiles - 2): # it gets to the second to last and merges it with the last return dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date') return dfm print(mergefiles(dfs, len(dfs)))
An example: df_1:
May 19, 2017;1,200.00;0.1% May 18, 2017;1,100.00;0.1% May 17, 2017;1,000.00;0.1% May 15, 2017;1,901.00;0.1%
df_2:
May 20, 2017;2,200.00;1000000;0.2% May 18, 2017;2,100.00;1590000;0.2% May 16, 2017;2,000.00;1230000;0.2% May 15, 2017;2,902.00;1000000;0.2%
df_3:
May 21, 2017;3,200.00;2000000;0.3% May 17, 2017;3,100.00;2590000;0.3% May 16, 2017;3,000.00;2230000;0.3% May 15, 2017;3,903.00;2000000;0.3%
Expected merge result:
May 15, 2017; 1,901.00;0.1%; 2,902.00;1000000;0.2%; 3,903.00;2000000;0.3%
-
Vasco Ferreira almost 7 yearsbut in this way it can only get the result for 3 files. What if I try with 4 files? Do I need to do:
set(df1.loc[:, 0].intersection(set(df3.loc[:, 0]).intersection(set(df2.loc[:, 0])).intersection(set(df1.loc[:, 0])))
? -
zipa almost 7 years@VascoFerreira I edited the code to match that situation as well.
-
steve almost 4 yearswhat if the join columns are different, does this work? should we go with pd.merge incase the join columns are different?
-
Nicolas Martinez almost 4 yearsJust a little note: If you're on python3 you need to import reduce from functools
-
Gerard over 3 yearsBu what if you dont have the same columns?
-
Pogger over 3 yearsNice. If have same column to merge on we can use it.
-
jengeb over 3 yearsIn addition to what @NicolasMartinez mentioned:
from functools import reduce # only in Python 3
-
cikatomo almost 3 yearsdo you use
on=...
? -
Ziu almost 3 yearsconcat can auto join by index, so if you have same columns ,set them to index @Gerard
-
spectre about 2 years@Ismail Hachimi But
pd.concat
cannot left merge. So for people who want to left merge multiple dataframesfunctools.reduce
is the best way to go