Pandas merge two dataframes with different columns
Solution 1
I think in this case concat is what you want:
In [12]:
pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25
Passing axis=0 stacks the DataFrames on top of each other, which I believe is what you want. Any column that is missing from one of the DataFrames is filled with NaN in that DataFrame's rows.
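For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two frames from the sample data in the question and stacks them. The variable name combined is only illustrative, and the column order of the result may differ from the output above depending on the pandas version.

```python
import pandas as pd

# Frames rebuilt from the sample data shown in the question
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns missing from one frame become NaN in its rows.
# ignore_index=True renumbers the result 0..7 instead of repeating 0..3.
combined = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(combined)
```

Note that attr_2 and attr_3 come back as float columns, because NaN cannot be stored in an integer column.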
Solution 2
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has three trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1      4.0      1.0
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
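Since _maybe_dedup_names is a private method, it may not be available in every pandas release. As a rough alternative, here is a sketch that does the same renaming with plain Python. The helper dedup_columns is hypothetical (it is not part of pandas), and it assumes that none of the existing column names already look like trial.1.

```python
import pandas as pd

def dedup_columns(columns):
    """Hypothetical helper: rename repeated labels to name, name.1, name.2, ..."""
    seen = {}      # label -> number of times it has appeared so far
    result = []
    for name in columns:
        count = seen.get(name, 0)
        # First occurrence keeps its label; later ones get a numeric suffix
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return result

A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

# Relabel each frame (set_axis returns a new frame), then stack as usual
frames = [df.set_axis(dedup_columns(df.columns), axis=1) for df in (A, B)]
result = pd.concat(frames, ignore_index=True)
print(result)
```

This only uses public pandas calls (set_axis and concat), so it should not depend on parser internals.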
Solution 3
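I had this problem today with concat, append, and merge alike, and I got around it by adding a sequentially numbered helper column to both DataFrames and then doing an outer join on it:
# Number the rows of df1, then continue the count in df2 so no values overlap
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
# No helper value matches across frames, so the outer join keeps every row
df1.merge(df2, on='helper', how='outer')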
Comments
-
economy over 2 years
I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
   id  quantity  attr_1  attr_2
0   1        20       0       1
1   2        23       1       1
2   3        19       1       1
3   4        19       0       0

>df_jun
   id  quantity  attr_1  attr_3
0   5         8       1       0
1   6        13       0       1
2   7        20       1       1
3   8        25       1       1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id, like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
Thanks in advance.
-
lucid_dreamer over 5 years
What didn't work with the accepted answer: pd.concat([df,df1], axis=0, ignore_index=True)?
-
MattiH almost 3 years
For some reason this doesn't work for me. I got pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
-
Alexey Antonenko over 2 years
I tried merging three DFs with different columns that way. Some columns were added, some were lost.
-
Pavel Prochazka over 2 years
I arrived at this with non-unique columns. Consider a = pd.DataFrame({'d':[1], 'b':[2]}).rename(columns={'b':'d'}) and b = pd.DataFrame({'d':[4, 6]}); then pd.concat([a, b], axis=0, ignore_index=True) would fail. Although some workarounds can be applied, I believe it is better to fix the root of the problem and make the column names unique (as in my case). Also, I would expect a warning when renaming a column to a name that already exists.
-
sql_knievel about 2 years
This doesn't seem to be working for me in my current use case, either. Some columns get dropped. It seems to be sensitive to which dataframe is first in the list to be concatenated. Oddly, running the example from the official concat docs works as advertised regardless of order.