Understanding the "left_index" and "right_index" arguments in pandas merge
Solution 1
If I understand the behavior of merge
correctly, you should pick only one option for left
and right
respectively (i.e. You should not pick left_on=['x']
and left_index=True
at the same time). Otherwise, strange thing can happen in arbitrary way since it confuses merge
as to which key
should be actually used as you have shown in current implementation of merge
(I have not checked the pandas source in detail, but the behavior can change for different implementations in each version). Here is a small experiment.
>>> left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
>>> right
key1 key2 C D E
0 K0 K0 C0 D0 1
1 K1 K0 C1 D1 2
2 K1 K0 C2 D2 3
3 K2 K0 C3 D3 4
(1) merge
using ['key1', 'key2']
>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K1 A3 B3 NaN NaN NaN
5 K2 K0 NaN NaN C3 D3 4.0
(2) Set ['key1', 'key2']
as left
index and merge
it using the index and keys
>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)
A B key1 key2 C D E
0 A0 B0 K0 K0 C0 D0 1.0
1 A1 B1 K0 K1 NaN NaN NaN
2 A2 B2 K1 K0 C1 D1 2.0
3 A2 B2 K1 K0 C2 D2 3.0
4 A3 B3 K2 K1 NaN NaN NaN
5 NaN NaN K2 K0 C3 D3 4.0
(3) Further set ['key1', 'key2']
as right
index and merge
it using the index
>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K0 NaN NaN C3 D3 4.0
5 K2 K1 A3 B3 NaN NaN NaN
Please note that (1)(2)(3) above are showing the same results, and even if ['key1', 'key2']
are set as index, you can still use left_on = ['key1', 'key2']
instead of left_index=True
.
Now, if you really want to merge using both ['key1', 'key2']
with index
, one way to achieve this is:
>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')
index key1 key2 A B C D E
0 0 K0 K0 A0 B0 C0 D0 1.0
1 1 K0 K1 A1 B1 NaN NaN NaN
2 2 K1 K0 A2 B2 C2 D2 3.0
3 3 K2 K1 A3 B3 NaN NaN NaN
4 1 K1 K0 NaN NaN C1 D1 2.0
5 3 K2 K0 NaN NaN C3 D3 4.0
If you read down to here, I'm pretty sure now you know how to achieve above using multiple different ways. Hope this helps.
Solution 2
Merging happens in a couple of ways:
Column-Column Merge: Use left_on, right_on and how.
Example:
# Gives same answer
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how = 'outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)
Index-Index Merge: Set left_index and right_index to True or use on and use how.
Example:
pd.merge(left, right, how = 'inner', right_index = True, left_index = True)
# If you make matching unique multi-indexes for both data frames you can do
# pd.merge(left, right, how = 'inner', on = ['indexname1', 'indexname2'])
# In your data frames, you're keys duplicate values so you can't do this
# In general, a column with duplicate values does not make a good key
Column-Index Merge: Use left_on + right_index or left_index + right_on and how.
Note: Both the values in index and left_on must match. If you're index is a integer and you're left_on is a string, you get error. Also, number of indexing levels must match.
Example:
# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')
# Gives error because left_on is string and right_index is integer
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')
# This gave you error because left_on has indexing level of 2 but right_index only has indexing level of 1.
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)
You kind of mix up the different types of merges which gave weird results. If you can't see how the merging is going to happen conceptually, chances are a computer isn't going to do any better.
Related videos on Youtube
ste_kwr
Updated on September 15, 2022Comments
-
ste_kwr over 1 year
I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, experimented with various setting and tried to understand but I am still confused. Consider this example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 'key2': ['K0', 'K1', 'K0', 'K1'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']}) right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'], 'key2': ['K0', 'K0', 'K0', 'K0'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3'], 'E': [1,2,3,4]})
Now, when I run the following command:
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)
I get:
key1_x key2_x A B key1_y key2_y C D E _merge 0 K0 K0 A0 B0 K0 K0 C0 D0 1.0 both 1 K0 K1 A1 B1 K1 K0 C1 D1 2.0 both 2 K0 K1 A1 B1 K1 K0 C2 D2 3.0 both 3 K1 K0 A2 B2 NaN NaN NaN NaN NaN left_only 3 K2 K1 A3 B3 NaN NaN NaN NaN NaN left_only 3 NaN NaN NaN NaN K2 K0 C3 D3 4.0 right_only
However, running the same with
right_index=True
gives an error. Same if I introduce both. More interestingly, running the following merge gives a very unexpected resultpd.merge(left, right, on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)
Result is:
key1 key2 A B C D E _merge 0 K0 K0 A0 B0 C0 D0 1 both 1 K0 K1 A1 B1 C1 D1 2 both 2 K1 K0 A2 B2 C2 D2 3 both 3 K2 K1 A3 B3 C3 D3 4 both
As you can see, all information for right frame for
key1
andkey2
is completely lost.Please help me understand the purpose and function of these arguments. Thank you.