Understanding the "left_index" and "right_index" arguments in pandas merge

18,590

Solution 1

If I understand the behavior of merge correctly, you should pick only one option for left and right respectively (i.e. You should not pick left_on=['x'] and left_index=True at the same time). Otherwise, strange thing can happen in arbitrary way since it confuses merge as to which key should be actually used as you have shown in current implementation of merge (I have not checked the pandas source in detail, but the behavior can change for different implementations in each version). Here is a small experiment.

>>> left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

>>> right
  key1 key2   C   D  E
0   K0   K0  C0  D0  1
1   K1   K0  C1  D1  2
2   K1   K0  C2  D2  3
3   K2   K0  C3  D3  4

(1) merge using ['key1', 'key2']

>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K1   A3   B3  NaN  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3  4.0

(2) Set ['key1', 'key2'] as left index and merge it using the index and keys

>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)

     A    B key1 key2    C    D    E
0   A0   B0   K0   K0   C0   D0  1.0
1   A1   B1   K0   K1  NaN  NaN  NaN
2   A2   B2   K1   K0   C1   D1  2.0
3   A2   B2   K1   K0   C2   D2  3.0
4   A3   B3   K2   K1  NaN  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3  4.0

(3) Further set ['key1', 'key2'] as right index and merge it using the index

>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K0  NaN  NaN   C3   D3  4.0
5   K2   K1   A3   B3  NaN  NaN  NaN

Please note that (1)(2)(3) above are showing the same results, and even if ['key1', 'key2'] are set as index, you can still use left_on = ['key1', 'key2'] instead of left_index=True.

Now, if you really want to merge using both ['key1', 'key2'] with index, one way to achieve this is:

>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')

   index key1 key2    A    B    C    D    E
0      0   K0   K0   A0   B0   C0   D0  1.0
1      1   K0   K1   A1   B1  NaN  NaN  NaN
2      2   K1   K0   A2   B2   C2   D2  3.0
3      3   K2   K1   A3   B3  NaN  NaN  NaN
4      1   K1   K0  NaN  NaN   C1   D1  2.0
5      3   K2   K0  NaN  NaN   C3   D3  4.0

If you read down to here, I'm pretty sure now you know how to achieve above using multiple different ways. Hope this helps.

Solution 2

Merging happens in a couple of ways:

Column-Column Merge: Use left_on, right_on and how.

Example:

# Gives same answer
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how = 'outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)

Index-Index Merge: Set left_index and right_index to True or use on and use how.

Example:

pd.merge(left, right, how = 'inner', right_index = True, left_index = True)
# If you make matching unique multi-indexes for both data frames you can do
# pd.merge(left, right, how = 'inner', on = ['indexname1', 'indexname2'])
# In your data frames, you're keys duplicate values so you can't do this
# In general, a column with duplicate values does not make a good key

Column-Index Merge: Use left_on + right_index or left_index + right_on and how.

Note: Both the values in index and left_on must match. If you're index is a integer and you're left_on is a string, you get error. Also, number of indexing levels must match.

Example:

# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')  

# Gives error because left_on is string and right_index is integer
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')

# This gave you error because left_on has indexing level of 2 but right_index only has indexing level of 1.
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)

You kind of mix up the different types of merges which gave weird results. If you can't see how the merging is going to happen conceptually, chances are a computer isn't going to do any better.

Share:
18,590

Related videos on Youtube

ste_kwr
Author by

ste_kwr

Updated on September 15, 2022

Comments

  • ste_kwr
    ste_kwr over 1 year

    I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, experimented with various setting and tried to understand but I am still confused. Consider this example:

    left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
    
    right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                          'key2': ['K0', 'K0', 'K0', 'K0'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3'],
                          'E': [1,2,3,4]})
    

    Now, when I run the following command:

    pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)
    

    I get:

      key1_x key2_x    A    B key1_y key2_y    C    D    E      _merge
    0     K0     K0   A0   B0     K0     K0   C0   D0  1.0        both
    1     K0     K1   A1   B1     K1     K0   C1   D1  2.0        both
    2     K0     K1   A1   B1     K1     K0   C2   D2  3.0        both
    3     K1     K0   A2   B2    NaN    NaN  NaN  NaN  NaN   left_only
    3     K2     K1   A3   B3    NaN    NaN  NaN  NaN  NaN   left_only
    3    NaN    NaN  NaN  NaN     K2     K0   C3   D3  4.0  right_only
    

    However, running the same with right_index=True gives an error. Same if I introduce both. More interestingly, running the following merge gives a very unexpected result

    pd.merge(left, right,  on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)
    

    Result is:

      key1 key2   A   B   C   D  E _merge
    0   K0   K0  A0  B0  C0  D0  1   both
    1   K0   K1  A1  B1  C1  D1  2   both
    2   K1   K0  A2  B2  C2  D2  3   both
    3   K2   K1  A3  B3  C3  D3  4   both
    

    As you can see, all information for right frame for key1 and key2 is completely lost.

    Please help me understand the purpose and function of these arguments. Thank you.