How to merge two DataFrames based on a few columns in the second DataFrame in Python Pandas?

I have two Pandas DataFrames in Python, like below:

df1

ID
----
11
22
33
44

df2

ID1 | ID2 | ID3
--------------------
11  | 5   | 114
88  | 22  | 18
99  | 45  | 33

df1 has more rows than df2.
The values in both DataFrames are of type int.

I need to do something like df1 LEFT JOIN df2, merging df1 with df2 using "ID" from df1 and "ID1", "ID2", "ID3" from df2:

merge the DataFrames on ID (df1) and ID1 (df2)
if ID does not match ID1 --> merge on ID and ID2
if ID does not match ID2 --> merge on ID and ID3
fill in 123456 if a row does not match at all

So as a result I need something like below:

ID  | ID1    | ID2     | ID3
--------------------------
11  | 11     | 5       | 114
22  | 88     | 22      | 18
33  | 99     | 45      | 33
44  | 123456 | 123456  | 123456

How can I do that in Python Pandas? I really do not know where to start.

CodePudding user response：

You can stack df2 into a long-format frame df2a, left join df1 with df2a on ID, and then left join the original df2 on its original index (kept by the stack as level_0). Finally, fill NaN with 123456 and drop the intermediate columns to arrive at the desired output:

df2a = df2.stack().reset_index(name='ID')

df_out = (df1.merge(df2a, on='ID', how='left')
             .merge(df2, left_on='level_0', right_index=True, how='left')
             .fillna(123456, downcast='infer')
             .drop(['level_0', 'level_1'], axis=1)
         )

Or simplify the second .merge to a .join (thanks to @HenryEcker for the suggestion), as follows:

df2a = df2.stack().reset_index(name='ID')

df_out = (df1.merge(df2a, on='ID', how='left')
             .join(df2, on='level_0')
             .fillna(123456, downcast='infer')
             .drop(['level_0', 'level_1'], axis=1)
         )

Result:

print(df_out)

   ID     ID1     ID2     ID3
0  11      11       5     114
1  22      88      22      18
2  33      99      45      33
3  44  123456  123456  123456
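
To reproduce this end to end, here is a self-contained sketch using the sample frames from the question. It assumes a recent pandas release, where the downcast argument of fillna is reported as deprecated, so it drops the helper columns first and casts back to int explicitly. That cast is a substitution for downcast='infer', not something the snippets above use.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"ID": [11, 22, 33, 44]})
df2 = pd.DataFrame({
    "ID1": [11, 88, 99],
    "ID2": [5, 22, 45],
    "ID3": [114, 18, 33],
})

# Long form of df2: one row per (original row label, column name, value).
df2a = df2.stack().reset_index(name="ID")

df_out = (
    df1.merge(df2a, on="ID", how="left")      # find which df2 row holds each ID
       .join(df2, on="level_0")               # pull in that df2 row (NaN if none)
       .drop(columns=["level_0", "level_1"])  # helper columns are no longer needed
       .fillna(123456)                        # marker value for unmatched rows
       .astype(int)                           # NaN forced float columns; cast back
)
print(df_out)
```

Dropping level_1 before the cast matters, because that column holds column names as strings and cannot be converted to int.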

Breakdown of the intermediate step:

print(df2a)

   level_0 level_1   ID
0        0     ID1   11
1        0     ID2    5
2        0     ID3  114
3        1     ID1   88
4        1     ID2   22
5        1     ID3   18
6        2     ID1   99
7        2     ID2   45
8        2     ID3   33
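
If an ID could appear in more than one place in df2, the left join on df2a would return one row per match. One possible way to honor the ID1, then ID2, then ID3 priority from the question is to keep only the first match per ID before merging. This is only a sketch: it assumes the column names sort in priority order, and the data below is hypothetical, chosen so that 22 appears twice.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [11, 22, 44]})
# Hypothetical data: 22 appears as ID2 in row 0 and as ID1 in row 1.
df2 = pd.DataFrame({
    "ID1": [11, 22],
    "ID2": [22, 5],
    "ID3": [114, 18],
})

df2a = df2.stack().reset_index(name="ID")

# Sort by column name so ID1 matches come before ID2 and ID3,
# then keep only the highest-priority match for each ID value.
first_match = (
    df2a.sort_values("level_1", kind="stable")
        .drop_duplicates(subset="ID", keep="first")
)

df_prio = (
    df1.merge(first_match, on="ID", how="left")
       .join(df2, on="level_0")
       .drop(columns=["level_0", "level_1"])
       .fillna(123456)
       .astype(int)
)
print(df_prio)
```

Here 22 is paired with row 1 of df2 (where it is ID1) rather than row 0 (where it is ID2), and df_prio keeps exactly one row per ID in df1.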

CodePudding user response：

You should have a look here (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html); there are many different solutions. For example:

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)


df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    }
)


df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11", "A12"],
        "B": ["B8", "B9", "B10", "B11", "B12"],
        "C": ["C8", "C9", "C10", "C11", "C12"],
        "D": ["D8", "D9", "D10", "D11", "D12"],
    }
)

df = pd.concat([df1, df2, df3], axis=1)

Gives you:

     A    B    C    D    A    B    C    D    A    B    C    D
0   A0   B0   C0   D0   A4   B4   C4   D4   A8   B8   C8   D8
1   A1   B1   C1   D1   A5   B5   C5   D5   A9   B9   C9   D9
2   A2   B2   C2   D2   A6   B6   C6   D6  A10  B10  C10  D10
3   A3   B3   C3   D3   A7   B7   C7   D7  A11  B11  C11  D11
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A12  B12  C12  D12

So, in your case:

df1 = pd.DataFrame(
    {
        "ID": [11, 22, 33, 44]
    }
)


df2 = pd.DataFrame(
    {
        "ID1": [11, 88, 99],
        "ID2": [5, 22, 45],
        "ID3": [114, 18, 33]
    }
)


df = pd.concat([df1, df2], axis=1)
# replace NaNs with the integer 123456 (not the string '123456'),
# then cast back to int because the NaNs turned ID1..ID3 into floats
df = df.fillna(123456).astype(int)

Gives:

   ID     ID1     ID2     ID3
0  11      11       5     114
1  22      88      22      18
2  33      99      45      33
3  44  123456  123456  123456
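
One thing to keep in mind: pd.concat with axis=1 lines rows up by index label, not by comparing ID with ID1, ID2, or ID3. The output above matches the expected result only because the sample rows happen to be in the same order. A small sketch, with the rows of df2 reversed as a hypothetical variation, shows the difference:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [11, 22, 33, 44]})
# Same rows as the question's df2, but in reverse order (hypothetical variation).
df2 = pd.DataFrame({
    "ID1": [99, 88, 11],
    "ID2": [45, 22, 5],
    "ID3": [33, 18, 114],
})

side_by_side = pd.concat([df1, df2], axis=1).fillna(123456).astype(int)
print(side_by_side)
# Row 0 now pairs ID 11 with (99, 45, 33), although 11 is not in that df2 row.
```

If the rows of df2 are not guaranteed to be in matching order, a value-based join such as the stack and merge approach is the safer choice.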