I have two Pandas DataFrames in Python like below:
df1
ID
----
11
22
33
44
df2
ID1 ID2 ID3
--------------------
11 | 5 | 114
88 | 22 | 18
99 | 45 | 33
- df1 has more rows than df2
- the values in both DataFrames are of type int
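For reproducibility, the two frames described above could be built roughly like this. It is only a minimal sketch: the values are copied from the tables above, and plain integer columns are assumed.

```python
import pandas as pd

# Values copied from the df1 / df2 tables above.
df1 = pd.DataFrame({"ID": [11, 22, 33, 44]})
df2 = pd.DataFrame(
    {
        "ID1": [11, 88, 99],
        "ID2": [5, 22, 45],
        "ID3": [114, 18, 33],
    }
)

print(df1.dtypes)
print(df2.dtypes)
```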
I need to do something like df1 LEFT JOIN df2 and merge df1 with df2 using "ID" from df1 and "ID1", "ID2", "ID3" from df2
- merge Data Frames on ID (df1) and ID1 (df2)
- if ID does not merge with ID1 --> merge on ID and ID2
- if ID does not merge with ID2 --> merge on ID and ID3
- fill with 123456 if a row does not merge at all
So as a result I need something like below:
ID ID1 ID2 ID3
--------------------------
11 | 11 | 5 | 114
22 | 88 | 22 | 18
33 | 99 | 45 | 33
44 | 123456 | 123456 | 123456
How can I do that in Python Pandas? I have no idea where to start.
CodePudding user response:
You can stack df2 into a long-form frame df2a, left join df1 with df2a on ID, and then left join the result with the original df2 using the original row index (level_0). Finally, fill the NaNs with 123456 and drop the intermediate columns to arrive at the desired output:
df2a = df2.stack().reset_index(name='ID')
df_out = (df1.merge(df2a, on='ID', how='left')
.merge(df2, left_on='level_0', right_index=True, how='left')
.fillna(123456, downcast='infer')
.drop(['level_0', 'level_1'], axis=1)
)
Or simplify the second .merge to a .join (thanks to @HenryEcker for the suggestion), as follows:
df2a = df2.stack().reset_index(name='ID')
df_out = (df1.merge(df2a, on='ID', how='left')
.join(df2, on='level_0')
.fillna(123456, downcast='infer')
.drop(['level_0', 'level_1'], axis=1)
)
Result:
print(df_out)
ID ID1 ID2 ID3
0 11 11 5 114
1 22 88 22 18
2 33 99 45 33
3 44 123456 123456 123456
Breakdown of the intermediate step:
print(df2a)
level_0 level_1 ID
0 0 ID1 11
1 0 ID2 5
2 0 ID3 114
3 1 ID1 88
4 1 ID2 22
5 1 ID3 18
6 2 ID1 99
7 2 ID2 45
8 2 ID3 33
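As far as I know, the downcast argument of fillna is deprecated in recent pandas releases, so the chain above may warn or fail there. Below is a sketch of the same stack-and-join idea with an explicit integer cast instead. It assumes the df1 and df2 from the question, and it drops the helper columns before casting because level_1 holds column names rather than numbers.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [11, 22, 33, 44]})
df2 = pd.DataFrame({"ID1": [11, 88, 99], "ID2": [5, 22, 45], "ID3": [114, 18, 33]})

# Long form of df2: one row per cell, with its row label (level_0)
# and source column name (level_1).
df2a = df2.stack().reset_index(name="ID")

df_out = (
    df1.merge(df2a, on="ID", how="left")       # find which df2 row holds each ID
       .join(df2, on="level_0")                # pull in that full df2 row
       .drop(columns=["level_0", "level_1"])   # remove helpers before casting
       .fillna(123456)                         # unmatched rows get the sentinel
       .astype(int)                            # restore the integer dtype
)
print(df_out)
```

Note that this, like the chain above, returns one output row per match, so an ID that occurs in several df2 rows would be duplicated in the result.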
CodePudding user response:
You should have a look at the pandas merging guide (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html); it covers many different solutions. For example:
import pandas as pd
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
df2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"D": ["D4", "D5", "D6", "D7"],
}
)
df3 = pd.DataFrame(
{
"A": ["A8", "A9", "A10", "A11", "A12"],
"B": ["B8", "B9", "B10", "B11", "B12"],
"C": ["C8", "C9", "C10", "C11", "C12"],
"D": ["D8", "D9", "D10", "D11", "D12"],
}
)
df = pd.concat([df1, df2, df3], axis=1)
Gives you:
A B C D A B C D A B C D
0 A0 B0 C0 D0 A4 B4 C4 D4 A8 B8 C8 D8
1 A1 B1 C1 D1 A5 B5 C5 D5 A9 B9 C9 D9
2 A2 B2 C2 D2 A6 B6 C6 D6 A10 B10 C10 D10
3 A3 B3 C3 D3 A7 B7 C7 D7 A11 B11 C11 D11
4 NaN NaN NaN NaN NaN NaN NaN NaN A12 B12 C12 D12
So, in your case:
df1 = pd.DataFrame(
{
"ID": [11, 22, 33, 44]
}
)
df2 = pd.DataFrame(
{
"ID1": [11, 88, 99],
"ID2": [5, 22, 45],
"ID3": [114, 18, 33]
}
)
df = pd.concat([df1, df2], axis=1)
df = df.fillna(123456).astype(int) # replace NaNs with the value you want and restore the int dtype
Gives:
ID ID1 ID2 ID3
0 11 11 5 114
1 22 88 22 18
2 33 99 45 33
3 44 123456 123456 123456
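One caveat: pd.concat with axis=1 aligns rows by index, not by the values in ID, so it reproduces the expected table here only because the rows of df1 and df2 happen to line up. If the match has to follow the rule from the question (ID1 first, then ID2, then ID3), something along these lines might work. It is a sketch, not a definitive implementation: merge_by_priority is a hypothetical helper name, and it assumes df2 has a plain single-level index and no columns called "ID" or "row".

```python
import pandas as pd


def merge_by_priority(df1, df2, fill=123456):
    """Hypothetical helper: match df1["ID"] against df2's columns in column order."""
    # Long table: one row per df2 cell, with its row label and source column.
    long = df2.stack().reset_index(name="ID")
    long.columns = ["row", "col", "ID"]
    # Lower rank means higher priority (ID1 before ID2 before ID3).
    rank = {col: i for i, col in enumerate(df2.columns)}
    long["rank"] = long["col"].map(rank)
    # Keep only the highest-priority match for each value.
    best = long.sort_values(["rank", "row"]).drop_duplicates("ID")
    # Attach the full df2 row to every matched value.
    lookup = (
        best[["ID", "row"]]
        .merge(df2, left_on="row", right_index=True)
        .drop(columns="row")
    )
    # The left join keeps every df1 row; unmatched rows get the fill value.
    return df1.merge(lookup, on="ID", how="left").fillna(fill).astype(int)


df1 = pd.DataFrame({"ID": [11, 22, 33, 44]})
df2 = pd.DataFrame({"ID1": [11, 88, 99], "ID2": [5, 22, 45], "ID3": [114, 18, 33]})

result = merge_by_priority(df1, df2)
print(result)

# Reversing df2's row order does not change the matches,
# whereas a positional concat would pair up different rows.
reversed_result = merge_by_priority(df1, df2.iloc[::-1])
```

Because each ID keeps only its best-ranked match, the output always has exactly one row per row of df1.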
