Given two dataframes in which two keys share a centralised value in another dataframe as follows:
| manager_id | employee_id |
|---|---|
| 1 | 5 |
| 4 | 3 |
| 4 | 2 |
| 7 | 9 |
| employee_id | location |
|---|---|
| 1 | US |
| 2 | France |
| 3 | UK |
| 4 | Russia |
| 5 | Mexico |
| 6 | Spain |
| 7 | Italy |
| 8 | Greece |
| 9 | Hungary |
how to merge with pandas the key in the second dataframe to the other two in the first one so it ends up as follows:
| manager_id | employee_id | manager_location | employee_location |
|---|---|---|---|
| 1 | 5 | US | Mexico |
| 4 | 3 | Russia | UK |
| 4 | 2 | Russia | France |
| 7 | 9 | Italy | Hungary |
to be clear, both manager_id and employee_id share the same id key in the location table, even if they have distinct names.
Thanks in advance.
CodePudding user response:
You can map ids to locations by converting the DataFrame with location data df2 into a Series where indices are employee_ids and values are locations:
id_to_location_mapper = df2.set_index('employee_id')['location']
df1['manager_location'] = df1['manager_id'].map(id_to_location_mapper)
df1['employee_location'] = df1['employee_id'].map(id_to_location_mapper)
You can also unstack, map, unstack and join:
df1 = df1.join(df1.unstack().map(df2.set_index('employee_id')['location']).swaplevel().unstack().add_suffix('_location'))
Output:
manager_id employee_id manager_location employee_location
0 1 5 US Mexico
1 4 3 Russia UK
2 4 2 Russia France
3 7 9 Italy Hungary
CodePudding user response:
You can use left join like SQL
df = pd.DataFrame({
'manager_id' : [1,4,4,7],
'employee_id':[5,3,2,9]
})
df
manager_id employee_id
0 1 5
1 4 3
2 4 2
3 7 9
df2 = pd.DataFrame({
'employee_id' : [1,2,3,4,5,6,7,8,9],
'location':['US','FR','UK','RU','MX','ES','IT','GR','HU']
})
df2
employee_id location
0 1 US
1 2 FR
2 3 UK
3 4 RU
4 5 MX
5 6 ES
6 7 IT
7 8 GR
8 9 HU
Left join with pandas dataframes
df_result = df.merge(df2, on=['employee_id'], how='left').merge(df2,left_on=['manager_id'],right_on=['employee_id'], how='left')
df_result = df_result.drop('employee_id_y', 1)
df_result
Result
manager_id employee_id_x location_x location_y
0 1 5 MX US
1 4 3 UK RU
2 4 2 FR RU
3 7 9 HU IT
