I have two data frames as shown below:
| A | B | C | D |
|---|---|---|---|
| Red | 36 | 1 | type-1 |
| Blue | 78 | 2 | type-1 |
| Green | 59 | 3 | type-1 |
| A | B | C | D |
|---|---|---|---|
| Orange | 78 | 5 | type-2 |
| Purple | 59 | 7 | type-2 |
| Brown | 36 | 9 | type-2 |
I want to merge the above two data frames on the basis of column B and after merge I want to keep the same columns as shown below:
| A | B | C | D | A | B | C | D |
|---|---|---|---|---|---|---|---|
| Red | 36 | 1 | type-1 | Brown | 36 | 9 | type-2 |
| Blue | 78 | 2 | type-1 | Orange | 78 | 5 | type-2 |
| Green | 59 | 3 | type-1 | Purple | 59 | 7 | type-2 |
Is it possible to do this using pandas or any other python function?
I have tried using pd.merge function but I needed to change the column names. There exists another function called pd.concat but can I provide the column name (column 'B') in it for merging?
Thanks a lot in advance!
CodePudding user response:
apply rename to jezrael's anwer and you will get desired output
out = (df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
.rename(columns=lambda x: x.split('_')[0]))
out
A B C D A B C D
0 Red 36 1 type-1 Brown 36 9 type-2
1 Blue 78 2 type-1 Orange 78 5 type-2
2 Green 59 3 type-1 Purple 59 7 type-2
CodePudding user response:
You can pass to parameters left_on and right_on columns from both DataFrames, so is created helper column key_0, which is removed after join by DataFrame.merge:
Notice: Pandas has problem with duplicated columns names, it is reason why merge rename them by suffix _x and _y
df = df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
print (df)
A_x B_x C_x D_x A_y B_y C_y D_y
0 Red 36 1 type-1 Brown 36 9 type-2
1 Blue 78 2 type-1 Orange 78 5 type-2
2 Green 59 3 type-1 Purple 59 7 type-2
What is problem with same columns names:
If need select column first A expected ouput is get Series.
print (df.A_x)
0 Red
1 Blue
2 Green
Name: A_x, dtype: object
But if duplicated names get all columns in DataFrame, DONT DO IT:
df = df.rename(columns=lambda x: x.split('_')[0])
# print (df)
print (df.A)
A A
0 Red Brown
1 Blue Orange
2 Green Purple
