How to merge two data frames having same column names horizontally on basis of similar values in one-CodePudding

I have two data frames as shown below:

A	B	C	D
Red	36	1	type-1
Blue	78	2	type-1
Green	59	3	type-1

A	B	C	D
Orange	78	5	type-2
Purple	59	7	type-2
Brown	36	9	type-2

I want to merge the above two data frames on the basis of column B and after merge I want to keep the same columns as shown below:

A	B	C	D	A	B	C	D
Red	36	1	type-1	Brown	36	9	type-2
Blue	78	2	type-1	Orange	78	5	type-2
Green	59	3	type-1	Purple	59	7	type-2

Is it possible to do this using pandas or any other python function?

I have tried using pd.merge function but I needed to change the column names. There exists another function called pd.concat but can I provide the column name (column 'B') in it for merging?

Thanks a lot in advance!

CodePudding user response：

apply rename to jezrael's anwer and you will get desired output

out = (df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
       .rename(columns=lambda x: x.split('_')[0]))

out

    A       B   C   D       A       B   C   D
0   Red     36  1   type-1  Brown   36  9   type-2
1   Blue    78  2   type-1  Orange  78  5   type-2
2   Green   59  3   type-1  Purple  59  7   type-2

CodePudding user response：

You can pass to parameters left_on and right_on columns from both DataFrames, so is created helper column key_0, which is removed after join by DataFrame.merge:

Notice: Pandas has problem with duplicated columns names, it is reason why merge rename them by suffix _x and _y

df = df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
print (df)
     A_x  B_x  C_x     D_x     A_y  B_y  C_y     D_y
0    Red   36    1  type-1   Brown   36    9  type-2
1   Blue   78    2  type-1  Orange   78    5  type-2
2  Green   59    3  type-1  Purple   59    7  type-2

What is problem with same columns names:

If need select column first A expected ouput is get Series.

print (df.A_x)
0      Red
1     Blue
2    Green
Name: A_x, dtype: object

But if duplicated names get all columns in DataFrame, DONT DO IT:

df = df.rename(columns=lambda x: x.split('_')[0])
# print (df)

print (df.A)
       A       A
0    Red   Brown
1   Blue  Orange
2  Green  Purple