I have two dataframes (excel sheets) and I want to select only certain columns from a resulting pd.merge
df1 = pd.read_excel('Receivables.xlsx')
#the sheet looks like the table below
| ID | Dollar Type | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|---|
| 1 | Receivables | 10 | 20 | 30 | 44 |
| 2 | Receivables | 11 | 21 | 41 | 54 |
df2 = pd.read_excel('Payables.xlsx')
#the sheet looks like the table below
| Payable_ID | Dollar Type | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|---|
| 1 | Payables | 10 | -20 | -30 | -44 |
| 2 | Payables | 11 | -21 | -41 | -54 |
My merge then looks like this
leftJoin = df1.merge(df2, left_on=['ID'], right_on=['Payable_ID'], how='left')
How do I only select columns Dollar Type, Q1, Q2 , Q3 , Q4 from df1 and select Payable_ID, Q1, Q2 from df2. The problem I'm running into is that the duplicate columns are appended with either _x or _y. While I get why it's doing that, is there a way to turn off this appending?
In SQL, there's no need for me to specify, for example, df1.[Dollar Type_x] or df2.[Q1_y]
SELECT DF1.[DOLLAR TYPE],
DF1.[Q1],
DF1.[Q2],
DF1.[Q3],
DF1.[Q4],
DF2.[PAYABLE_ID],
DF2.[Q1],
DF2.[Q2]
FROM DF1 LEFT JOIN DF2 ON DF1.ID = DF.PAYABLE_ID
CodePudding user response:
You can pass only selected columns of the DataFrames to merge:
leftJoin = df1[['ID','Dollar Type','Q1','Q2','Q3','Q4']].merge(df2[['Payable_ID','Q1','Q2']], left_on=['ID'], right_on=['Payable_ID'], how='left')
Now, you have undesired 'ID' column in the merged DataFrame which you can drop using
leftJoin.drop(columns=['ID'], inplace=True)
You can also make the drop inline with the merge:
leftJoin = df1[['ID','Dollar Type','Q1','Q2','Q3','Q4']].merge(df2[['Payable_ID','Q1','Q2']], left_on=['ID'], right_on=['Payable_ID'], how='left').drop(columns=['ID'], inplace=True)
To change the names of the columns, you can use rename:
leftJoin.rename(columns={'Q1_x':'Q1', 'Q2_x':'Q2','Q1_y':'Q5','Q2_y':'Q6'}, inplace=True)
Which again can be done in the same line:
leftJoin = df1[['ID','Dollar Type','Q1','Q2','Q3','Q4']].merge(df2[['Payable_ID','Q1','Q2']], left_on=['ID'], right_on=['Payable_ID'], how='left').drop(columns=['ID'], inplace=True).rename(columns={'Q1_x':'Q1', 'Q2_x':'Q2','Q1_y':'Q5','Q2_y':'Q6'}, inplace=True)
