How to update a single column in a Pandas df with data from a second Pandas df using pd.merge functi-CodePudding

Please see my code below. I was stuck on this problem for a while and so thought I would share the solution I found in case it helps someone else.

The problem (now solved; see solution below) is as follows:

I have a "master" df, called df_master, that holds a large amount of information.
The data in column 'c' of df_master contains pricing data that must sometimes be updated with data found in column 'c' of a second, smaller df. I call this smaller df 'df_updates'.
df_updates contains pricing updates for only a small portion of the rows in df_master, meaning that only a few updates to column 'c' of df_master need to be made.
The column names in df_updates are identical to the column names in df_master.
I wish to update the pricing data in col 'c' of df_master with the information in col 'c' of df_updates.

The solution I found is in the answer below. Hopefully this helps someone facing this same problem.

CodePudding user response：

Please find below the code for my solution to this issue. I've inserted explanatory comments in the code.

import pandas as pd
import numpy as np

# Create the df_master, which will be updated by the df_updates dataframe.
df_master = pd.DataFrame(np.array([[1, 11, 111], [2, 22, 222], [3, 33, 333], [4, 44, 444]]),
                   columns=['a', 'b', 'c'])

# Create df_updates, which contains updates that need to be made to df_master.
df_updates = pd.DataFrame(np.array([[1, 11, 500], [2, 22, 600]]),
                   columns=['a', 'b', 'c'])

# View dfs prior to pd.merge function.
print('df_master')
display(df_master)
print('')
print('df_updates')
display(df_updates)

# Merge df_master and df_updates using all columns other than the column that is to be updated.
df_master_updated = pd.merge(df_master, df_updates, left_on=['a', 'b'], right_on=['a', 'b'], how='left', indicator=False)

# Display df_master_updated to show the unwanted columns 'c_x' and 'c_y' that need to be merged 
# into new col and then deleted.  
print('')
print('df_master_updated')
display(df_master_updated)

# Calculate a new column 'c' by using the values from df_updates, unless df_updates
# has NaN value, in which case we use .fillna() to pull value from column 'c' of df_master.
df_master_updated['c'] = df_master_updated['c_y'].fillna(df_master_updated['c_x'])

# Drop cols 'c_x' and 'c_y'
df_master_updated = df_master_updated.drop(['c_x', 'c_y'], axis=1)

# Display the final updated df_master_updated, where all col 'c' values have been udpated.
print('')
print('df_master_updated')
display(df_master_updated)

CodePudding user response：

It is unclear from your question how you find the rows that needs to be updated, but if df_master and df_updates matches on the index, it is as simple as using .loc and an assignment:

# Build the datasets with dummy data
df_master = pd.DataFrame({'c': range(20)})
# Prices are changed for every other row from row 1 up to (but not including) row 15
df_updates = df_master[1:15:2] * 20

# update values that matches on the index of df_updates

df_master.loc[df_updates.index, 'c'] = df_updates['c']