Home > Software engineering >  How to update a single column in a Pandas df with data from a second Pandas df using pd.merge functi
How to update a single column in a Pandas df with data from a second Pandas df using pd.merge functi

Time:02-10

Please see my code below. I was stuck on this problem for a while and so thought I would share the solution I found in case it helps someone else.

The problem (now solved; see solution below) is as follows:

  • I have a "master" df, called df_master, that holds a large amount of information.
  • The data in column 'c' of df_master contains pricing data that must sometimes be updated with data found in column 'c' of a second, smaller df. I call this smaller df 'df_updates'.
  • df_updates contains pricing updates for only a small portion of the rows in df_master, meaning that only a few updates to column 'c' of df_master need to be made.
  • The column names in df_updates are identical to the column names in df_master.
  • I wish to update the pricing data in col 'c' of df_master with the information in col 'c' of df_updates.

The solution I found is in the answer below. Hopefully this helps someone facing this same problem.

CodePudding user response:

Please find below the code for my solution to this issue. I've inserted explanatory comments in the code.

import pandas as pd
import numpy as np

# Create the df_master, which will be updated by the df_updates dataframe.
df_master = pd.DataFrame(np.array([[1, 11, 111], [2, 22, 222], [3, 33, 333], [4, 44, 444]]),
                   columns=['a', 'b', 'c'])

# Create df_updates, which contains updates that need to be made to df_master.
df_updates = pd.DataFrame(np.array([[1, 11, 500], [2, 22, 600]]),
                   columns=['a', 'b', 'c'])

# View dfs prior to pd.merge function.
print('df_master')
display(df_master)
print('')
print('df_updates')
display(df_updates)

# Merge df_master and df_updates using all columns other than the column that is to be updated.
df_master_updated = pd.merge(df_master, df_updates, left_on=['a', 'b'], right_on=['a', 'b'], how='left', indicator=False)

# Display df_master_updated to show the unwanted columns 'c_x' and 'c_y' that need to be merged 
# into new col and then deleted.  
print('')
print('df_master_updated')
display(df_master_updated)

# Calculate a new column 'c' by using the values from df_updates, unless df_updates
# has NaN value, in which case we use .fillna() to pull value from column 'c' of df_master.
df_master_updated['c'] = df_master_updated['c_y'].fillna(df_master_updated['c_x'])

# Drop cols 'c_x' and 'c_y'
df_master_updated = df_master_updated.drop(['c_x', 'c_y'], axis=1)

# Display the final updated df_master_updated, where all col 'c' values have been udpated.
print('')
print('df_master_updated')
display(df_master_updated)

CodePudding user response:

It is unclear from your question how you find the rows that needs to be updated, but if df_master and df_updates matches on the index, it is as simple as using .loc and an assignment:

# Build the datasets with dummy data
df_master = pd.DataFrame({'c': range(20)})
# Prices are changed for every other row from row 1 up to (but not including) row 15
df_updates = df_master[1:15:2] * 20 
# update values that matches on the index of df_updates

df_master.loc[df_updates.index, 'c'] = df_updates['c']
    c
0   0
1   20
2   2
3   60
4   4
5   100
6   6
7   140
8   8
9   180
10  10
11  220
12  12
13  260
14  14
15  15
16  16
17  17
18  18
19  19
  •  Tags:  
  • Related