Please see my code below. I was stuck on this problem for a while and so thought I would share the solution I found in case it helps someone else.
The problem (now solved; see solution below) is as follows:
- I have a "master" df, called df_master, that holds a large amount of information.
- The data in column 'c' of df_master contains pricing data that must sometimes be updated with data found in column 'c' of a second, smaller df. I call this smaller df 'df_updates'.
- df_updates contains pricing updates for only a small portion of the rows in df_master, meaning that only a few updates to column 'c' of df_master need to be made.
- The column names in df_updates are identical to the column names in df_master.
- I wish to update the pricing data in col 'c' of df_master with the information in col 'c' of df_updates.
The solution I found is in the answer below. Hopefully this helps someone facing this same problem.
CodePudding user response:
Please find below the code for my solution to this issue. I've inserted explanatory comments in the code.
import pandas as pd
import numpy as np
# Create the df_master, which will be updated by the df_updates dataframe.
df_master = pd.DataFrame(np.array([[1, 11, 111], [2, 22, 222], [3, 33, 333], [4, 44, 444]]),
columns=['a', 'b', 'c'])
# Create df_updates, which contains updates that need to be made to df_master.
df_updates = pd.DataFrame(np.array([[1, 11, 500], [2, 22, 600]]),
columns=['a', 'b', 'c'])
# View dfs prior to pd.merge function.
print('df_master')
display(df_master)
print('')
print('df_updates')
display(df_updates)
# Merge df_master and df_updates using all columns other than the column that is to be updated.
df_master_updated = pd.merge(df_master, df_updates, left_on=['a', 'b'], right_on=['a', 'b'], how='left', indicator=False)
# Display df_master_updated to show the unwanted columns 'c_x' and 'c_y' that need to be merged
# into new col and then deleted.
print('')
print('df_master_updated')
display(df_master_updated)
# Calculate a new column 'c' by using the values from df_updates, unless df_updates
# has NaN value, in which case we use .fillna() to pull value from column 'c' of df_master.
df_master_updated['c'] = df_master_updated['c_y'].fillna(df_master_updated['c_x'])
# Drop cols 'c_x' and 'c_y'
df_master_updated = df_master_updated.drop(['c_x', 'c_y'], axis=1)
# Display the final updated df_master_updated, where all col 'c' values have been udpated.
print('')
print('df_master_updated')
display(df_master_updated)
CodePudding user response:
It is unclear from your question how you find the rows that needs to be updated, but if df_master and df_updates matches on the index, it is as simple as using .loc and an assignment:
# Build the datasets with dummy data
df_master = pd.DataFrame({'c': range(20)})
# Prices are changed for every other row from row 1 up to (but not including) row 15
df_updates = df_master[1:15:2] * 20
# update values that matches on the index of df_updates
df_master.loc[df_updates.index, 'c'] = df_updates['c']
c
0 0
1 20
2 2
3 60
4 4
5 100
6 6
7 140
8 8
9 180
10 10
11 220
12 12
13 260
14 14
15 15
16 16
17 17
18 18
19 19
