I need to check what has changed over time in two datasets with different date:
Prior date:
Date ID Value Category Subcategory
30-Nov 0001 100.00 A A100
30-Nov 0002 200.00 B B120
30-Nov 0003 300.00 C C300
30-Nov 0004 450.00 D D900
30-Nov 0005 500.00 D D900
Current date:
Date ID Value Category Subcategory
31-Dec 0001 100.00 A A100
31-Dec 0002 200.00 B B101
31-Dec 0003 300.00 C C300
31-Dec 0004 400.00 E E900
31-Dec 0006 600.00 D D900
Now I need to create 4 dataframes:
- Changes in Value:
Date ID Value Category Subcategory Prior Value
31-Dec 0004 400.00 E E900 450.00
- Changes in Category:
Date ID Value Category Subcategory Prior Category
31-Dec 0004 400.00 E E900 D
- Changes in Subcategory, but only if category didn't change:
Date ID Value Category Subcategory Prior Subcategory
31-Dec 0002 200.00 B B101 B120
- Items population change:
Date ID Value Category Subcategory
31-Dec 0006 600.00 D D900
30-Nov 0005 500.00 D D900
I think I should first run the population check and exclude those breaks, so I will have only two datasets with identical ID sets. I will follow the examples from here: Comparing two dataframes and getting the differences
For comparing 1to1 values I found a numpy piece of code but it compares it by default index, not by ID, how to do it using my ID column as record identifier? This is going to be a large dataset and I can't base it on default index.
value_df = current_df
value_df['prior value'] = np.where(prior_df['Value'] == current_df['Value'], 'Match', prior_df['Value'])
value_df = value_df[value_df['prior value'] != 'Match']
For multiple conditions do I have to filter it out step by step (first filter out category change, then filter subcategory change) or can I use AND to concatenate the conditions?
below is the code for creating dataframes:
prior_data = {'Date': ['30-Nov','30-Nov','30-Nov','30-Nov', '30-Nov'],
'ID': ['0001','0002','0003','0004', '0005'],
'Value' : [100.00, 200.00, 300.00, 450.00, 500.00],
'Category' : ['A','B','C','D','D'],
'Subcategory' : ['A100','B120','C300','D900','D900']}
current_data = {'Date': ['31-Dec','31-Dec','31-Dec','31-Dec','31-Dec'],
'ID': ['0001','0002','0003','0004', '0006'],
'Value' : [100.00, 200.00, 300.00, 400.00, 600.00],
'Category' : ['A','B','C','E','D'],
'Subcategory' : ['A100','B101','C300','E900','D900']}
prior_df = pd.DataFrame(prior_data)
current_df = pd.DataFrame(current_data)
CodePudding user response:
I am not sure if it is the fastest possible solution, but this problem seems to call for pd.merge. As you say, let's first deal with things that are in one dataframe but not the other:
def get_only_left(df1, df2):
left_merge = pd.merge(df1, df2, on='ID', suffixes=('', '_other'), how='left')
added_columns = [c '_other' for c in df1.columns if c != 'ID']
mask = left_merge.loc[:, added_columns].isna().all(axis=1)
return left_merge[mask].drop(added_columns, axis=1)
pd.concat([get_only_left(prior_df, current_df), get_only_left(current_df, prior_df)])
This gives
Date ID Value Category Subcategory
4 30-Nov 0005 500.0 D D900
4 31-Dec 0006 600.0 D D900
Then, let's deal with properly changing values.
columns = list(current_df.columns)
df = pd.merge(current_df, prior_df, on='ID', suffixes=('', '_prior'), how='inner')
mask = df['Value'] != df['Value_prior']
df[mask].loc[:, columns ['Value_prior']]
This gives
Date ID Value Category Subcategory Value_prior
3 31-Dec 0004 400.0 E E900 450.0
Then similarly:
mask = df['Category'] != df['Category_prior']
df[mask].loc[:, columns ['Category_prior']]
gives
Date ID Value Category Subcategory Category_prior
3 31-Dec 0004 400.0 E E900 D
And finally
import numpy as np
mask = np.logical_and(df['Category'] == df['Category_prior'], df['Subcategory'] != df['Subcategory_prior'])
df[mask].loc[:, columns ['Subcategory_prior']]
gives
Date ID Value Category Subcategory Subcategory_prior
1 31-Dec 0002 200.0 B B101 B120
