Home > Software design >  How to show differences from two pandas dataframes of different sizes
How to show differences from two pandas dataframes of different sizes

Time:01-08

If I have two dataframes that look like:

current_month
| Product | Revenue | Expense | Profit | PaymentFrequency | Customer |
| ------- | ------- | ------- | ------ | ---------------- | -------- |
| A       | 100     | 100     | 0      | Monthly          | Cust1    |
| B       | 200     | 150     | 50     | Monthly          | Cust2    |
| C       | 90      | 80      | 10     | Monthly          | Cust3    |

previous_month
| Product | Revenue | Expense | Profit | PaymentFrequency | Customer |
| ------- | ------- | ------- | ------ | ---------------- | -------- |
| A       | 120     | 120     | 0      | Monthly          | Cust1    |
| B       | 250     | 175     | 75     | Monthly          | Cust1    |

For each product I would like to have a table of just the differences:

Product A
| month | Revenue | Expense |
| ----- | ------- | ------- |
| current_month | 100 | 100 |
| previous_month | 120 | 120 |

Product B
| month | Revenue | Expense | Profit | Customer |
| ----- | ------- | ------- | ------ | -------- |
| current_month | 200 | 150 | 50 | Cust2 |
| previous_month | 250 | 175 | 75 | Cust1 |

Product C
| month | Revenue | Expense | Profit | PaymentFrequency | Customer |
| ----- | ------- | ------- | ------ | ---------------- | -------- |
| current_month | 90      | 80      | 10     | Monthly          | Cust3    |
| previous_month | NaN | NaN | NaN | NaN | NaN |

I've been able to identify the differences using a for loop and .loc. However, I am struggling to get the desired output.

for product in list(current_month.index):
    for field in list(current_month.columns):
        try:
            if current_month[field].loc[product] != previous_month[field].loc[product]:
                print(f'field: {field}')
                print(f'product: {product}')
                print(f'new value: {current_month[field].loc[product]}')
                print(f'old value: {previous_month[field].loc[product]}') 
        except KeyError:
            print(f'field: {field}')
            print(f'product: {product}')
            print(f'new value: {current_month[field].loc[product]}')
            print(f'NaN')

CodePudding user response:

(i) First merge the dataframes and stack them; this will create a MultiIndex pd.Series object df_m.

(ii) Rename the MultiIndex, sort by it and unstack.

(iii) Filter for products (which is the first level of the MultiIndex), transpose the dataframe and use drop_duplicates if a value is repeated across two months.

df_m = df1.merge(df2, on='Product', how='outer', suffixes=(' current', ' previous')).set_index('Product').stack()
df_m.index = pd.MultiIndex.from_tuples([(i,) tuple(j.split()) for i,j in df_m.index])
df_m = df_m.sort_index().unstack()


out = [(df_m[df_m.index.get_level_values(0) == product]
        .T
        .replace(np.nan,'NaN')
        .apply(lambda x: x.drop_duplicates(keep=False), axis=0)
        .dropna(axis=1)
        .replace('NaN',np.nan)) 
       for product in ['A','B','C']]
productA, productB, productC = out

Output:

               A        
         Expense Revenue
current      100     100
previous   120.0   120.0

                B                       
         Customer Expense Profit Revenue
current     Cust2     150     50     200
previous    Cust1   175.0   75.0   250.0

                C                                        
         Customer Expense PaymentFrequency Profit Revenue
current     Cust3    80.0          Monthly   10.0    90.0
previous      NaN     NaN              NaN    NaN     NaN
  •  Tags:  
  • Related