I would like to compare two pandas DataFrames for differences, using only the columns of dtype bool.
However, with the code below I get the following error, even though both DataFrames are indexed on customer_id:
ValueError: Can only compare identically-labeled DataFrame objects
BOOL_FIELDS = ['is_mobile','is_desktop','is_cancelled','is_existing_customer']
temp_df = pd.DataFrame()
customer_df_2020.set_index('customer_id',inplace=True)
customer_df_2021.set_index('customer_id',inplace=True)
temp_df['sort'] = (customer_df_2020[BOOL_FIELDS] != customer_df_2021[BOOL_FIELDS])
customer_df_2020
customer_id is_mobile is_desktop is_cancelled is_existing_customer
30293 TRUE FALSE FALSE TRUE
28313 FALSE TRUE FALSE TRUE
19313 FALSE TRUE FALSE TRUE
customer_df_2021
customer_id is_mobile is_desktop is_cancelled is_existing_customer
30293 FALSE TRUE TRUE FALSE
28313 FALSE TRUE FALSE TRUE
19313 FALSE TRUE TRUE FALSE
CodePudding user response:
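It seems the two indexes are not identical (some customer_id values differ, or they appear in a different order). You can select only the labels present in both DataFrames with Index.intersection: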
BOOL_FIELDS = ['is_mobile','is_desktop','is_cancelled','is_existing_customer']
customer_df_2020.set_index('customer_id',inplace=True)
customer_df_2021.set_index('customer_id',inplace=True)
sameidx = customer_df_2020.index.intersection(customer_df_2021.index)
temp_df = (customer_df_2020.loc[sameidx, BOOL_FIELDS] !=
           customer_df_2021.loc[sameidx, BOOL_FIELDS])
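
To check whether mismatched labels really are the cause, here is a minimal, self-contained sketch. The sample frames are made up for illustration (the 2021 frame has its rows in a different order and one extra id, 55555, which is not in your data), and the names only_in_one, diff and changed_ids are just placeholders I introduced:

```python
import pandas as pd

BOOL_FIELDS = ['is_mobile', 'is_desktop', 'is_cancelled', 'is_existing_customer']

# Hypothetical sample data: same columns as in the question.
customer_df_2020 = pd.DataFrame({
    'customer_id': [30293, 28313, 19313],
    'is_mobile': [True, False, False],
    'is_desktop': [False, True, True],
    'is_cancelled': [False, False, False],
    'is_existing_customer': [True, True, True],
}).set_index('customer_id')

# Assumption: 2021 has a different row order and one extra customer (55555).
customer_df_2021 = pd.DataFrame({
    'customer_id': [28313, 30293, 19313, 55555],
    'is_mobile': [False, False, False, True],
    'is_desktop': [True, True, True, False],
    'is_cancelled': [False, True, True, False],
    'is_existing_customer': [True, False, False, True],
}).set_index('customer_id')

# Comparing directly fails because the row labels are not identical.
try:
    customer_df_2020[BOOL_FIELDS] != customer_df_2021[BOOL_FIELDS]
    raised = False
except ValueError:
    raised = True

# Labels that exist in only one of the two frames (useful for diagnosis).
only_in_one = customer_df_2020.index.symmetric_difference(customer_df_2021.index)

# Restrict both frames to the shared labels; .loc puts them in the same order.
sameidx = customer_df_2020.index.intersection(customer_df_2021.index)
diff = (customer_df_2020.loc[sameidx, BOOL_FIELDS] !=
        customer_df_2021.loc[sameidx, BOOL_FIELDS])

# Customers with at least one changed boolean flag.
changed_ids = diff.index[diff.any(axis=1)].tolist()
```

If only_in_one is empty for your real data, the labels are probably the same but ordered differently, in which case selecting with .loc[sameidx, ...] on both sides as above should still line the rows up. Note that this assumes customer_id is unique in each frame.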
