Compare two DataFrames for differences but getting 'Can only compare identically-labeled DataFr

Time:02-05

I would like to compare two pandas DataFrames for differences, using only the columns of dtype bool.

However, with the code below I get the following error, even though both DataFrames have the same index at the customer_id level:

ValueError: Can only compare identically-labeled DataFrame objects

BOOL_FIELDS = ['is_mobile','is_desktop','is_cancelled','is_existing_customer']

temp_df = pd.DataFrame()
customer_df_2020.set_index('customer_id',inplace=True)
customer_df_2021.set_index('customer_id',inplace=True)

temp_df['sort'] = (customer_df_2020[BOOL_FIELDS] != customer_df_2021[BOOL_FIELDS])

customer_df_2020

customer_id   is_mobile  is_desktop  is_cancelled  is_existing_customer
30293          TRUE      FALSE       FALSE         TRUE
28313          FALSE     TRUE        FALSE         TRUE
19313          FALSE     TRUE        FALSE         TRUE

customer_df_2021

customer_id   is_mobile  is_desktop  is_cancelled  is_existing_customer
30293          FALSE     TRUE        TRUE          FALSE
28313          FALSE     TRUE        FALSE         TRUE
19313          FALSE     TRUE        TRUE          FALSE 

CodePudding user response:

It seems the two indexes are not identical. You can select only the labels present in both DataFrames with Index.intersection:

BOOL_FIELDS = ['is_mobile','is_desktop','is_cancelled','is_existing_customer']

customer_df_2020.set_index('customer_id',inplace=True)
customer_df_2021.set_index('customer_id',inplace=True)

sameidx = customer_df_2020.index.intersection(customer_df_2021.index)

temp_df  = (customer_df_2020.loc[sameidx, BOOL_FIELDS] != 
            customer_df_2021.loc[sameidx, BOOL_FIELDS])
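
To see the approach end to end, here is a minimal sketch built on the sample rows from the question. The extra customer 99999 in the 2021 data is a made-up row, added only so that the two indexes actually differ and the error can be reproduced. The names only_in_one, raised and changed_ids are illustrative and not part of the original code.

```python
import pandas as pd

BOOL_FIELDS = ['is_mobile', 'is_desktop', 'is_cancelled', 'is_existing_customer']

# Sample rows from the question.
customer_df_2020 = pd.DataFrame({
    'customer_id': [30293, 28313, 19313],
    'is_mobile': [True, False, False],
    'is_desktop': [False, True, True],
    'is_cancelled': [False, False, False],
    'is_existing_customer': [True, True, True],
}).set_index('customer_id')

# Same customers plus one hypothetical extra row (99999), so the indexes differ.
customer_df_2021 = pd.DataFrame({
    'customer_id': [30293, 28313, 19313, 99999],
    'is_mobile': [False, False, False, True],
    'is_desktop': [True, True, True, False],
    'is_cancelled': [True, False, True, False],
    'is_existing_customer': [False, True, False, True],
}).set_index('customer_id')

# Labels that appear in only one of the two DataFrames.
only_in_one = customer_df_2020.index.symmetric_difference(customer_df_2021.index)

# A direct comparison fails because the indexes are not identical.
try:
    customer_df_2020[BOOL_FIELDS] != customer_df_2021[BOOL_FIELDS]
    raised = False
except ValueError:
    raised = True

# Restrict both DataFrames to the shared labels, then compare element-wise.
sameidx = customer_df_2020.index.intersection(customer_df_2021.index)
temp_df = (customer_df_2020.loc[sameidx, BOOL_FIELDS] !=
           customer_df_2021.loc[sameidx, BOOL_FIELDS])

# Customers with at least one changed boolean field.
changed_ids = temp_df.index[temp_df.any(axis=1)].tolist()
```

Checking only_in_one first is a quick way to confirm whether mismatched customer_id values are really the cause. Note that temp_df is a full boolean DataFrame (one column per field), so it is assigned directly rather than to a single column such as temp_df['sort'].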