Pandas substraction in intervals for column pairs-CodePudding

I would like to calculate the difference amount of total_spent for each customer_id by taking the difference for each two column pairs. In this case subtracting total_spent_2020 from total_spent_2021 but in a more dynamic way. In numbers it would be substracting: column_3 from column_2 and column_5 from column_4 and so on.

I was thinking of taking the length of the df and add a range with a 2 interval but I am struggling to do the subtraction based on that interval number.

sample_df

customer_id  total_spent_2021 total_spent_2020  total_spent_2019 total_spent_2018
9391         120              100               80               90
2932         90               110               60               50
4294         150              130               40               50

ouput_df

customer_id  total_spent_2021_2020_diff  total_spent_2019_2018_diff
9391         20                           -10
2932        -20                            10
4294         20                           -10

CodePudding user response：

Create groups in array a by integer division by 2, create new column names by extract numbers and join and aggregate DataFrameGroupBy.diff with remove only NaNs columns with new columns names by s:

df = df.set_index('customer_id')
a = np.arange(len(df.columns)) // 2

s = (df.columns
       .to_series()
       .str.extract('(\d )', expand=False)
       .groupby(a)
       .agg('_'.join)
       .radd('total_spent_')
       .add('_diff'))
print (s)
0    total_spent_2021_2020_diff
1    total_spent_2019_2018_diff
dtype: object

df = df.groupby(a, axis=1).diff(-1).dropna(how='all', axis=1)
df.columns = s
print (df)
             total_spent_2021_2020_diff  total_spent_2019_2018_diff
customer_id                                                        
9391                                 20                         -10
2932                                -20                          10
4294                                 20                         -10

Another idea without groupby:

df1 = df.set_index('customer_id')
df1.columns = df1.columns.str.extract('(\d )', expand=False)
a = np.arange(len(df.columns)) // 2


df2 = df1.iloc[:, ::2].sub(df1.iloc[:, 1::2].to_numpy())
df2.columns = 'total_spent_'   df1.columns[::2]   '_'   df1.columns[1::2]   '_diff'
print (df2) 
             total_spent_2021_2020_diff  total_spent_2019_2018_diff
customer_id                                                        
9391                                 20                         -10
2932                                -20                          10
4294                                 20                         -10

EDIT:

If there are strings mixed with numbers is possible convert them to missing values:

df = df.set_index('customer_id').apply(pd.to_numeric, errors='coerce')

For testing:

print (df)
   customer_id  total_spent_2021  total_spent_2020  total_spent_2019  \
0         9391               120               100                80   
1         2932                90               110                60   
2         4294               150               130                40   

   total_spent_2018  
0                90  
1                50  
2                50  

df = df.set_index('customer_id')
print (df)
             total_spent_2021  total_spent_2020  total_spent_2019  \
customer_id                                                         
9391                      120               100                80   
2932                       90               110                60   
4294                      150               130                40   

             total_spent_2018  
customer_id                    
9391                       90  
2932                       50  
4294                       50