I would like to calculate the difference amount of total_spent for each customer_id by taking the difference for each two column pairs. In this case subtracting total_spent_2020 from total_spent_2021 but in a more dynamic way. In numbers it would be substracting: column_3 from column_2 and column_5 from column_4 and so on.
I was thinking of taking the length of the df and add a range with a 2 interval but I am struggling to do the subtraction based on that interval number.
sample_df
customer_id total_spent_2021 total_spent_2020 total_spent_2019 total_spent_2018
9391 120 100 80 90
2932 90 110 60 50
4294 150 130 40 50
ouput_df
customer_id total_spent_2021_2020_diff total_spent_2019_2018_diff
9391 20 -10
2932 -20 10
4294 20 -10
CodePudding user response:
Create groups in array a by integer division by 2, create new column names by extract numbers and join and aggregate DataFrameGroupBy.diff with remove only NaNs columns with new columns names by s:
df = df.set_index('customer_id')
a = np.arange(len(df.columns)) // 2
s = (df.columns
.to_series()
.str.extract('(\d )', expand=False)
.groupby(a)
.agg('_'.join)
.radd('total_spent_')
.add('_diff'))
print (s)
0 total_spent_2021_2020_diff
1 total_spent_2019_2018_diff
dtype: object
df = df.groupby(a, axis=1).diff(-1).dropna(how='all', axis=1)
df.columns = s
print (df)
total_spent_2021_2020_diff total_spent_2019_2018_diff
customer_id
9391 20 -10
2932 -20 10
4294 20 -10
Another idea without groupby:
df1 = df.set_index('customer_id')
df1.columns = df1.columns.str.extract('(\d )', expand=False)
a = np.arange(len(df.columns)) // 2
df2 = df1.iloc[:, ::2].sub(df1.iloc[:, 1::2].to_numpy())
df2.columns = 'total_spent_' df1.columns[::2] '_' df1.columns[1::2] '_diff'
print (df2)
total_spent_2021_2020_diff total_spent_2019_2018_diff
customer_id
9391 20 -10
2932 -20 10
4294 20 -10
EDIT:
If there are strings mixed with numbers is possible convert them to missing values:
df = df.set_index('customer_id').apply(pd.to_numeric, errors='coerce')
For testing:
print (df)
customer_id total_spent_2021 total_spent_2020 total_spent_2019 \
0 9391 120 100 80
1 2932 90 110 60
2 4294 150 130 40
total_spent_2018
0 90
1 50
2 50
df = df.set_index('customer_id')
print (df)
total_spent_2021 total_spent_2020 total_spent_2019 \
customer_id
9391 120 100 80
2932 90 110 60
4294 150 130 40
total_spent_2018
customer_id
9391 90
2932 50
4294 50
