Efficient way subtract a row with previous row seperated by group with Pandas-CodePudding

The objective is to subtract a row (N) with previous row (N-1) separated by groups.

Given a df

   years nchar  nval
0   2019     a     1
1   2019     b     1
2   2019     c     1
3   2020     a     1
4   2020     s     4

Lets,separate into group of year 2019, and we denote it as df_2019

For df_2019, there we assign constant 10.

Then,only for index 0, we do the following operation and assign to a new column 'B`

df_2019.loc[df_2019.index[0], 'B']= 10 - df_2019['nval'].values[0]

Whereas, the other index

df_2019.loc[df_2019.index[N], 'B'] = df_2019['B'].values[N-1] - df_2019['nval'].values[N]

This, will produced the following table

   years      nchar nval C   D    B
 
1   2019          a    1          9
2   2019          b    1          8
3   2019          c    1          7

For the group 2020, the same computation apply. However, the only difference is, the constant value is the 7, which is taken from the last index of column B.

To answer this requirement, the following code is produced with extra possible groups.

import pandas as pd
year=[2019,2019,2019,2020,2020,2020,2020,2022,2022,2022]
nval=[1,1,1,1,4,1,4,5,6,7]
nchar=['a','b','c','a','s','c','a','b','c','g']

df=pd.DataFrame(zip(year,nchar,nval),columns=['years','nchar','nval'])
print(df)
year_ls=[2019,2020,2022]
nspacing_total=2

nspacing_between_df=4
all_df=[]
default_val=10
for idx,dyear in enumerate(year_ls):
  df_=df[df['years']==dyear].reset_index(drop=True)
  t=pd.DataFrame([[''] * 3]*len(df_), columns=["C", "D", "B"])
  df_=pd.concat([df_,t],axis=1)
  Total = df_['nval'].sum()
  df_=pd.DataFrame([[''] * len(df.columns)]*1, columns=df.columns).append(df_).reset_index(drop=True)

  if idx ==0:
    df_.loc[df_.index[0], 'B']=default_val
    

  if idx !=0:
    
    pre_df=all_df[idx-1]
    pre_val=pre_df['B'].values[-1]

    nposi=1

    pre_years=pre_df['years'].values[nposi]
    df_.loc[df_.index[0], 'nchar']=f'From {pre_years}'
    df_.loc[df_.index[0], 'B']=pre_val

  for ndexd in range(df_.shape[0]-1):
    df_.loc[df_.index[ndexd 1], 'B']=df_['B'].values[ndexd]-df_['nval'].values[ndexd 1]
    
  df_=df_.append(pd.DataFrame([[''] * len(df.columns)]*nspacing_total, columns=df.columns)).reset_index(drop=True)
  df_.loc[df_.index[-1], 'nval']=Total
  df_.loc[df_.index[-1], 'nchar']='Total'
  df_.loc[df_.index[-1], 'B']=df_['B'].values[0]-df_['nval'].values[-1]

  all_df.append(df_)

However, I wonder whether this proposal can be further simplified further using pandas groupby or other. I really appreciate for any tips.

Ultimately, I would like to express the table as below, which will be exported to excel

   years      nchar nval C   D    B
0                           10     
1   2019          a    1          9
2   2019          b    1          8
3   2019          c    1          7
4                                  
5             Total    3          7
6                                  
7                                  
8                                  
9                                  
10        From 2019               7
11  2020          a    1          6
12  2020          s    4          2
13  2020          c    1          1
14  2020          a    4         -3
15                                 
16            Total   10         -3
17                                 
18                                 
19                                 
20                                 
21        From 2020              -3
22  2022          b    5         -8
23  2022          c    6        -14
24  2022          g    7        -21
25                                 
26            Total   18        -21
27                                 
28                                 
29                                 
30

The code to produced the above table

# Optional to represent the table above
all_ap_df=[]
for a_df in all_df:
  df=a_df.append(pd.DataFrame([[''] * len(df.columns)]*nspacing_between_df, columns=df.columns)).reset_index(drop=True)
  all_ap_df.append(df)
df=pd.concat(all_ap_df,axis=0).reset_index(drop=True)


df.loc[df_.index[0], 'D']=df['B'].values[0]
df.loc[df_.index[0], 'B']=''
df = df.fillna('')

CodePudding user response：

I think this is actually quite simple. Use groupby cumsum:

df['B'] = 10 - df['nval'].cumsum()

Output:

>>> df
   years nchar  nval  B
0   2019     a     1  9
1   2019     b     1  8
2   2019     c     1  7
3   2020     a     1  6
4   2020     s     4  2

CodePudding user response：

In your case chain with groupby

df['new'] = df.groupby('years')['nval'].cumsum().rsub(10)
Out[8]: 
0    9
1    8
2    7
3    9
4    5
Name: nval, dtype: int64