Pandas sum over rows using conditional substring of other columns-CodePudding

I want to calculate the sum of several columns based on the column name,

import pandas as pd
import numpy as np

dates = pd.date_range(start='2021-01-03',end='2021-02-02',freq='D')
df = pd.DataFrame(data={'date': dates,
                        'rabbit1_41a': abs(4 0.1*np.random.randn(len(dates))),
                        'rabbit4_100b': abs(5.2 0.5*np.random.randn(len(dates))),
                        'kitten11_445a': abs(0.5 0.1*np.random.randn(len(dates))),
                        'kitten11_72c': abs(0.8 0.5*np.random.randn(len(dates))),
                        'hare2_1000': abs(7 np.random.randn(len(dates))),
                        'hare1_58': abs(8 0.8*np.random.randn(len(dates))),
                        'hare1_26': abs(7.6 0.2*np.random.randn(len(dates))),
                        'hare3_25': abs(9.1 0.3*np.random.randn(len(dates))),
                        }
                  )

# new table, contains sum of rabbit, kitten, hare
df0 = pd.DataFrame(data={'date': dates})

species = ['rabbit', 'kitten', 'hare']

for ii in species:
    for jj in df.columns:
        # calculate sum of rabbit, kitten, hare
##        df0[ii] = df.loc[df[jj][0:int(ii.rindex(ii[-1]) 1)]==ii].sum(axis=1)
        df0[ii] = np.select([df[jj].str.contains(ii)]).sum(axis=1)
print(df0.head())

The raw data in df contains daily measurements and I have a new table df0 that covers the same time period. I want to calculate the sum of each column, like

df0['rabbit'] = df['rabbit1_41a']   df['rabbit4_100b']     
df0['kitten'] = df['kitten11_445a']   df['kitten11_72c']
df0['hare'] = df['hare2_1000']   df['hare1_58']   df['hare1_26']   df['hare3_25']

How is it done with string slices? This post is useful, but I'm not sure how to adapt it to take the sum of columns.

CodePudding user response：

Since you're only trying to see if the column name contains a substring (as opposed to checking if string data contains a pattern), you could sum over columns that contain a substring.

df0["rabbit"] = df[ [ c for c in df.columns if "rabbit" in c ] ].sum( axis = 1 )

You could do the same for "kitten" and "hare".

Using your code and starting with np.random.seed( 123 ), this would be the top few rows of your output:

>>> df0.head()
    date    rabbit    kitten       hare
0 2021-01-03  7.692142  1.920358  33.444147
1 2021-01-04  8.413968  0.999868  31.448853
2 2021-01-05  8.878359  1.957287  32.017930
3 2021-01-06  9.513102  0.838440  33.003162
4 2021-01-07  9.055322  0.622813  31.929664

For reference, when you use pd.Series.str.contains() (which can only be used with string values whereas you have numeric columns), you are checking if the data values (and not the column names) contain whatever pattern you input.

CodePudding user response：

(i) set_index with "date".

(ii) create MultiIndex columns by splitting on string and number

(iii) stack the number level to index

(iv) groupby dates and sum across columns

df = df.set_index('date')
df.columns = df.columns.str.split('([a-z] )(\d )', expand=True)
df = df.droplevel([0,2], axis=1).stack().groupby(level=0).sum().reset_index()

Output:

         date       hare    kitten     rabbit
0  2021-01-03  31.705293  0.901572   9.816273
1  2021-01-04  28.816726  0.551446   9.305995
2  2021-01-05  29.331354  1.198637   9.346137
3  2021-01-06  31.019732  0.714525   8.316858
4  2021-01-07  30.554802  1.134589   8.487755
...

CodePudding user response：

If the first 4 characters of the column identify its group uniquely, you can simply do

df1 = df.groupby([c[:4] for c in df.columns], axis=1).sum()

to get something like

    date    hare        kitt        rabb
0   0.0     30.276673   1.076560    9.665169
1   0.0     31.774017   1.445791    10.263471
2   0.0     32.620976   1.627564    8.708358
...

which is not quite there yet but close. To beat it into the right shape you could rename the columns and join with df['date']] as the date in df1 got messed up:

species = ['rabbit', 'kitten', 'hare']
sd = {s[:4]:s for s in species}
df[['date']].join(df1.drop(columns = 'date')).rename(columns = sd)

output

    date         hare       kitten      rabbit
0   2021-01-03  30.276673   1.076560    9.665169
1   2021-01-04  31.774017   1.445791    10.263471
2   2021-01-05  32.620976   1.627564    8.708358
...

Solution 2

Here we do exact matching assuming species is given as in your question. Then we groupby on that

spnames = [next(s for s in species if s in colname) for colname in df.columns[1:]]
df.set_index('date').groupby(spnames, axis=1).sum()

output as required; for reference

print(spnames)
['rabbit', 'rabbit', 'kitten', 'kitten', 'hare', 'hare', 'hare', 'hare']

CodePudding user response：

Follow the following steps

Rename columns stripping the underscore and trailing alphanumeric characters

Step 1 above allows you to melt the columns into elements defined in list

Convert the dataframe from wide to long

Groupby dates and sum. This in my opinion should be the final stage. No need to join to df0

species = ['rabbit', 'kitten', 'hare']

#Rename df columns to allow use of pd.wide_to_long

df.columns =df.columns.str.replace('\_\w $','', regex=True)

new=(df0.set_index('date')#Set date as index
      .join(
          pd.wide_to_long(df, species, i="date", j="suffix").droplevel(level=1)#Melt columns into each of the element in the list and sum
          .groupby('date').sum()#Do summation by date
      )
          
      )