I want to calculate the sum of several columns based on the column name:
import pandas as pd
import numpy as np
dates = pd.date_range(start='2021-01-03',end='2021-02-02',freq='D')
df = pd.DataFrame(data={'date': dates,
                        'rabbit1_41a': abs(4 + 0.1*np.random.randn(len(dates))),
                        'rabbit4_100b': abs(5.2 + 0.5*np.random.randn(len(dates))),
                        'kitten11_445a': abs(0.5 + 0.1*np.random.randn(len(dates))),
                        'kitten11_72c': abs(0.8 + 0.5*np.random.randn(len(dates))),
                        'hare2_1000': abs(7 + np.random.randn(len(dates))),
                        'hare1_58': abs(8 + 0.8*np.random.randn(len(dates))),
                        'hare1_26': abs(7.6 + 0.2*np.random.randn(len(dates))),
                        'hare3_25': abs(9.1 + 0.3*np.random.randn(len(dates))),
                        }
                  )
# new table, contains sum of rabbit, kitten, hare
df0 = pd.DataFrame(data={'date': dates})
species = ['rabbit', 'kitten', 'hare']
for ii in species:
    for jj in df.columns:
        # calculate sum of rabbit, kitten, hare
        ## df0[ii] = df.loc[df[jj][0:int(ii.rindex(ii[-1]) + 1)]==ii].sum(axis=1)
        df0[ii] = np.select([df[jj].str.contains(ii)]).sum(axis=1)
print(df0.head())
The raw data in df contains daily measurements, and I have a new table df0 that covers the same time period. I want to calculate the row-wise sum of the columns that belong to each species, like
df0['rabbit'] = df['rabbit1_41a'] + df['rabbit4_100b']
df0['kitten'] = df['kitten11_445a'] + df['kitten11_72c']
df0['hare'] = df['hare2_1000'] + df['hare1_58'] + df['hare1_26'] + df['hare3_25']
How is it done with string slices? This post is useful, but I'm not sure how to adapt it to take the sum of columns.
CodePudding user response:
Since you're only trying to see if the column name contains a substring (as opposed to checking if string data contains a pattern), you could sum over columns that contain a substring.
df0["rabbit"] = df[ [ c for c in df.columns if "rabbit" in c ] ].sum( axis = 1 )
You could do the same for "kitten" and "hare".
Using your code and starting with np.random.seed( 123 ), this would be the top few rows of your output:
>>> df0.head()
        date    rabbit    kitten       hare
0 2021-01-03  7.692142  1.920358  33.444147
1 2021-01-04  8.413968  0.999868  31.448853
2 2021-01-05  8.878359  1.957287  32.017930
3 2021-01-06  9.513102  0.838440  33.003162
4 2021-01-07  9.055322  0.622813  31.929664
For reference, when you use pd.Series.str.contains() (which can only be used with string values whereas you have numeric columns), you are checking if the data values (and not the column names) contain whatever pattern you input.
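The per-species sum above can be wrapped in a loop so that it covers every entry of the species list. The following is a minimal sketch on a reduced, made-up frame (the seed, the five-row length and the `rng` name are assumptions for illustration, not part of the question):

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's df; values are arbitrary.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-03", periods=5, freq="D")
df = pd.DataFrame({
    "date": dates,
    "rabbit1_41a": rng.random(5),
    "rabbit4_100b": rng.random(5),
    "kitten11_445a": rng.random(5),
    "hare2_1000": rng.random(5),
    "hare1_58": rng.random(5),
})

species = ["rabbit", "kitten", "hare"]
df0 = pd.DataFrame({"date": dates})
for name in species:
    # Select the columns whose *name* contains the species string,
    # then add them up row by row.
    cols = [c for c in df.columns if name in c]
    df0[name] = df[cols].sum(axis=1)

print(df0.head())
```

Note that a plain substring test would also match a species name that happens to appear inside another column name, so `c.startswith(name)` may be the safer check if that can occur in your data.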
CodePudding user response:
(i) set_index with "date".
(ii) create MultiIndex columns by splitting on string and number
(iii) stack the number level to index
(iv) groupby dates and sum across columns
df = df.set_index('date')
df.columns = df.columns.str.split(r'([a-z]+)(\d+)', expand=True)
df = df.droplevel([0,2], axis=1).stack().groupby(level=0).sum().reset_index()
Output:
        date       hare    kitten    rabbit
0 2021-01-03  31.705293  0.901572  9.816273
1 2021-01-04  28.816726  0.551446  9.305995
2 2021-01-05  29.331354  1.198637  9.346137
3 2021-01-06  31.019732  0.714525  8.316858
4 2021-01-07  30.554802  1.134589  8.487755
...
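To see which pieces the split in step (ii) produces, and therefore which levels get dropped afterwards, it may help to run the same pattern through the standard-library `re.split` on a few column names. This is only an illustration of the regular expression; it assumes pandas applies the pattern the same way when it treats it as a regex:

```python
import re

# Letters followed by digits, both captured so they are kept in the result.
pattern = r"([a-z]+)(\d+)"

results = {}
for name in ["rabbit1_41a", "kitten11_445a", "hare2_1000"]:
    # With capture groups, re.split returns the text before the match,
    # the captured groups, and the text after the match.
    results[name] = re.split(pattern, name)
    print(name, "->", results[name])

# rabbit1_41a   -> ['', 'rabbit', '1', '_41a']
# kitten11_445a -> ['', 'kitten', '11', '_445a']
# hare2_1000    -> ['', 'hare', '2', '_1000']
```

Each name yields four parts: an empty string (level 0), the species (level 1), the number (level 2) and the remaining suffix (level 3).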
CodePudding user response:
If the first 4 characters of the column identify its group uniquely, you can simply do
df1 = df.groupby([c[:4] for c in df.columns], axis=1).sum()
to get something like
   date       hare      kitt       rabb
0   0.0  30.276673  1.076560   9.665169
1   0.0  31.774017  1.445791  10.263471
2   0.0  32.620976  1.627564   8.708358
...
which is not quite there yet but close. To beat it into the right shape, you could rename the columns and join with df[['date']], since the date column in df1 got messed up:
species = ['rabbit', 'kitten', 'hare']
sd = {s[:4]:s for s in species}
df[['date']].join(df1.drop(columns = 'date')).rename(columns = sd)
output
        date       hare    kitten     rabbit
0 2021-01-03  30.276673  1.076560   9.665169
1 2021-01-04  31.774017  1.445791  10.263471
2 2021-01-05  32.620976  1.627564   8.708358
...
Solution 2
Here we do exact matching, assuming species is given as in your question, and then group by the matched names:
spnames = [next(s for s in species if s in colname) for colname in df.columns[1:]]
df.set_index('date').groupby(spnames, axis=1).sum()
output as required; for reference
print(spnames)
['rabbit', 'rabbit', 'kitten', 'kitten', 'hare', 'hare', 'hare', 'hare']
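As far as I know, recent pandas releases (2.1 and later) deprecate `axis=1` in `groupby`, so the calls above may emit a warning or stop working in newer versions. A possible workaround, sketched here on a small made-up frame, is to transpose, group the rows, and transpose back. The `mapping` dict and the sample data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's df; values are arbitrary.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-03", periods=5, freq="D")
df = pd.DataFrame({
    "date": dates,
    "rabbit1_41a": rng.random(5),
    "rabbit4_100b": rng.random(5),
    "kitten11_445a": rng.random(5),
    "hare2_1000": rng.random(5),
    "hare1_58": rng.random(5),
})
species = ["rabbit", "kitten", "hare"]

wide = df.set_index("date")
# Map every column name to the first species it contains.
mapping = {col: next(s for s in species if s in col) for col in wide.columns}

# After transposing, the old column names are row labels, so an ordinary
# row-wise groupby with the dict does the grouping; transpose back at the end.
grouped = wide.T.groupby(mapping).sum().T

print(grouped.head())
```

The result is indexed by date and has one column per species, in alphabetical order.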
CodePudding user response:
Follow these steps:
1. Rename the columns, stripping the underscore and the trailing alphanumeric characters.
2. Step 1 allows you to melt the columns into the elements defined in the list.
3. Convert the dataframe from wide to long.
4. Group by date and sum. In my opinion this should be the final stage; there is no need to join to df0.
species = ['rabbit', 'kitten', 'hare']
# Rename df columns to allow use of pd.wide_to_long
df.columns = df.columns.str.replace(r'_\w+$', '', regex=True)
new = (df0.set_index('date')  # set date as index
       .join(
           pd.wide_to_long(df, species, i="date", j="suffix")  # melt the columns into each element of the list
           .droplevel(level=1)
           .groupby('date').sum()  # sum by date
       )
)
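The same wide-to-long idea can also be sketched with `melt` in place of `pd.wide_to_long`, which avoids renaming the columns first. This is an alternative under stated assumptions, not the code above: the sample frame, the `long` and `result` names, and the prefix pattern `^([a-z]+)` (species name is the leading run of lowercase letters) are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's df; values are arbitrary.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-03", periods=5, freq="D")
df = pd.DataFrame({
    "date": dates,
    "rabbit1_41a": rng.random(5),
    "rabbit4_100b": rng.random(5),
    "kitten11_445a": rng.random(5),
    "kitten11_72c": rng.random(5),
    "hare2_1000": rng.random(5),
})

# Wide to long: one row per (date, original column name).
long = df.melt(id_vars="date", var_name="column", value_name="value")

# Pull the species out of the column name (leading lowercase letters).
long["species"] = long["column"].str.extract(r"^([a-z]+)", expand=False)

# Sum per date and species, then pivot the species back into columns.
result = (long.groupby(["date", "species"])["value"]
              .sum()
              .unstack("species")
              .reset_index())

print(result.head())
```

Because the grouping key is derived from the name itself, two columns that share a prefix and number, such as kitten11_445a and kitten11_72c, are still kept apart until the final sum.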
