I have a dataframe similar to the following:
import pandas as pd
data = {"Name":["Andrew","Andrew","Andrew","Andrew","Andrew","Andrew","Andrew", "Sam", "Sam", "Sam", "Sam", "Sam"], "PASS":[0, 1, 1, 0, 1, 1, 1, 0, 1, 1,0,1]}
df = pd.DataFrame(data=data)
Output
Name PASS
0 Andrew 0
1 Andrew 1
2 Andrew 1
3 Andrew 0
4 Andrew 1
5 Andrew 1
6 Andrew 1
7 Sam 0
8 Sam 1
9 Sam 1
10 Sam 0
11 Sam 1
I want to generate a dataframe which contains the largest consecutive passes for each student:
Name MAX_PASS
0 Andrew 3
1 Sam 2
I need a little help modifying the code that I have so far. The count is outputting 0110110110 and the result = 2. Which is not quite correct. I think I'm close but need some help to get over the finish line. Thanks.
count = ''
for i in range(len(df)-1):
if df.Name[i] == df.Name[i 1]:
if df.PASS[i] == 0:
count = "0"
else:
count = "1"
result = len(max(count.split('0')))
CodePudding user response:
You can consider to adapt this answer
def max_strike_group(x, col):
x = x[col]
a = x != 0
out = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
return out.max()
df.groupby("Name").apply(lambda x:max_strike_group(x, "PASS"))
Name
Andrew 3
Sam 2
dtype: int64
CodePudding user response:
One option is to call on cumsum twice, the first time is to add up the 0s and 1s, the second time is to get the values after the reset:
TL-DR:
cum1 = df.groupby('Name').PASS.cumsum()
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
(df.assign(PASS = cum1,
max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)
Name
Andrew 3
Sam 2
Name: max_pass, dtype: int64
Explanation:
# first cumulative sum
cum1 = df.groupby('Name').PASS.cumsum()
cum1
0 0
1 1
2 2
3 2
4 3
5 4
6 5
7 0
8 1
9 2
10 2
11 3
Name: PASS, dtype: int64
# look for rows where the reset should occur
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
cum1
array([ 0, 1, 1, -2, 1, 1, 1, 0, 1, 1, -2, 1])
# build the max_pass column
# with the second cumsum and groupby
# before grouping again to get the max
(df.assign(PASS = cum1,
max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)
Name
Andrew 3
Sam 2
Name: max_pass, dtype: int64
