Getting a count for each category in a Pandas DataFrame-CodePudding

I have a dataframe similar to the following:

import pandas as pd
data = {"Name":["Andrew","Andrew","Andrew","Andrew","Andrew","Andrew","Andrew", "Sam", "Sam", "Sam", "Sam", "Sam"], "PASS":[0, 1, 1, 0, 1, 1, 1, 0, 1, 1,0,1]}
df = pd.DataFrame(data=data)

Output

    Name    PASS
0   Andrew  0
1   Andrew  1
2   Andrew  1
3   Andrew  0
4   Andrew  1
5   Andrew  1
6   Andrew  1
7   Sam     0
8   Sam     1
9   Sam     1
10  Sam     0
11  Sam     1

I want to generate a dataframe which contains the largest consecutive passes for each student:

    Name    MAX_PASS
0   Andrew  3
1   Sam     2

I need a little help modifying the code that I have so far. The count is outputting 0110110110 and the result = 2. Which is not quite correct. I think I'm close but need some help to get over the finish line. Thanks.

count = ''
for i in range(len(df)-1):
    if df.Name[i] == df.Name[i 1]:
        if df.PASS[i] == 0:
            count  = "0"
        else:
            count  = "1"  
            result = len(max(count.split('0')))

CodePudding user response：

You can consider to adapt this answer

def max_strike_group(x, col):
    x = x[col]
    a = x != 0
    out = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return out.max()

df.groupby("Name").apply(lambda x:max_strike_group(x, "PASS"))

Name
Andrew    3
Sam       2
dtype: int64

CodePudding user response：

One option is to call on cumsum twice, the first time is to add up the 0s and 1s, the second time is to get the values after the reset:

TL-DR:

cum1 = df.groupby('Name').PASS.cumsum()
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64

Explanation:

# first cumulative sum
cum1 = df.groupby('Name').PASS.cumsum()
cum1
0     0
1     1
2     2
3     2
4     3
5     4
6     5
7     0
8     1
9     2
10    2
11    3
Name: PASS, dtype: int64

# look for rows where the reset should occur
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
cum1
array([ 0,  1,  1, -2,  1,  1,  1,  0,  1,  1, -2,  1])

# build the max_pass column
# with the second cumsum and groupby
# before grouping again to get the max
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64