I have a dataframe of clients CLIENT_ID as shown below:
| CLIENT_ID | CURRENT_DATE_STATUS | STATUS |
|---|---|---|
| 10002 | 2017-07-21 | STARTED |
| 10002 | 2017-07-21 | STARTED |
| 10002 | 2018-07-01 | CHURNED |
| 10002 | 2018-07-01 | CHURNED |
| 10002 | 2019-01-01 | RESTARTED |
| 11811 | 2019-08-15 | STARTED |
| 11811 | 2019-08-15 | STARTED |
| 11811 | 2019-12-31 | RESTARTED |
| 22101 | 2020-03-11 | STARTED |
| 22101 | 2020-03-11 | STARTED |
| 22101 | 2020-03-11 | STARTED |
| 22101 | 2020-11-01 | CHURNED |
| 22300 | 2018-05-06 | STARTED |
| 22300 | 2018-05-06 | STARTED |
The dataframe is sorted by CLIENT_ID and CURRENT_DATE_STATUS. How can I create an indicator Boolean 1 or 0 column which indicates:
- If the previous
STATUSentry has changed to eitherCHURNED or RESTARTEDfor eachCLIENT_ID.
The resultant dataframe would be as shown below:
| CLIENT_ID | CURRENT_DATE_STATUS | STATUS | STOPPED |
|---|---|---|---|
| 10002 | 2017-07-21 | STARTED | 0 |
| 10002 | 2017-07-21 | STARTED | 0 |
| 10002 | 2018-07-01 | CHURNED | 1 |
| 10002 | 2018-07-01 | CHURNED | 0 |
| 10002 | 2019-01-01 | RESTARTED | 1 |
| 11811 | 2019-08-15 | STARTED | 0 |
| 11811 | 2019-08-15 | STARTED | 0 |
| 11811 | 2019-12-31 | RESTARTED | 1 |
| 22101 | 2020-03-11 | STARTED | 0 |
| 22101 | 2020-03-11 | STARTED | 0 |
| 22101 | 2020-03-11 | STARTED | 0 |
| 22101 | 2020-11-01 | CHURNED | 1 |
| 22300 | 2018-05-06 | STARTED | 0 |
| 22300 | 2018-05-06 | STARTED | 0 |
here is code to generate the dataframe
import pandas as pd
data = {'CLIENT_ID':[10002,10002,10002,10002,10002,11811,11811,11811,22101,22101,22101,22101,22300,22300],
'CURRENT_DATE_STATUS':['2017-07-21','2017-07-21','2018-07-01','2018-07-01','2019-07-01','2019-08-15','2019-08-15','2019-12-31','2020-03-11','2020-03-11','2020-03-11','2020-11-01','2018-05-06','2018-05-06'],
'STATUS':['STARTED','STARTED','CHURNED','CHURNED','RESTARTED','STARTED','STARTED','RESTARTED','STARTED','STARTED','STARTED','CHURNED','STARTED','STARTED']}
df = pd.DataFrame(data)
CodePudding user response:
You can check whether the current value equals to previous value:
cond = lambda d: (d.shift().ne(d)&d.isin(["CHURNED", "RESTARTED"])).astype(int)
df["STOPPED"] = df.groupby("CLIENT_ID")["STATUS"].transform(cond)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
CodePudding user response:
You can compare actual values for eqaul by Series.eq with shifted per groups by DataFrameGroupBy.shift for not equalSeries.ne, chain by & for bitwise AND and last chain by | for bitwise OR with casting to integers:
s = df.groupby('CLIENT_ID')['STATUS'].shift()
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')
df['STOPPED'] = (m1 | m2).astype(int)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
Another solution is compare shifted values by previous and then if match by list in Series.isin, last chain by & for bitwise AND:
m3 = df.groupby('CLIENT_ID')['STATUS'].shift().ne(df['STATUS'])
m4 = df['STATUS'].isin(["CHURNED", "RESTARTED"])
df['STOPPED'] = (m3 & m4).astype(int)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
