I have a new code I'm trying to write where a dataframe gets filtered/edited to obtain "stints" for each individual. Using the dataframe below as an example, I'm basically trying to get each persons start/end dates for a given location. Usually I can get started on my own but I'm stumped as to how to approach this so if anyone has ideas I would greatly appreciate it.
| Person | Location | Date | |
|---|---|---|---|
| 0 | Tom | A | 1/1/2021 |
| 1 | Tom | A | 1/2/2021 |
| 2 | Tom | A | 1/3/2021 |
| 3 | Tom | B | 1/4/2021 |
| 4 | Tom | B | 1/5/2021 |
| 5 | Tom | B | 1/6/2021 |
| 6 | Tom | A | 1/7/2021 |
| 7 | Tom | A | 1/8/2021 |
| 8 | Tom | A | 1/9/2021 |
| 9 | Tom | C | 1/10/2021 |
| 10 | Tom | C | 1/11/2021 |
| 11 | Tom | A | 1/12/2021 |
| 12 | Tom | A | 1/13/2021 |
| 13 | Tom | B | 1/14/2021 |
| 14 | Tom | B | 1/15/2021 |
| 15 | Mark | A | 1/1/2021 |
| 16 | Mark | A | 1/2/2021 |
| 17 | Mark | B | 1/3/2021 |
| 18 | Mark | B | 1/4/2021 |
| 19 | Mark | A | 1/5/2021 |
| 20 | Mark | A | 1/6/2021 |
| 21 | Mark | C | 1/7/2021 |
| 22 | Mark | C | 1/8/2021 |
| 23 | Mark | C | 1/9/2021 |
| 24 | Mark | C | 1/10/2021 |
| 25 | Mark | A | 1/11/2021 |
| 26 | Mark | A | 1/12/2021 |
| 27 | Mark | B | 1/13/2021 |
| 28 | Mark | B | 1/14/2021 |
| 29 | Mark | B | 1/15/2021 |
Expected outcome:
| Person | Location | StintNum | Start_Date | End Date | |
|---|---|---|---|---|---|
| 0 | Tom | A | 1 | 1/1/2021 | 1/3/2021 |
| 1 | Tom | B | 2 | 1/4/2021 | 1/6/2021 |
| 2 | Tom | A | 3 | 1/7/2021 | 1/9/2021 |
| 3 | Tom | C | 4 | 1/10/2021 | 1/11/2021 |
| 4 | Tom | A | 5 | 1/12/2021 | 1/13/2021 |
| 5 | Tom | B | 6 | 1/14/2021 | 1/15/2021 |
| 6 | Mark | A | 1 | 1/1/2021 | 1/2/2021 |
| 7 | Mark | B | 2 | 1/3/2021 | 1/4/2021 |
| 8 | Mark | A | 3 | 1/5/2021 | 1/6/2021 |
| 9 | Mark | C | 4 | 1/7/2021 | 1/10/2021 |
| 10 | Mark | A | 5 | 1/11/2021 | 1/12/2021 |
| 11 | Mark | B | 6 | 1/13/2021 | 1/15/2021 |
CodePudding user response:
Try this:
df['Date'] = pd.to_datetime(df['Date'])
new_df = df.groupby([df['Person'], df['Location'].ne(df['Location'].shift(1)).cumsum()], sort=False).apply(lambda x: pd.Series([x['Date'].min(), x['Date'].max()], index=['Start_Date','End_Date'])).reset_index()
new_df['StintNum'] = new_df.groupby('Person').cumcount().add(1)
Output:
>>> new_df
Person Location Start_Date End_Date StintNum
0 Tom 1 2021-01-01 2021-01-03 1
1 Tom 2 2021-01-04 2021-01-06 2
2 Tom 3 2021-01-07 2021-01-09 3
3 Tom 4 2021-01-10 2021-01-11 4
4 Tom 5 2021-01-12 2021-01-13 5
5 Tom 6 2021-01-14 2021-01-15 6
6 Mark 7 2021-01-01 2021-01-02 1
7 Mark 8 2021-01-03 2021-01-04 2
8 Mark 9 2021-01-05 2021-01-06 3
9 Mark 10 2021-01-07 2021-01-10 4
10 Mark 11 2021-01-11 2021-01-12 5
11 Mark 12 2021-01-13 2021-01-15 6
CodePudding user response:
IMO, a clean way is to use groupby agg, this enables to set custom aggregators easily and is faster than apply:
df['Date'] = pd.to_datetime(df['Date'])
group = df['Location'].ne(df['Location'].shift()).cumsum()
df2 = (
df.groupby(['Person', group], as_index=False)
.agg(Location=('Location', 'first'),
# line below is a dummy function to set a column placeholder
# uncomment it you want the columns in order
#StintNum=('Location', lambda x: float('NaN')),
Start_Date=('Date', 'min'),
End_Date=('Date', 'max'),
)
)
df2['StintNum'] = df2.groupby('Person').cumcount().add(1)
Output:
Person Location StintNum Start_Date End_Date
0 Mark A 1 2021-01-01 2021-01-02
1 Mark B 2 2021-01-03 2021-01-04
2 Mark A 3 2021-01-05 2021-01-06
3 Mark C 4 2021-01-07 2021-01-10
4 Mark A 5 2021-01-11 2021-01-12
5 Mark B 6 2021-01-13 2021-01-15
6 Tom A 1 2021-01-01 2021-01-03
7 Tom B 2 2021-01-04 2021-01-06
8 Tom A 3 2021-01-07 2021-01-09
9 Tom C 4 2021-01-10 2021-01-11
10 Tom A 5 2021-01-12 2021-01-13
11 Tom B 6 2021-01-14 2021-01-15
