Dataframe groupby to get stints-CodePudding

I have a new code I'm trying to write where a dataframe gets filtered/edited to obtain "stints" for each individual. Using the dataframe below as an example, I'm basically trying to get each persons start/end dates for a given location. Usually I can get started on my own but I'm stumped as to how to approach this so if anyone has ideas I would greatly appreciate it.

	Person	Location	Date
0	Tom	A	1/1/2021
1	Tom	A	1/2/2021
2	Tom	A	1/3/2021
3	Tom	B	1/4/2021
4	Tom	B	1/5/2021
5	Tom	B	1/6/2021
6	Tom	A	1/7/2021
7	Tom	A	1/8/2021
8	Tom	A	1/9/2021
9	Tom	C	1/10/2021
10	Tom	C	1/11/2021
11	Tom	A	1/12/2021
12	Tom	A	1/13/2021
13	Tom	B	1/14/2021
14	Tom	B	1/15/2021
15	Mark	A	1/1/2021
16	Mark	A	1/2/2021
17	Mark	B	1/3/2021
18	Mark	B	1/4/2021
19	Mark	A	1/5/2021
20	Mark	A	1/6/2021
21	Mark	C	1/7/2021
22	Mark	C	1/8/2021
23	Mark	C	1/9/2021
24	Mark	C	1/10/2021
25	Mark	A	1/11/2021
26	Mark	A	1/12/2021
27	Mark	B	1/13/2021
28	Mark	B	1/14/2021
29	Mark	B	1/15/2021

Expected outcome:

	Person	Location	StintNum	Start_Date	End Date
0	Tom	A	1	1/1/2021	1/3/2021
1	Tom	B	2	1/4/2021	1/6/2021
2	Tom	A	3	1/7/2021	1/9/2021
3	Tom	C	4	1/10/2021	1/11/2021
4	Tom	A	5	1/12/2021	1/13/2021
5	Tom	B	6	1/14/2021	1/15/2021
6	Mark	A	1	1/1/2021	1/2/2021
7	Mark	B	2	1/3/2021	1/4/2021
8	Mark	A	3	1/5/2021	1/6/2021
9	Mark	C	4	1/7/2021	1/10/2021
10	Mark	A	5	1/11/2021	1/12/2021
11	Mark	B	6	1/13/2021	1/15/2021

CodePudding user response：

Try this:

df['Date'] = pd.to_datetime(df['Date'])
new_df = df.groupby([df['Person'], df['Location'].ne(df['Location'].shift(1)).cumsum()], sort=False).apply(lambda x: pd.Series([x['Date'].min(), x['Date'].max()], index=['Start_Date','End_Date'])).reset_index()
new_df['StintNum'] = new_df.groupby('Person').cumcount().add(1)

Output:

>>> new_df
   Person  Location Start_Date   End_Date  StintNum
0     Tom         1 2021-01-01 2021-01-03         1
1     Tom         2 2021-01-04 2021-01-06         2
2     Tom         3 2021-01-07 2021-01-09         3
3     Tom         4 2021-01-10 2021-01-11         4
4     Tom         5 2021-01-12 2021-01-13         5
5     Tom         6 2021-01-14 2021-01-15         6
6    Mark         7 2021-01-01 2021-01-02         1
7    Mark         8 2021-01-03 2021-01-04         2
8    Mark         9 2021-01-05 2021-01-06         3
9    Mark        10 2021-01-07 2021-01-10         4
10   Mark        11 2021-01-11 2021-01-12         5
11   Mark        12 2021-01-13 2021-01-15         6

CodePudding user response：

IMO, a clean way is to use groupby agg, this enables to set custom aggregators easily and is faster than apply:

df['Date'] = pd.to_datetime(df['Date'])

group = df['Location'].ne(df['Location'].shift()).cumsum()

df2 = (
 df.groupby(['Person', group], as_index=False)
   .agg(Location=('Location', 'first'),
        # line below is a dummy function to set a column placeholder
        # uncomment it you want the columns in order
        #StintNum=('Location', lambda x: float('NaN')),
        Start_Date=('Date', 'min'), 
        End_Date=('Date', 'max'),
       )
)

df2['StintNum'] = df2.groupby('Person').cumcount().add(1)

Output:

   Person Location  StintNum Start_Date   End_Date
0    Mark        A         1 2021-01-01 2021-01-02
1    Mark        B         2 2021-01-03 2021-01-04
2    Mark        A         3 2021-01-05 2021-01-06
3    Mark        C         4 2021-01-07 2021-01-10
4    Mark        A         5 2021-01-11 2021-01-12
5    Mark        B         6 2021-01-13 2021-01-15
6     Tom        A         1 2021-01-01 2021-01-03
7     Tom        B         2 2021-01-04 2021-01-06
8     Tom        A         3 2021-01-07 2021-01-09
9     Tom        C         4 2021-01-10 2021-01-11
10    Tom        A         5 2021-01-12 2021-01-13
11    Tom        B         6 2021-01-14 2021-01-15