Split row if certain date is within a date range using Pandas

I have a dataframe that looks like this:

import pandas as pd

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-15      1
2     A  2022-09-15  2022-09-18     12
3     B  2022-09-01  2022-09-03      4
4     B  2022-09-03  2022-09-07      7
5     B  2022-09-07  2022-09-12      9

I have a certain_date, for example 2022-09-10. Within each group, I would like to split any row whose range from start_date to end_date contains certain_date. For such a row, the end_date should change to certain_date, and an extra row should be added directly below it with certain_date as its start_date and the original end_date as its end_date. Both rows keep the same value. Here is the expected output:

certain_date = '2022-09-10'

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-10', 1], ['A', '2022-09-10', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-10', 9], ['B', '2022-09-10', '2022-09-12', 9]]
df_desired = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9

For GROUP A, certain_date falls within the date range of the second row, so that row is split as described above. The same happens to the third row of GROUP B. Is there a way to do this using pandas?

CodePudding user response：

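You can identify the matching rows with a boolean mask, then concat the non-matching rows with two copies of the matching rows, one with start_date replaced by certain_date and one with end_date replaced by certain_date: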

certain_date = '2022-09-10'
# is date after start?
m1 = df['start_date'].lt(certain_date)
# is date before stop?
m2 = df['end_date'].gt(certain_date)
# is both? (you could do all in one line)
m = m1&m2

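# put the first half (end_date clipped) before the second half (start_date clipped);
# a stable sort keeps that order among rows sharing the same index label
out = (pd.concat([df[~m],
                  df[m].assign(end_date=certain_date),
                  df[m].assign(start_date=certain_date)])
         .sort_index(kind='stable')
         .reset_index(drop=True))

output:

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9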
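
The comparisons above operate on plain strings, which works here only because ISO-formatted dates (YYYY-MM-DD) sort the same way as the dates they represent. If the columns might hold other formats, converting them first is safer. Here is a minimal sketch of the same mask-and-concat idea on real datetime columns. The helper name split_at_date is hypothetical and not part of pandas, and the strict inequalities mean a row is left alone when certain_date equals its start_date or end_date.

```python
import pandas as pd


def split_at_date(df, certain_date):
    """Hypothetical helper: split rows whose (start_date, end_date) strictly contains certain_date."""
    cut = pd.Timestamp(certain_date)
    out = df.copy()
    # Convert to datetime so the comparison does not depend on string formatting.
    out['start_date'] = pd.to_datetime(out['start_date'])
    out['end_date'] = pd.to_datetime(out['end_date'])
    # True where the cut date lies strictly inside the row's range.
    m = out['start_date'].lt(cut) & out['end_date'].gt(cut)
    pieces = [out[~m],                      # rows left untouched
              out[m].assign(end_date=cut),  # first half of each split row
              out[m].assign(start_date=cut)]  # second half of each split row
    # A stable sort keeps the first half ahead of the second half.
    return (pd.concat(pieces)
              .sort_index(kind='stable')
              .reset_index(drop=True))


data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])

result = split_at_date(df, '2022-09-10')
print(result)
```

The function returns a new dataframe with datetime columns and leaves the original df unchanged.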

CodePudding user response：

You can try putting start_date, certain_date and end_date into lists for the matching rows, then explode the result (exploding several columns at once requires pandas 1.3 or newer):

df[['start_date', 'end_date']] = df.apply(lambda row: [[row['start_date'], certain_date],
                                                       [certain_date, row['end_date']]]
                                          if row['start_date'] < certain_date < row['end_date']
                                          else [row['start_date'], row['end_date']],
                                          axis=1, result_type='expand')
out = df.explode(['start_date', 'end_date'], ignore_index=True)

print(out)

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9
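
Note that the snippet above overwrites the start_date and end_date columns of df and calls a Python function once per row. As a variation, here is a minimal sketch that builds the list columns on a copy using a boolean mask, so the original df stays intact. The names m and tmp are just illustrative, and the string comparison again assumes ISO-formatted dates.

```python
import pandas as pd

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
certain_date = '2022-09-10'

# True where certain_date lies strictly between start_date and end_date.
m = df['start_date'].lt(certain_date) & df['end_date'].gt(certain_date)

tmp = df.copy()
# Matching rows get two-element lists, all other rows one-element lists,
# so both columns always have the same number of elements per row.
tmp['start_date'] = pd.Series(
    [[s, certain_date] if hit else [s] for s, hit in zip(df['start_date'], m)],
    index=df.index)
tmp['end_date'] = pd.Series(
    [[certain_date, e] if hit else [e] for e, hit in zip(df['end_date'], m)],
    index=df.index)

# Explode both columns together; each two-element row becomes two rows.
out = tmp.explode(['start_date', 'end_date'], ignore_index=True)
print(out)
```

Because the lists are paired element by element, the first exploded row keeps the original start_date and ends at certain_date, and the second starts at certain_date and keeps the original end_date.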