I have a dataframe that looks like this:
import pandas as pd
data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-15      1
2     A  2022-09-15  2022-09-18     12
3     B  2022-09-01  2022-09-03      4
4     B  2022-09-03  2022-09-07      7
5     B  2022-09-07  2022-09-12      9
I have a certain_date, for example 2022-09-10. For every row in every group, I would like to split the row if certain_date falls strictly between its start_date and end_date. When it does, the end_date of that row should change to certain_date, and an extra row should be added directly below it with certain_date as its start_date and the original end_date as its end_date. Both rows should keep the same value. Here is the expected output:
certain_date = '2022-09-10'
data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-10', 1], ['A', '2022-09-10', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-10', 9], ['B', '2022-09-10', '2022-09-12', 9]]
df_desired = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9
For GROUP A, the certain_date falls within the date range of the second row of the dataframe, so that row is split the way I described above. The same happens to the last row of GROUP B. Is there a way to solve this using pandas?
CodePudding user response:
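You can identify the matching rows with a boolean mask, then concat the non-matching rows with two copies of the matching rows, one with start_date replaced by certain_date and one with end_date replaced by certain_date: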
certain_date = '2022-09-10'
# is date after start?
m1 = df['start_date'].lt(certain_date)
# is date before stop?
m2 = df['end_date'].gt(certain_date)
# are both true? (you could do it all in one line)
m = m1 & m2
out = pd.concat([df[~m],
                 df[m].assign(start_date=certain_date),
                 df[m].assign(end_date=certain_date)]).sort_index()
output:
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-10  2022-09-15      1
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-15  2022-09-18     12
3     B  2022-09-01  2022-09-03      4
4     B  2022-09-03  2022-09-07      7
5     B  2022-09-10  2022-09-12      9
5     B  2022-09-07  2022-09-10      9
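Note that in the output above each pair of split rows shares an index label, and the later half of the interval is listed before the earlier half, which differs from the expected output in the question. A minimal sketch of one way to get the expected order, assuming the same df and certain_date as in the question: put the end_date copy first in the concat, sort with a stable algorithm so rows with equal labels keep that order, then renumber the index. The string comparison is only safe here because the dates are in ISO YYYY-MM-DD format.

```python
import pandas as pd

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
certain_date = '2022-09-10'

# rows whose interval strictly contains certain_date
m = df['start_date'].lt(certain_date) & df['end_date'].gt(certain_date)

out = (pd.concat([df[~m],
                  df[m].assign(end_date=certain_date),     # earlier half first
                  df[m].assign(start_date=certain_date)])  # later half second
         .sort_index(kind='mergesort')   # stable sort: ties keep the concat order
         .reset_index(drop=True))        # renumber 0..n-1
print(out)
```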
CodePudding user response:
You can try splitting start_date, certain_date and end_date into lists for the rows that need it, then explode the result (exploding several columns at once requires pandas 1.3 or later):
df[['start_date', 'end_date']] = df.apply(
    lambda row: [[row['start_date'], certain_date],
                 [certain_date, row['end_date']]]
                if row['start_date'] < certain_date < row['end_date']
                else [row['start_date'], row['end_date']],
    axis=1, result_type='expand')
out = df.explode(['start_date', 'end_date'], ignore_index=True)
print(out)
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9
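Both answers compare the dates as strings, which works because they are in ISO YYYY-MM-DD format. If the columns hold real datetimes instead, a row-repeat approach may avoid the row-wise apply. This is only a rough sketch: split_rows_at is a hypothetical helper name (not a pandas function), and it assumes the dataframe has a unique index.

```python
import pandas as pd

def split_rows_at(df, date, start='start_date', end='end_date'):
    """Hypothetical helper: split rows whose (start, end) strictly contains date."""
    date = pd.Timestamp(date)
    tmp = df.copy()
    tmp[start] = pd.to_datetime(tmp[start])
    tmp[end] = pd.to_datetime(tmp[end])

    # True for rows whose interval strictly contains the date
    inside = (tmp[start] < date) & (date < tmp[end])

    # repeat matching rows twice, all other rows once (assumes a unique index)
    out = tmp.loc[tmp.index.repeat(inside.to_numpy().astype(int) + 1)]

    # within each repeated pair, mark the first and the second copy
    first_half = out.index.duplicated(keep='last')
    second_half = out.index.duplicated(keep='first')

    out = out.reset_index(drop=True)
    out.loc[first_half, end] = date      # first copy ends at the date
    out.loc[second_half, start] = date   # second copy starts at the date
    return out

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])

result = split_rows_at(df, '2022-09-10')
print(result)
```

The result has the same rows as the expected output in the question, except that start_date and end_date are datetime64 columns rather than strings.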
