I have a DataFrame like:
df = pd.DataFrame(index = [0,1,2,3,4,5])
df['XYZ'] = [2, 8, 6, 5, 9, 10]
df['Date2'] = ["2005-01-06", "2005-01-07", "2005-01-08", "1994-06-08", "1999-06-15", "2005-01-09"]
df['Date1'] = ["2005-01-02", "2005-01-03", "2005-01-04", "1994-06-04", "1999-06-12", "2005-01-05"]
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
I need to follow the 2 largest values of XYZ with dates that do not overlap. The expected output would be:
XYZ Date1 Date2
10 2005-01-05 2005-01-09
9 1999-06-12 1999-06-15
5 1994-06-04 1994-06-08
I tried to sort by "XYZ":
df.sort_values(by="XYZ", ascending=False, inplace=True)
And then compare dates:
df['overlap'] = (df['Date1] <= df['Date2'].shift()) & (df['Date2'] >= df['Date1'].shift())
And then drop any True values in df['overlap'] and take the nlargest() values, however that results in cases that do overlap.
Any help would be much appreciated.
CodePudding user response:
This is somewhat involved but hopefully will work for you. We introduce a mask indexed by every date between the min and the max date in your df, where we mark each date as 'used' if it appears in the range, and then use that to reject overlapping rows
First we get the min and the max date (while also sorting the original df by 'XYZ')
df1 = df.sort_values('XYZ', ascending = False)
dmin, dmax = df1[['Date1', 'Date2']].unstack().agg([min,max])
then we create a mask populated with 0s initially
mask = pd.Series(index = pd.date_range(dmin,dmax), data = 0)
Then we iterate over rows marking those we want in the 'include' column
for idx,row in df1.iterrows():
if sum(mask[row['Date1']:row['Date2']]) > 0:
df1.loc[idx, 'include'] = False
continue
mask[row['Date1']:row['Date2']] = 1
df1.loc[idx, 'include'] = True
finally filter on 'include'
df1[df1['include']].drop(columns = 'include')
output
XYZ Date1 Date2
5 10 2005-01-05 2005-01-09
4 9 1999-06-12 1999-06-15
3 5 1994-06-04 1994-06-08
