Home > Back-end >  Find Pandas column largest/smallest values where dates don't overlap
Find Pandas column largest/smallest values where dates don't overlap

Time:01-27

I have a DataFrame like:

df = pd.DataFrame(index = [0,1,2,3,4,5])
df['XYZ'] = [2, 8, 6, 5, 9, 10]
df['Date2'] = ["2005-01-06", "2005-01-07", "2005-01-08", "1994-06-08", "1999-06-15", "2005-01-09"]
df['Date1'] = ["2005-01-02", "2005-01-03", "2005-01-04", "1994-06-04", "1999-06-12", "2005-01-05"]
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])

I need to follow the 2 largest values of XYZ with dates that do not overlap. The expected output would be:

XYZ   Date1       Date2
10    2005-01-05  2005-01-09
9     1999-06-12  1999-06-15
5     1994-06-04  1994-06-08

I tried to sort by "XYZ":

df.sort_values(by="XYZ", ascending=False, inplace=True)

And then compare dates:

df['overlap'] = (df['Date1] <= df['Date2'].shift()) & (df['Date2'] >= df['Date1'].shift())

And then drop any True values in df['overlap'] and take the nlargest() values, however that results in cases that do overlap.

Any help would be much appreciated.

CodePudding user response:

This is somewhat involved but hopefully will work for you. We introduce a mask indexed by every date between the min and the max date in your df, where we mark each date as 'used' if it appears in the range, and then use that to reject overlapping rows

First we get the min and the max date (while also sorting the original df by 'XYZ')

df1 = df.sort_values('XYZ', ascending = False)
dmin, dmax = df1[['Date1', 'Date2']].unstack().agg([min,max])

then we create a mask populated with 0s initially

mask = pd.Series(index = pd.date_range(dmin,dmax), data = 0)

Then we iterate over rows marking those we want in the 'include' column

for idx,row in df1.iterrows():
    if sum(mask[row['Date1']:row['Date2']]) > 0:
        df1.loc[idx, 'include'] = False
        continue
    mask[row['Date1']:row['Date2']] = 1
    df1.loc[idx, 'include'] = True

finally filter on 'include'

df1[df1['include']].drop(columns = 'include')

output

    XYZ Date1       Date2
5   10  2005-01-05  2005-01-09
4   9   1999-06-12  1999-06-15
3   5   1994-06-04  1994-06-08
  •  Tags:  
  • Related