I have 2 dataframes: df_dec_light and df_rally.
df_dec_light.head():
log_return month year
1970-12-01 0.003092 12 1970
1970-12-02 0.011481 12 1970
1970-12-03 0.004736 12 1970
1970-12-04 0.006279 12 1970
1970-12-07 0.005351 12 1970
1970-12-08 -0.005239 12 1970
1970-12-09 0.000782 12 1970
1970-12-10 0.004235 12 1970
1970-12-11 0.003774 12 1970
1970-12-14 -0.005109 12 1970
df_rally.head():
rally_start rally_end
0 1970-12-18 1970-12-31
1 1971-12-17 1971-12-31
2 1972-12-15 1972-12-29
3 1973-12-21 1973-12-31
4 1974-12-20 1974-12-31
I need to filter df_dec_light based on condition that df_dec_light.index is between values of columns df_rally['rally_start']and df_rally['rally_end'].
I've tried something like this:
df_dec_light[(df_dec_light.index >= df_rally['rally_start']) & (df_dec_light.index <= df_rally['rally_end'])]
I was expecting to to recieve filtered df_dec_light dataframe with indexes that are within intervals between df_rally['rally_start'] and df_rally['rally_end']. Something like this:
log_return month year
1970-12-18 0.001997 12 1970
1970-12-21 -0.003108 12 1970
1970-12-22 0.001111 12 1970
1970-12-23 0.000666 12 1970
1970-12-24 0.005644 12 1970
1970-12-28 0.005283 12 1970
1970-12-29 0.010810 12 1970
1970-12-30 0.002061 12 1970
1970-12-31 -0.001301 12 1970
Would really apreciate any help. Thanks!
CodePudding user response:
To solve this we can first turn the ranges in df_rally into pd.DateTimeIndex by calling pd.date_range on each row. This will give us each row of df_rally as a pd.DateTimeIndex.
As we want to later check if the index of df_dec_light is in any of the ranges, we want to combine all of these ranges. This is done with union.
We assert that the newly created pd.Series index_list is not empty and then select its first element. This element is the pd.DateTimeIndex on which we can now call union with all other pd.DateTimeIndex.
We can now use pd.Index.isin to create a boolean array of whether each index Date is found in the passed set of Dates.
If we now apply this mask to df_dec_light it returns only the entries that are within one of the specified ranges of df_rally.
index_list = df_rally.apply(lambda x: pd.date_range(x['rally_start'], x['rally_end']), axis=1)
assert(not index_list.empty)
all_ranges=index_list.iloc[0]
for range in index_list:
all_ranges=all_ranges.union(range)
print(all_ranges)
mask = df_dec_light.index.isin(all_ranges)
print(df_dec_light[mask])
CodePudding user response:
Let's create an IntervalIndex from the start and end column values in df_rally dataframe, then map the intervals on index of df_dec_light dataframe and use notna to check if the index values are contained in any interval
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end, closed='both')
mask = df_dec_light.index.map(ix.to_series()).notna()
then use the mask to filter the dataframe
df_dec_light[mask]
