I have a pandas dataframe of stock records, my goal is to pass in a particular 'day' e.g 8 and get the filtered data frame for the 8th of each month and year in the dataset. I have gone through some SO questions and managed to get one part of my requirement that was getting the records for a particular day, however if the data for say '8th' does not exist for the particular month and year, I need to get the records for the closest day where record exists for this particular month and year.
As an example, if I pass in 8th and there is no record for 8th Jan' 2022, I need to see if records exists for 7th and 9th Jan'22, and so on..and get the record for the nearest date.
If record is present in both 7th and 9th, I will get the record for 9th (higher date).
However, it is possible if the record for 7th exists and 9th does not exist, then I will get the record for 7th (closest). Code I have written so far
filtered_df = data.loc[(data['Date'].dt.day == 8)]
If the dataset is required, please let me know. I tried to make it clear but if there is any doubt, please let me know. Any help in the correct direction is appreciated.
CodePudding user response:
Alternative 1
Resample to a daily resolution, selecting the nearest day to fill in missing values:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
Alternative 2
A more general method (and a tiny bit faster) is to generate dates/times of your choice, then use reindex() and method 'nearest'. It is more general because you can use any series of timestamps you could come up with (not necessarily aligned with any frequency).
dates = pd.date_range(
start=df.first_valid_index().normalize(), end=df.last_valid_index(),
freq='D')
dates = dates[dates.day == 8]
df2 = df.reindex(dates, method='nearest')
Example
Let's start with a reproducible example:
import yfinance as yf
df = yf.download(['AAPL', 'AMZN'], start='2022-01-01', end='2022-12-31', freq='D')
>>> df.iloc[:10, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
Date
2022-01-03 180.959747 170.404495 182.009995 170.404495 182.880005
2022-01-04 178.663086 167.522003 179.699997 167.522003 182.940002
2022-01-05 173.910645 164.356995 174.919998 164.356995 180.169998
2022-01-06 171.007523 163.253998 172.000000 163.253998 175.300003
2022-01-07 171.176529 162.554001 172.169998 162.554001 174.139999
2022-01-10 171.196426 161.485992 172.190002 161.485992 172.500000
2022-01-11 174.069748 165.362000 175.080002 165.362000 175.179993
2022-01-12 174.517136 165.207001 175.529999 165.207001 177.179993
2022-01-13 171.196426 161.214005 172.190002 161.214005 176.619995
2022-01-14 172.071335 162.138000 173.070007 162.138000 173.779999
Now:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
>>> df2.iloc[:5, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
2022-01-08 171.176529 162.554001 172.169998 162.554001 174.139999
2022-02-08 174.042633 161.413498 174.830002 161.413498 175.350006
2022-03-08 156.730942 136.014496 157.440002 136.014496 162.880005
2022-04-08 169.323975 154.460495 170.089996 154.460495 171.779999
2022-05-08 151.597595 108.789001 152.059998 108.789001 155.830002
Warning
Replacing a missing day with data from the future (which is what happens when the nearest day is after the missing one) is called peak-ahead and can cause peak-ahead bias in quant research that would use that data. It is usually considered dangerous. You'd be safer using method='ffill'.
