Home > Software design >  How to resample ohlc data properly in pandas / custom fill method per column
How to resample ohlc data properly in pandas / custom fill method per column

Time:01-28

I have got OHLC data with missing time frames. Suppose I have the following pandas dataframe denoted by the variable df:

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

Now, I resample that pandas dataframe to fill the missing gap and I get the following:

df = df.resample('T').ffill()

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

From the above, we can see that the missing gap (00:01:00) is filled with the help of ffill(). However, the data in that row (row starting with 00:01:00) is not displayed properly as the opening price should be the same as the closing price of the previous row (row starting with 00:00:00). Likewise, the closing price of that row (row starting with 00:01:00) should be the same as the opening price of the next row (row starting with 00:02:00). The desired output should look like this:

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67062
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

How would I resolve this problem in pandas?

CodePudding user response:

Unfortunately, you can't directly specify a fill method per column.

A workaround would be not to fill the values during the resampling but to do it afterwards:

df = df.resample('T').fillna(None)

df['Open'], df['Close'] = (df['Open'].fillna(df['Close'].ffill()),
                           df['Close'].fillna(df['Open'].bfill()))

df = df.ffill()

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67062
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

previous answer (incorrect for OHLC but interesting as generalization)

(df.resample('T')
   .fillna(None)
   .assign(Close=lambda d: d['Close'].bfill())  # bfill for Close
   .ffill()                                     # ffill for others
)

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67068  0.67123  0.67064  0.67223
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
cross filling the values:
(df.resample('T')
   .fillna(None)
   .assign(Open=lambda d: d['Open'].fillna(d['Close'].ffill())) # Open = last Close
   .ffill()  # ffill the others
)

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
more options

Here is another example where we'll interpolate High and leave Low as NaNs:

(df.resample('T')
   .fillna(None)
   .assign(Open=lambda d: d['Open'].ffill(),
           Close=lambda d: d['Close'].bfill(),
           High=lambda d: d['High'].interpolate()
          )
)

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67068  0.67274      NaN  0.67223
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
  •  Tags:  
  • Related