Home > Mobile >  Pandas upsample and nearest interpolation give only NaNs
Pandas upsample and nearest interpolation give only NaNs

Time:01-30

I have a dataframe (df, time as index and 1 column 'Pt0') that I want to upsample and interpolate with "nearest neighbor" method. I have 2 issues:

  1. When I compute df = df.upsample('1D'), I get an object core.resample.DatetimeIndexResampler which keeps me from recovering the values of my column (but I can get the index) while I only want a dataframe as output. What I don't understand is that applying this command to other dataframes usually gives me a dataframe, not that "core" object.
  2. If I apply the upsampling and interpolation directly: df = df.resample('1D').interpolate(method='nearest') I only obtain NaNs while before I had NaNs and values.

I don't understand what I am doing wrong, and I wasn't able to understand why a "core" object is created while this same method (df.resample('1D')) gave me dataframes in other cases. How can I solve this problem ?

Ps: df does not have duplicates in the index because it was computed specifically to avoid any (Pandas drop duplicates and replace the value by the nanmean of the duplicates).

Here is the dataframe:

df
Out[174]: 
                                 Pt0
1984-06-10 00:00:00.096000064  -42.0
1984-07-20 00:00:00.176000000    NaN
1984-07-28 00:00:00.192000000  -26.0
1984-10-08 00:00:00.336000064  -12.0
1984-10-16 00:00:00.352000000   -5.0
                             ...
2021-04-05 08:48:28.559141120 -248.0
2021-04-05 08:48:29.059141120 -318.0
2021-04-19 20:36:46.060141056 -311.0
2021-05-04 03:02:44.279659008 -254.0
2021-05-29 02:55:17.930625024 -286.0

[529 rows x 1 columns]

Code to reproduce my issues:

    df = pd.DataFrame({'Pt0': [np.nan, -42.0, np.nan, np.nan, -26.0, np.nan, np.nan, np.nan, 0.0, -10.0]}, 
                 index=['1984-06-10 00:00:00.096000064', '1984-06-10 00:00:00.096000064',
                        '1984-07-20 00:00:00.176000000', '1984-07-20 00:00:00.176000000',
                        '1984-07-28 00:00:00.192000000', '1984-07-28 00:00:00.192000000',
                        '1984-09-06 00:00:00.080000000', '1984-09-06 00:00:00.080000000',
                        '1984-09-06 00:00:00.271999936', '1984-09-06 00:00:00.271999936'])
    df.index = pd.to_datetime(df.index)
    df = df.groupby(level=0)['Pt0'].transform(np.nanmean).drop_duplicates().to_frame()
    df2 = df.resample('1D')
    df3 = df.resample('1D').interpolate('nearest')

CodePudding user response:

You're doing nothing wrong but you missed one step before. You need to align your source index to your target index (days): 1984-06-10 00:00:00.096000064 is not equal to 1984-06-10 i.e. 1984-06-10 00:00:00.000000000. That's why you can see your original values in the Resampler object but not in the final result:

>>> list(df.resample('D'))[0]
(Timestamp('1984-06-10 00:00:00', freq='D'),
                                 Pt0
 1984-06-10 00:00:00.096000064 -42.0)

What you need is to apply an operation between resample and interpolate. In the present case, take the first value is sufficient but imagine you have multiple values for a same day, how Pandas can guess which value to pick?. In fact, before upsampling to get extra days you need to downsampling your intraday values.

Step-1:

>>> df.resample('D').first()  # or mean() or whatever you want
             Pt0
1984-06-10 -42.0  # <- now the index and values are aligned
1984-06-11   NaN
1984-06-12   NaN
1984-06-13   NaN
1984-06-14   NaN
...          ...
1984-09-02   NaN
1984-09-03   NaN
1984-09-04   NaN
1984-09-05   NaN
1984-09-06  -5.0

[89 rows x 1 columns]

Step-2:

>>> df.resample('D').first().interpolate('time')  # or nearest or ...
                  Pt0
1984-06-10 -42.000000
1984-06-11 -41.666667
1984-06-12 -41.333333
1984-06-13 -41.000000
1984-06-14 -40.666667
...               ...
1984-09-02  -7.100000
1984-09-03  -6.575000
1984-09-04  -6.050000
1984-09-05  -5.525000
1984-09-06  -5.000000

[89 rows x 1 columns]

You can do the same with groupby which can be more intuitive at all because it's more natural: I want to group my values per day then compute the mean and finally fill missing values by interpolate with a X method

>>> df.groupby(pd.Grouper(freq='D')).mean().interpolate('time')
                  Pt0
1984-06-10 -42.000000
1984-06-11 -41.666667
1984-06-12 -41.333333
1984-06-13 -41.000000
1984-06-14 -40.666667
...               ...
1984-09-02  -7.100000
1984-09-03  -6.575000
1984-09-04  -6.050000
1984-09-05  -5.525000
1984-09-06  -5.000000

[89 rows x 1 columns]
  •  Tags:  
  • Related