Home > OS >  Why do I get unexpected nonzero values when doing a quantiles calculation from timeseries data?
Why do I get unexpected nonzero values when doing a quantiles calculation from timeseries data?

Time:01-31

I have a csv file with timeseries data of hourly PV production for a whole year. I want to get the quantiles (from 0.1 to 0.9) for each hour in order to represent their behavior every 24 h for all days of the year.

It seemed to work ok at first, but then I realized that in my some of the quantiles I have values different than zero (0) at 00:00 which is impossible as my dataset does not have such values in that time of the day (I triple checked it).

Where is the mistake?

My code:

colnames = ['Date', 'Energy']
df_1 = pd.read_csv('PV_Autumn_2020.csv', names = colnames  , encoding="utf8", delimiter=";")

start_date = datetime(2022, 9, 27, 1, 0)
end_date = datetime(2022, 9, 28, 1, 0)

def daterange(start_date, end_date):
    delta = timedelta(hours=1)
    while start_date < end_date:
        yield start_date
        start_date  = delta
        
df = pd.DataFrame(columns = ['Time','pct0.1','pct0.2','pct0.3','pct0.4','pct0.5','pct0.6','pct0.7','pct0.8','pct0.9'])

for single_date in daterange(start_date, end_date):
            df.loc[single_date, ['Time']] = single_date.strftime("%H:%M")


x = []

for index in df.index:
    y = df_1.loc[df_1['Date'].str.contains(df['Time'][index])]
    for i in np.arange(1, 10, 1)/10:
        x.append(y.quantile(i))

pct = pd.DataFrame(x, columns = ['Energy'])

df['pct0.1'] = pct.loc[0.1].values
df['pct0.2'] = pct.loc[0.2].values
df['pct0.3'] = pct.loc[0.3].values
df['pct0.4'] = pct.loc[0.4].values
df['pct0.5'] = pct.loc[0.5].values
df['pct0.6'] = pct.loc[0.6].values
df['pct0.7'] = pct.loc[0.7].values
df['pct0.8'] = pct.loc[0.8].values
df['pct0.9'] = pct.loc[0.9].values

A part of the csv file:

2021/03/01 00:00:00;0
2021/03/01 01:00:00;0
2021/03/01 02:00:00;0
2021/03/01 03:00:00;0
2021/03/01 04:00:00;0
2021/03/01 05:00:00;0
2021/03/01 06:00:00;0
2021/03/01 07:00:00;0
2021/03/01 08:00:00;111
2021/03/01 09:00:00;609
2021/03/01 10:00:00;1152
2021/03/01 11:00:00;1596
2021/03/01 12:00:00;1919
2021/03/01 13:00:00;2062
2021/03/01 14:00:00;2003
2021/03/01 15:00:00;1755
2021/03/01 16:00:00;1351
2021/03/01 17:00:00;801
2021/03/01 18:00:00;233
2021/03/01 19:00:00;0
2021/03/01 20:00:00;0
2021/03/01 21:00:00;0
2021/03/01 22:00:00;0
2021/03/01 23:00:00;0
2021/03/02 00:00:00;0
2021/03/02 01:00:00;0
2021/03/02 02:00:00;0
2021/03/02 03:00:00;0
2021/03/02 04:00:00;0
2021/03/02 05:00:00;0
2021/03/02 06:00:00;0
2021/03/02 07:00:00;0
2021/03/02 08:00:00;92
2021/03/02 09:00:00;449
2021/03/02 10:00:00;905
2021/03/02 11:00:00;1387
2021/03/02 12:00:00;1516
2021/03/02 13:00:00;1617
2021/03/02 14:00:00;1671
2021/03/02 15:00:00;1525
2021/03/02 16:00:00;1290
2021/03/02 17:00:00;899
2021/03/02 18:00:00;312
2021/03/02 19:00:00;0
2021/03/02 20:00:00;0
2021/03/02 21:00:00;0
2021/03/02 22:00:00;0
2021/03/02 23:00:00;0

The dataframe created

    Time    pct0.1  pct0.2  pct0.3  pct0.4  pct0.5  pct0.6  pct0.7  pct0.8  pct0.9
2022-09-27 01:00:00 01:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 02:00:00 02:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 03:00:00 03:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 04:00:00 04:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 05:00:00 05:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 06:00:00 06:00   0.0 0.0 0.0 1.0 4.0 13.0    24.0    39.0    56.0
2022-09-27 07:00:00 07:00   25.0    51.0    84.0    148.0   180.0   249.0   297.0   378.0   426.0
2022-09-27 08:00:00 08:00   271.0   374.0   444.0   540.0   632.0   763.0   843.0   885.0   969.0
2022-09-27 09:00:00 09:00   545.0   743.0   880.0   1012.0  1079.0  1182.0  1285.0  1359.0  1401.0
2022-09-27 10:00:00 10:00   754.0   1037.0  1184.0  1315.0  1392.0  1478.0  1574.0  1654.0  1700.0
2022-09-27 11:00:00 11:00   858.0   1155.0  1291.0  1454.0  1540.0  1612.0  1695.0  1800.0  1853.0
2022-09-27 12:00:00 12:00   866.0   1185.0  1333.0  1452.0  1545.0  1630.0  1721.0  1820.0  1893.0
2022-09-27 13:00:00 13:00   871.0   1069.0  1239.0  1350.0  1435.0  1521.0  1612.0  1715.0  1817.0
2022-09-27 14:00:00 14:00   673.0   943.0   1036.0  1112.0  1207.0  1318.0  1383.0  1484.0  1622.0
2022-09-27 15:00:00 15:00   420.0   590.0   689.0   743.0   805.0   989.0   1045.0  1163.0  1297.0
2022-09-27 16:00:00 16:00   130.0   184.0   222.0   304.0   385.0   487.0   603.0   682.0   855.0
2022-09-27 17:00:00 17:00   1.0 3.0 7.0 30.0    55.0    96.0    152.0   207.0   338.0
2022-09-27 18:00:00 18:00   0.0 0.0 0.0 0.0 0.0 2.0 4.0 11.0    31.0
2022-09-27 19:00:00 19:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 20:00:00 20:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 21:00:00 21:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 22:00:00 22:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-27 23:00:00 23:00   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-09-28 00:00:00 00:00   0.0 0.0 0.0 0.0 5.0 163.0   626.1   1059.0  1439.7

CodePudding user response:

You calculate the quantiles by testing where the relevant time string is part of the date string in the input data. The date string consists of year, month, day, hour, minutes and seconds. The time string, however, only consists of the hour and minutes. With the input dates have all their seconds set to zero, your input date timestamps are like "01:00:00", "02:00:00" and so. Which is fine, but when comparing the strings (with in), there is a problem for the comparison time string "00:00": it fits all date strings: it just matches hours and minutes with minutes and seconds. E.g., "00:00" is in "02:00:00" evaluates to True. As a result, all(?) input data will be added to the "00:00" timestamp, instead of no input data.

So, an easy solution is to add seconds to your comparison times:

for single_date in daterange(start_date, end_date):
    df.loc[single_date, ['Time']] = single_date.strftime("%H:%M:%S")

Of course, there are likely better ways for comparing the timestamps with the datestamp, or calculating the quantiles, but this seems to be the actual issue behing your current problem.

  •  Tags:  
  • Related