Home > Software design >  How to sample data from Pandas DataFrame where data is present for every hour of a given day
How to sample data from Pandas DataFrame where data is present for every hour of a given day

Time:01-04

I wish to create a DataFrame where each row is one day, and the columns provide the date, hourly data, and maximum minimum of the day's data. Here is an example (I provide the input data further down in the question):

     Date_time  00:00   01:00   02:00   03:00   04:00   05:00   06:00   07:00   08:00   09:00   10:00   11:00   12:00   13:00   14:00   15:00   16:00   17:00   18:00   19:00   20:00   21:00   22:00   23:00   Max     Min
0   2019-02-03  18.6    18.6    18.2    18.0    18.0    18.3    18.7    20.1    21.7    23.3    23.7    24.6    25.1    24.5    23.9    19.6    19.2    19.8    19.6    19.3    19.2    19.3    18.8    19.0    25.7    17.9
1   2019-02-04  18.9    18.8    18.6    18.4    18.7    18.8    19.0    19.7    21.4    23.5                            25.8    25.4    22.1    21.8    21.0    18.9    18.8    18.9    18.8    18.8    18.9    27.8    18.1

My input DataFrame has a row for each hour, with the date & time, mean, max, and min for each hour as its columns.

I wish to iterate through each day in the input DataFrame and do the following:

  • Check that there is a row for each hour of the day
  • Check that there is both maximum and minimum data for each hour of the day

If the conditions above are met, I wish to:

  • Add a row to the output DataFrame for the given date
  • Use the date to fill the 'Date_time' cell for the row
  • Transpose the hourly data to the hourly cells
  • Find the max of the hourly max data, and use it to fill the max cell for the row
  • Find the min of the hourly min data, and use it to fill the min cell for the row

Example daily input data examples follow.

Example 1

  • All hours for day available

  • Max & min available for each hour

  • Proceed to create row in output DataFrame

          Date_time           Mean_temp Max_temp Min_temp
     0    2019-02-03 00:00:00 18.6      18.7     18.5
     1    2019-02-03 01:00:00 18.6      18.7     18.5
     2    2019-02-03 02:00:00 18.2      18.5     18.0
     3    2019-02-03 03:00:00 18.0      18.0     17.9
     4    2019-02-03 04:00:00 18.0      18.1     17.9
     5    2019-02-03 05:00:00 18.3      18.4     18.1
     6    2019-02-03 06:00:00 18.7      19.1     18.4
     7    2019-02-03 07:00:00 20.1      21.3     19.1
     8    2019-02-03 08:00:00 21.7      22.9     21.0
     9    2019-02-03 09:00:00 23.2      23.9     22.8
     10   2019-02-03 10:00:00 23.7      24.1     23.3
     11   2019-02-03 11:00:00 24.6      25.5     24.0
     12   2019-02-03 12:00:00 25.1      25.7     24.7
     13   2019-02-03 13:00:00 24.5      25.0     24.2
     14   2019-02-03 14:00:00 23.9      25.3     21.2
     15   2019-02-03 15:00:00 19.6      21.2     18.8
     16   2019-02-03 16:00:00 19.2      19.5     18.7
     17   2019-02-03 17:00:00 19.8      19.9     19.4
     18   2019-02-03 18:00:00 19.6      19.8     19.5
     19   2019-02-03 19:00:00 19.3      19.4     19.1
     20   2019-02-03 20:00:00 19.2      19.4     19.1
     21   2019-02-03 21:00:00 19.3      19.4     18.9
     22   2019-02-03 22:00:00 18.8      19.0     18.7
     23   2019-02-03 23:00:00 19.0      19.1     18.9
    

Example 2

  • All hours for day available

  • Max & min available for each hour

  • NaN values for some Mean_temp entries

  • Proceed to create row in output DataFrame

          Date_time           Mean_temp Max_temp Min_temp
     24   2019-02-04 00:00:00 18.9      19.0     18.9
     25   2019-02-04 01:00:00 18.8      18.9     18.7
     26   2019-02-04 02:00:00 18.6      18.8     18.4
     27   2019-02-04 03:00:00 18.4      18.6     18.1
     28   2019-02-04 04:00:00 18.7      18.9     18.4
     29   2019-02-04 05:00:00 18.8      18.8     18.7
     30   2019-02-04 06:00:00 19.0      19.3     18.8
     31   2019-02-04 07:00:00 19.7      20.4     19.3
     32   2019-02-04 08:00:00 21.4      22.8     20.3
     33   2019-02-04 09:00:00 23.5      23.9     22.8
     34   2019-02-04 10:00:00           25.7     23.6
     35   2019-02-04 11:00:00           26.5     25.4
     36   2019-02-04 12:00:00           27.1     26.1
     37   2019-02-04 13:00:00 25.8      26.8     24.8
     38   2019-02-04 14:00:00 25.4      27.8     23.7
     39   2019-02-04 15:00:00 22.1      24.1     20.2
     40   2019-02-04 16:00:00 21.8      22.6     20.2
     41   2019-02-04 17:00:00 20.9      22.4     19.6
     42   2019-02-04 18:00:00 18.9      19.6     18.6
     43   2019-02-04 19:00:00 18.8      18.9     18.6
     44   2019-02-04 20:00:00 18.9      19.0     18.8
     45   2019-02-04 21:00:00 18.8      18.9     18.7
     46   2019-02-04 22:00:00 18.8      18.9     18.7
     47   2019-02-04 23:00:00 18.9      19.2     18.7
    

Example 3

  • Not all hours of the day are available

  • Do not create row in output DataFrame

          Date_time           Mean_temp Max_temp Min_temp
     48   2019-02-05 00:00:00 19.2      19.3     19.0
     49   2019-02-05 01:00:00 19.3      19.4     19.3
     50   2019-02-05 02:00:00 19.3      19.4     19.2
     51   2019-02-05 03:00:00 19.4      19.5     19.4
     52   2019-02-05 04:00:00 19.5      19.6     19.3
     53   2019-02-05 05:00:00 19.3      19.5     19.1
     54   2019-02-05 06:00:00 20.1      20.6     19.2
     55   2019-02-05 07:00:00 21.1      21.7     20.6
     56   2019-02-05 08:00:00 22.3      23.2     21.7
     57   2019-02-05 15:00:00 25.3      25.8     25.0
     58   2019-02-05 16:00:00 25.8      26.0     25.2
     59   2019-02-05 17:00:00 24.3      25.2     23.3
     60   2019-02-05 18:00:00 22.5      23.3     22.1
     61   2019-02-05 19:00:00 21.6      22.1     21.1
     62   2019-02-05 20:00:00 21.1      21.3     20.9
     63   2019-02-05 21:00:00 21.2      21.3     20.9
     64   2019-02-05 22:00:00 20.9      21.0     20.6
     65   2019-02-05 23:00:00 19.9      20.6     19.7
    

Example 4

  • All hours of the day are available

  • Max and/or min have at least one NaN value

  • Do not create row in output DataFrame

          Date_time           Mean_temp Max_temp Min_temp
     66   2019-02-06 00:00:00 19.7      19.8     19.7
     67   2019-02-06 01:00:00 19.6      19.7     19.3
     68   2019-02-06 02:00:00 19.0      19.3     18.6
     69   2019-02-06 03:00:00 18.5      18.6     18.4
     70   2019-02-06 04:00:00 18.6      18.7     18.4
     71   2019-02-06 05:00:00 18.5      18.6     
     72   2019-02-06 06:00:00 19.0      19.6     18.5
     73   2019-02-06 07:00:00 20.3      21.2     19.6
     74   2019-02-06 08:00:00 21.5      21.7     21.2
     75   2019-02-06 09:00:00 21.4      22.3     20.9
     76   2019-02-06 10:00:00 23.5      24.4     22.3
     77   2019-02-06 11:00:00 24.7      25.4     24.3
     78   2019-02-06 12:00:00 24.9      25.5     23.9
     79   2019-02-06 13:00:00 23.4      24.0     22.9
     80   2019-02-06 14:00:00 23.3      23.8     22.9
     81   2019-02-06 15:00:00 24.4               23.7
     82   2019-02-06 16:00:00 24.9      25.1     24.7
     83   2019-02-06 17:00:00 24.4      24.9     23.8
     84   2019-02-06 18:00:00 22.5      23.8     21.7
     85   2019-02-06 19:00:00 20.8      21.8     19.6
     86   2019-02-06 20:00:00 19.1      19.6     18.9
     87   2019-02-06 21:00:00 19.0      19.1     18.9
     88   2019-02-06 22:00:00 19.1      19.1     19.0
     89   2019-02-06 23:00:00 19.1      19.1     19.0
    

Just to recap, the above inputs would create the following output:

     Date_time  00:00   01:00   02:00   03:00   04:00   05:00   06:00   07:00   08:00   09:00   10:00   11:00   12:00   13:00   14:00   15:00   16:00   17:00   18:00   19:00   20:00   21:00   22:00   23:00   Max     Min
0   2019-02-03  18.6    18.6    18.2    18.0    18.0    18.3    18.7    20.1    21.7    23.3    23.7    24.6    25.1    24.5    23.9    19.6    19.2    19.8    19.6    19.3    19.2    19.3    18.8    19.0    25.7    17.9
1   2019-02-04  18.9    18.8    18.6    18.4    18.7    18.8    19.0    19.7    21.4    23.5                            25.8    25.4    22.1    21.8    21.0    18.9    18.8    18.9    18.8    18.8    18.9    27.8    18.1

I've had a really good think about this, and I can only come up with a horrible set of if statements that I known will be terribly slow and will take ages to write (apologies, this is due to me being bad at coding)!

Does anyone have any pointers to Pandas functions that could begin to deal with this problem efficiently?

CodePudding user response:

You can use a groupby on the day of the Date_time column, and build each row of your final_df from each group (moving to the next iteration of the groupby whenever there are any missing values in the max_temp or min_temp columns, or whenever the length of the group is less than 24)

Note that I assuming that your Date_time column is of type datetime64[ns]. If it isn't, you should run the line: df['Date_time'] = pd.to_datetime(df['Date_time'])

all_hours = list(pd.date_range(start='1/1/22 00:00:00', end='1/1/22 23:00:00', freq='h').strftime('%H:%M'))
final_df = pd.DataFrame(columns=['Date_time']   all_hours   ['Max','Min'])

## construct final_df by using a groupby on the day of the 'Date_time' column
for group,df_group in df.groupby(df['Date_time'].dt.date):
    ## check if NaN is in either 'Max Temp' or 'Min Temp' columns
    new_df_data = {}
    if (df_group[['Max_temp','Min_temp']].isnull().sum().sum() == 0) & (len(df_group) == 24):
        ## create a dictionary for the new row of the final_df
        new_df_data['Date_time'] = group
        new_df_data.update(dict(zip(all_hours, [[val] for val in df_group['Mean_temp']])))
        new_df_data['Max'], new_df_data['Min'] =  df_group['Max_temp'].max(), df_group['Min_temp'].min()
        final_df = pd.concat([final_df, pd.DataFrame(new_df_data)])
    else:
        continue

Output:

>>> final_df
    Date_time  00:00  01:00  02:00  03:00  04:00  05:00  06:00  07:00  08:00  09:00  10:00  11:00  12:00  13:00  14:00  15:00  16:00  17:00  18:00  19:00  20:00  21:00  22:00  23:00   Max   Min
0  2019-02-03   18.6   18.6   18.2   18.0   18.0   18.3   18.7   20.1   21.7   23.2   23.7   24.6   25.1   24.5   23.9   19.6   19.2   19.8   19.6   19.3   19.2   19.3   18.8   19.0  25.7  17.9
0  2019-02-04   18.9   18.8   18.6   18.4   18.7   18.8   19.0   19.7   21.4   23.5    NaN    NaN    NaN   25.8   25.4   22.1   21.8   20.9   18.9   18.8   18.9   18.8   18.8   18.9  27.8  18.1
  •  Tags:  
  • Related