I wish to create a DataFrame where each row is one day, and the columns provide the date, hourly data, and maximum minimum of the day's data. Here is an example (I provide the input data further down in the question):
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
My input DataFrame has a row for each hour, with the date & time, mean, max, and min for each hour as its columns.
I wish to iterate through each day in the input DataFrame and do the following:
- Check that there is a row for each hour of the day
- Check that there is both maximum and minimum data for each hour of the day
If the conditions above are met, I wish to:
- Add a row to the output DataFrame for the given date
- Use the date to fill the 'Date_time' cell for the row
- Transpose the hourly data to the hourly cells
- Find the max of the hourly max data, and use it to fill the max cell for the row
- Find the min of the hourly min data, and use it to fill the min cell for the row
Example daily input data examples follow.
Example 1
All hours for day available
Max & min available for each hour
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp 0 2019-02-03 00:00:00 18.6 18.7 18.5 1 2019-02-03 01:00:00 18.6 18.7 18.5 2 2019-02-03 02:00:00 18.2 18.5 18.0 3 2019-02-03 03:00:00 18.0 18.0 17.9 4 2019-02-03 04:00:00 18.0 18.1 17.9 5 2019-02-03 05:00:00 18.3 18.4 18.1 6 2019-02-03 06:00:00 18.7 19.1 18.4 7 2019-02-03 07:00:00 20.1 21.3 19.1 8 2019-02-03 08:00:00 21.7 22.9 21.0 9 2019-02-03 09:00:00 23.2 23.9 22.8 10 2019-02-03 10:00:00 23.7 24.1 23.3 11 2019-02-03 11:00:00 24.6 25.5 24.0 12 2019-02-03 12:00:00 25.1 25.7 24.7 13 2019-02-03 13:00:00 24.5 25.0 24.2 14 2019-02-03 14:00:00 23.9 25.3 21.2 15 2019-02-03 15:00:00 19.6 21.2 18.8 16 2019-02-03 16:00:00 19.2 19.5 18.7 17 2019-02-03 17:00:00 19.8 19.9 19.4 18 2019-02-03 18:00:00 19.6 19.8 19.5 19 2019-02-03 19:00:00 19.3 19.4 19.1 20 2019-02-03 20:00:00 19.2 19.4 19.1 21 2019-02-03 21:00:00 19.3 19.4 18.9 22 2019-02-03 22:00:00 18.8 19.0 18.7 23 2019-02-03 23:00:00 19.0 19.1 18.9
Example 2
All hours for day available
Max & min available for each hour
NaN values for some Mean_temp entries
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp 24 2019-02-04 00:00:00 18.9 19.0 18.9 25 2019-02-04 01:00:00 18.8 18.9 18.7 26 2019-02-04 02:00:00 18.6 18.8 18.4 27 2019-02-04 03:00:00 18.4 18.6 18.1 28 2019-02-04 04:00:00 18.7 18.9 18.4 29 2019-02-04 05:00:00 18.8 18.8 18.7 30 2019-02-04 06:00:00 19.0 19.3 18.8 31 2019-02-04 07:00:00 19.7 20.4 19.3 32 2019-02-04 08:00:00 21.4 22.8 20.3 33 2019-02-04 09:00:00 23.5 23.9 22.8 34 2019-02-04 10:00:00 25.7 23.6 35 2019-02-04 11:00:00 26.5 25.4 36 2019-02-04 12:00:00 27.1 26.1 37 2019-02-04 13:00:00 25.8 26.8 24.8 38 2019-02-04 14:00:00 25.4 27.8 23.7 39 2019-02-04 15:00:00 22.1 24.1 20.2 40 2019-02-04 16:00:00 21.8 22.6 20.2 41 2019-02-04 17:00:00 20.9 22.4 19.6 42 2019-02-04 18:00:00 18.9 19.6 18.6 43 2019-02-04 19:00:00 18.8 18.9 18.6 44 2019-02-04 20:00:00 18.9 19.0 18.8 45 2019-02-04 21:00:00 18.8 18.9 18.7 46 2019-02-04 22:00:00 18.8 18.9 18.7 47 2019-02-04 23:00:00 18.9 19.2 18.7
Example 3
Not all hours of the day are available
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp 48 2019-02-05 00:00:00 19.2 19.3 19.0 49 2019-02-05 01:00:00 19.3 19.4 19.3 50 2019-02-05 02:00:00 19.3 19.4 19.2 51 2019-02-05 03:00:00 19.4 19.5 19.4 52 2019-02-05 04:00:00 19.5 19.6 19.3 53 2019-02-05 05:00:00 19.3 19.5 19.1 54 2019-02-05 06:00:00 20.1 20.6 19.2 55 2019-02-05 07:00:00 21.1 21.7 20.6 56 2019-02-05 08:00:00 22.3 23.2 21.7 57 2019-02-05 15:00:00 25.3 25.8 25.0 58 2019-02-05 16:00:00 25.8 26.0 25.2 59 2019-02-05 17:00:00 24.3 25.2 23.3 60 2019-02-05 18:00:00 22.5 23.3 22.1 61 2019-02-05 19:00:00 21.6 22.1 21.1 62 2019-02-05 20:00:00 21.1 21.3 20.9 63 2019-02-05 21:00:00 21.2 21.3 20.9 64 2019-02-05 22:00:00 20.9 21.0 20.6 65 2019-02-05 23:00:00 19.9 20.6 19.7
Example 4
All hours of the day are available
Max and/or min have at least one NaN value
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp 66 2019-02-06 00:00:00 19.7 19.8 19.7 67 2019-02-06 01:00:00 19.6 19.7 19.3 68 2019-02-06 02:00:00 19.0 19.3 18.6 69 2019-02-06 03:00:00 18.5 18.6 18.4 70 2019-02-06 04:00:00 18.6 18.7 18.4 71 2019-02-06 05:00:00 18.5 18.6 72 2019-02-06 06:00:00 19.0 19.6 18.5 73 2019-02-06 07:00:00 20.3 21.2 19.6 74 2019-02-06 08:00:00 21.5 21.7 21.2 75 2019-02-06 09:00:00 21.4 22.3 20.9 76 2019-02-06 10:00:00 23.5 24.4 22.3 77 2019-02-06 11:00:00 24.7 25.4 24.3 78 2019-02-06 12:00:00 24.9 25.5 23.9 79 2019-02-06 13:00:00 23.4 24.0 22.9 80 2019-02-06 14:00:00 23.3 23.8 22.9 81 2019-02-06 15:00:00 24.4 23.7 82 2019-02-06 16:00:00 24.9 25.1 24.7 83 2019-02-06 17:00:00 24.4 24.9 23.8 84 2019-02-06 18:00:00 22.5 23.8 21.7 85 2019-02-06 19:00:00 20.8 21.8 19.6 86 2019-02-06 20:00:00 19.1 19.6 18.9 87 2019-02-06 21:00:00 19.0 19.1 18.9 88 2019-02-06 22:00:00 19.1 19.1 19.0 89 2019-02-06 23:00:00 19.1 19.1 19.0
Just to recap, the above inputs would create the following output:
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
I've had a really good think about this, and I can only come up with a horrible set of if statements that I known will be terribly slow and will take ages to write (apologies, this is due to me being bad at coding)!
Does anyone have any pointers to Pandas functions that could begin to deal with this problem efficiently?
CodePudding user response:
You can use a groupby on the day of the Date_time column, and build each row of your final_df from each group (moving to the next iteration of the groupby whenever there are any missing values in the max_temp or min_temp columns, or whenever the length of the group is less than 24)
Note that I assuming that your Date_time column is of type datetime64[ns]. If it isn't, you should run the line: df['Date_time'] = pd.to_datetime(df['Date_time'])
all_hours = list(pd.date_range(start='1/1/22 00:00:00', end='1/1/22 23:00:00', freq='h').strftime('%H:%M'))
final_df = pd.DataFrame(columns=['Date_time'] all_hours ['Max','Min'])
## construct final_df by using a groupby on the day of the 'Date_time' column
for group,df_group in df.groupby(df['Date_time'].dt.date):
## check if NaN is in either 'Max Temp' or 'Min Temp' columns
new_df_data = {}
if (df_group[['Max_temp','Min_temp']].isnull().sum().sum() == 0) & (len(df_group) == 24):
## create a dictionary for the new row of the final_df
new_df_data['Date_time'] = group
new_df_data.update(dict(zip(all_hours, [[val] for val in df_group['Mean_temp']])))
new_df_data['Max'], new_df_data['Min'] = df_group['Max_temp'].max(), df_group['Min_temp'].min()
final_df = pd.concat([final_df, pd.DataFrame(new_df_data)])
else:
continue
Output:
>>> final_df
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.2 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
0 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 20.9 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
