I have a Pandas DataFrame with a start column of dtype of datetime64[ns, UTC] and the DataFrame is sorted in ascending order based on the start column. From this DataFrame I used the following to create a new (updated) DataFrame indicating the day of the week for the start column
format_datetime_df['day_of_week'] = format_datetime_df['start'].dt.dayofweek
I want to pass the DataFrame into a function. The function needs to loop through the days of the week, so from 0 to 6, and keep a running total of the distance (kept in column 'distance') covered. If the distance covered is greater than 15, then a counter is incremented. It needs to do this for all rows of the DataFrame. The return of the function will be the total number of weeks over 15.
I am getting stuck on how to implement this as my 'day_of_week' column starts as follows
3
3
5
1
5
So, week 1 would be comprised of 3, 3, 5 and week 2 would be comprised of 1, 5, ...
I want to do something like
number_of_weeks_over_10km = format_datetime_df.groupby().apply(weeks_over_10km)
but am not really sure what should go in the groupby() function. I also feel like I am overcomplicating this.
CodePudding user response:
It was complicated, but I figured it out. Here is the basic flow of what I did
# Create a helper index that allows iteration by week while also considering the year
# Function to return the total distance for each week
# Create a NumPy array to store the total distance for each week
# Append the total distance for each week to the array
# Count the number of times the total distance for each week was > x (in km)
The helper index that allowed for iteration by week while also considering the year came from another post here on Stack Overflow (Iterate over pd df with date column by week python). This had a consequence though, in that I had to create and append the NumPy array outside of the function in order to get everything to work.
CodePudding user response:
I guess you can solve that using Pandas without functions. Just determine year and week using
df["isoweek"] = (df["start"].dt.isocalendar()["year"].astype(str)
" "
df["start"].dt.isocalendar()["week"].astype(str)
)
Then you determine the distance using a groupby and count the entries above 15:
weeks_above_15 = (df.groupby("isoweek")["distance"].sum() > 15).sum()
