Pandas - What datatype should a duration column (mm:ss) be to use aggregates on it?

I'm doing some NBA analysis and have a "Minutes Played" column for players in mm:ss format. What dtype should this column be so that I can run aggregate functions (mean, min, max, etc.) on it? The DataFrame has over 20,000 rows, so here is a sample of the column in question:

I ran this code to convert the column to datetime:

df['Minutes'] = pd.to_datetime(df['Minutes'], format='%M:%S', errors='coerce')

It changed the dtype successfully, but I am still unable to perform operations on the column. I get this error when trying to aggregate it:

DataError: No numeric types to aggregate

My code for the aggregation:

df2 = df.groupby(['Name', 'Team']).agg({'Minutes' : 'mean'})

I would like to see the average number of minutes and retain the mm:ss format.

Any help is appreciated.

CodePudding user response：

import numpy as np
import pandas as pd

data = {
    'Minutes': ['18:30', '24:50', '33:21', '28:39', '27:30'],
    'Team': ['team1', 'team2', 'team1', 'team1', 'team2']
}

df = pd.DataFrame(data)
# Prepend '00:' so each value parses as hh:mm:ss; empty strings become NaN, then NaT
df['Minutes'] = pd.to_timedelta('00:' + df['Minutes'].replace('', np.nan))
df.groupby('Team')['Minutes'].mean()

output:

>>>
Team
team1   0 days 00:26:50
team2   0 days 00:26:10
Name: Minutes, dtype: timedelta64[ns]
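
The question also asks to keep the mm:ss display. One way to get there is to format the timedelta means as strings after aggregating. This is only a minimal sketch: to_mmss is a hypothetical helper name, it rounds to the nearest whole second, and it lets the minutes part run past 59 instead of rolling over into hours.

```python
import pandas as pd

df = pd.DataFrame({
    'Minutes': ['18:30', '24:50', '33:21', '28:39', '27:30'],
    'Team': ['team1', 'team2', 'team1', 'team1', 'team2'],
})
df['Minutes'] = pd.to_timedelta('00:' + df['Minutes'])


def to_mmss(td):
    """Hypothetical helper: format a Timedelta as an mm:ss string."""
    if pd.isna(td):
        return None
    # Round to whole seconds, then split into minutes and seconds
    total_seconds = int(round(td.total_seconds()))
    minutes, seconds = divmod(total_seconds, 60)
    return f'{minutes:02d}:{seconds:02d}'


# Aggregate on the timedelta column first, format for display last
mean_minutes = df.groupby('Team')['Minutes'].mean()
mean_mmss = mean_minutes.apply(to_mmss)
print(mean_mmss)
```

Keeping the timedelta column for the arithmetic and producing strings only at the end means min, max, and sum keep working on the same column.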

CodePudding user response：

I would add another column holding the minutes as a float, derived from the .dt properties, and run the aggregation on that.

import pandas as pd
from pandas import Timestamp

df_dict = {'Minutes': {0: Timestamp('1900-01-01 00:18:30'),
  1: Timestamp('1900-01-01 00:24:50'),
  2: Timestamp('1900-01-01 00:33:21'),
  3: Timestamp('1900-01-01 00:28:39'),
  4: Timestamp('1900-01-01 00:27:30')},
 'points': {0: 22, 1: 34, 2: 25, 3: 31, 4: 33},
 'name': {0: 'kobe', 1: 'jordan', 2: 'kobe', 3: 'jordan', 4: 'durant'},
 'team': {0: 'lakers', 1: 'bulls', 2: 'lakers', 3: 'bulls', 4: 'nets'}}

df = pd.DataFrame(df_dict)

# Total minutes as a float: hours * 60 + minutes + seconds / 60
df['mins_float'] = df['Minutes'].dt.hour * 60 + df['Minutes'].dt.minute + df['Minutes'].dt.second / 60

df_gb = df.groupby(['name', 'team']).agg({'mins_float': 'mean'})

print(df_gb)

                mins_float
name    team    
durant  nets    27.500000
jordan  bulls   26.741667
kobe    lakers  25.925000

You can add this if you want to get back to an h:mm:ss string.

import datetime

df_gb['hms'] = df_gb['mins_float'].apply(lambda x:str(datetime.timedelta(minutes=x)))

print(df_gb)

                mins_float  hms
name    team        
durant  nets    27.500000   0:27:30
jordan  bulls   26.741667   0:26:44.500000
kobe    lakers  25.925000   0:25:55.500000
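
If the column has already been parsed with pd.to_datetime(..., format='%M:%S') as in the question, another option is to turn it into a timedelta by subtracting midnight of the placeholder date that the parse fills in. The sketch below assumes that setup. The Minutes_td column and the mean_td variable are illustrative names, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'Minutes': ['18:30', '24:50', '33:21', '28:39', '27:30'],
    'name': ['kobe', 'jordan', 'kobe', 'jordan', 'durant'],
    'team': ['lakers', 'bulls', 'lakers', 'bulls', 'nets'],
})
# Same conversion as in the question; the date part defaults to 1900-01-01
df['Minutes'] = pd.to_datetime(df['Minutes'], format='%M:%S', errors='coerce')

# normalize() returns midnight of each timestamp's date, so the difference
# is the time of day expressed as a timedelta (NaT stays NaT)
df['Minutes_td'] = df['Minutes'] - df['Minutes'].dt.normalize()

mean_td = df.groupby(['name', 'team'])['Minutes_td'].mean()
print(mean_td)
```

This avoids the intermediate float column, and the result is a timedelta64[ns] Series, so min, max, and sum work on it as well.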