Let's say I have a pd.DataFrame with the columns "dir" and "speed":
import pandas as pd
df = pd.DataFrame({'dir': ['fwd', 'fwd', 'fwd', 'bwd', 'bwd'],
'speed': [10, 5, 1, 6, 8]})
# or with more columns:
df = pd.DataFrame({'dir': ['fwd', 'fwd', 'fwd', 'bwd', 'bwd'],
'speed': [10, 5, 1, 6, 8],
'mass': [100, 200, 100, 500, 300]})
dir speed
0 fwd 10
1 fwd 5
2 fwd 1
3 bwd 6
4 bwd 8
I'm trying to calculate 3 things, with the result being a DataFrame with 1 row, containing "median_speed", "median_fwd_speed", "median_bwd_speed".
I'm really new to Pandas so forgive my horrible upcoming mistakes. Also, I have a lot of other stuff being calculated, so keeping agg is definitely preferable, but doing away with np.where() would be great.
What I have so far:
# duplicate dir column for future referencing
df['dir2'] = df['dir']
# groupby and calc median for fwd and bwd
df = df.groupby('dir').agg({"dir2": lambda x: x.iloc[0], # how do I do nothing with agg?
"speed": "median"})
# grab forward and bwd fields
df['median_fwd_speed'] = np.where(df['dir2'] == 'fwd', df['speed'], 0)
df['median_bwd_speed'] = np.where(df['dir2'] == 'bwd', df['speed'], 0)
Output:
dir2 speed median_fwd_speed median_bwd_speed
dir
bwd bwd 7.0 0.0 7.0
fwd fwd 5.0 5.0 0.0
Of course the output is not 1 row, and doesn't contain the total median. Any help would be appreciated!
I could probably use df["speed"].median() and store it as a variable, but is there an elegant way using just groupby and agg?
Expected output with multiple columns would be something like:
median_speed fwd_median_speed bwd_median_speed median_mass fwd_median_mass bwd_median_mass
6 5 7 200 100 400
CodePudding user response:
You can aggregate median and then add new column for median:
f = lambda x: f'median_{x}_speed'
df1=df.groupby('dir')[['speed']].median().rename(f).T.assign(median = df['speed'].median())
print (df1)
dir median_bwd_speed median_fwd_speed median
speed 7 5 6.0
CodePudding user response:
I would extract each value, place it in a list and convert the list to a DataFrame so the values are represented in a single row.
metrics = [
df[df['dir'=='fwd']]['speed'].median()
df[df['dir'=='bwd']]['speed'].median()
df['speed'].median()
]
pd.DataFrame([metrics], columns=['median_fwd', 'median_bwd', 'median_speed'])
CodePudding user response:
You'll need to create a separate dataframe to hold the result. You can get the medians in two operations: a direct median, and a median on the groupby:
pd.DataFrame([[df['speed'].median(), *df.groupby('dir')['speed'].median()]], columns=['median_speed', 'median_bwd_speed', 'median_fwd_speed'])
CodePudding user response:
Something like this should work
(df.groupby('dir')['speed']
.agg('median')
.append(pd.Series(index = ['total'], data = df['speed'].median()))
.to_frame()
.T
)
output:
bwd fwd total
0 7.0 5.0 6.0
you can further rename columns if you want using .rename(columns = ...) method
Edit
with multiple columns this should work
(df.groupby('dir')
.agg('median')
.append(df.median().rename('total') )
.unstack()
.to_frame()
.T
)
output
speed mass
dir bwd fwd total bwd fwd total
0 7.0 5.0 6.0 400.0 100.0 200.0
