A process records a batch of data (e.g. batch_size = 27) from multiple iterations (e.g. n_iterations = 10) and flattens them into a CSV file. In this case, the file would have 270 rows. I need to group/aggregate the data by batch to be able to calculate metrics like average, median and standard deviation on the Duration field. The entries in the Name field repeat and have the same order within each batch, so df[0]["Name"] == df[batch_size]["Name"], but might not be unique within each batch.
Name Duration
0 A 9.664
1 B 25.184
2 A 10.656
3 C 9.408
4 D 8.608
... ... ...
265 A 9.920
266 D 4.064
267 A 11.616
268 B 13.664
269 A 9.440
The final result should be a dataframe with batch_size rows that contains all Durations as columns Duration_{i}, or as a list [Duration_0, Duration_1, ...], or directly the statistical metrics Duration_mean, Duration_median, ... as columns. Any format is acceptable.
Name Duration_0 Duration_1 ...
0 A 9.664 7.221 ...
1 B 25.184 24.110 ...
... ... ... ... ...
25 B 13.664 12.853 ...
26 A 9.440 8.502 ...
or
Name Duration
0 A [9.664, 7.221, ...]
1 B [25.184, 24.110, ...
... ... ...
25 B [13.664, 12.853, ...]
26 A [9.440, 8.502, ...]
or
Name Duration_mean Duration_median ...
0 A 9.552 8.997 ...
1 B 24.444 24.321 ...
...
25 B 13.521 13.415 ...
26 A 9.440 9.502 ...
How can this be done with Pandas?
EDIT: Simpler data for verification with batch_size=3 and n_iterations=2:
Input:
Name Duration
0 A 1 -
1 B 2 | batch_0
2 A 2 -
3 A 5 -
4 B 2 | batch_1
5 A 4 -
Output:
Name mean
0 A 3 // from (1 5)/2
1 B 2 // from (2 2)/2
2 A 5 // from (0 10)/2
CodePudding user response:
If need aggregate per n_batch rows with first value of Name per groups and Duration aggregate by mean and median use:
n_batch = 3
g = df.index % n_batch
df = df.groupby(g).agg(Neme=('Name','first'),
Duration_mean=('Duration','mean'),
Duration_median=('Duration','median') )
print (df)
Neme Duration_mean Duration_median
0 A 3 3
1 B 2 2
2 A 3 3
