I want to produce a rather tricky statistic. My dataframe contains the true classes and the predictions of a machine learning model for trips and the segments of those trips. The problem is best explained with an example, so here is a sample df:
import pandas as pd

df = pd.DataFrame(
    {'trip': [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54, 73, 73, 73, 75, 75],
     'segment': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 3],
     'class': [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 1, 1],
     'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1, 4, 2, 4, 0, 2]
    }
)
df
    trip  segment  class  prediction
0     25        0      3           0
1     25        0      3           0
2     25        0      3           3
3     25        0      3           3
4     25        0      3           3
5     25        1      3           4
6     25        1      3           4
7     25        1      3           2
8     25        1      3           2
9     54        2      1           0
10    54        2      1           0
11    54        2      1           1
12    54        2      1           1
13    73        0      2           4
14    73        0      2           2
15    73        1      2           4
16    75        1      1           0
17    75        3      1           2
From this df, I would like to compute statistics of the model's predictions at the trip and segment levels, using the majority vote of the predictions and taking into account the actual class each trip or segment belongs to.
Segment-level statistics
Given the above df, I would like to produce the segment-level statistics below (explanation follows):
class  total-segments  correctly-predicted  accuracy-rate
0      -               -                    -
1      3               1                    0.33
2      2               1                    0.5
3      2               1                    0.5
4      -               -                    -
- There is no segment of class 0, hence the dashes.
- There are 3 distinct segments of class 1 (segment 2 of trip 54 and segments 1 & 3 of trip 75). Of the 3, only one (segment 2 of trip 54) has the majority vote of its predictions correct, so 1 correctly-predicted and an accuracy-rate of 0.33 (i.e. 1/3).
- There are 2 segments of class 2 (segments 0 & 1 of trip 73). Segment 0 has a correct majority vote, so 1 correctly-predicted and an accuracy-rate of 0.5 (i.e. 1/2).
- There are 2 segments of class 3 (segments 0 & 1 of trip 25). Segment 0 has a correct majority vote, so 1 correctly-predicted and an accuracy-rate of 0.5 (i.e. 1/2).
- There is no segment of class 4.
Trip-level statistics
Similarly, considering the class type of distinct trips in df and their prediction, I want to produce the following trip-level statistics (also explained below):
class  total-trips  correctly-predicted  accuracy-rate
0      -            -                    -
1      2            1                    0.5
2      1            0                    0.0
3      1            1                    1.0
4      -            -                    -
- No trip belongs to class 0.
- There are 2 trips of class 1 (trips 54 & 75). 1 trip was predicted correctly (majority vote of trip 54), so 1 correctly-predicted trip and an accuracy-rate of 0.5.
- There is 1 trip of class 2 (trip 73). Its majority-vote prediction is incorrect, so 0 correctly-predicted trips and an accuracy-rate of 0.0.
- There is 1 trip of class 3 (trip 25). Its majority-vote prediction is correct (3), so 1 correctly-predicted trip and an accuracy-rate of 1.0.
- No trip belongs to class 4.
Please forgive the lengthy explanation, but this is a problem that can only be understood when explained in detail.
Answer:
You can do it this way. To see what each step does, comment out everything but the first line of the chain and then uncomment the lines one by one.
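res_seg = (
    # True where a row is predicted correctly
    df['class'].eq(df['prediction'])
    # share of correct rows per segment; trip is part of the key
    # because segment numbers repeat across trips
    .groupby([df['class'], df['trip'], df['segment']]).mean()
    # a segment counts as correct when at least half of its rows are correct
    .ge(0.5)
    # per class: number of segments and number of correct segments
    .groupby(level='class').agg(['size', 'sum'])
    .rename(columns={'size': 'total_segments', 'sum': 'correctly_predicted'})
    .assign(accuracy_rate=lambda x: x['correctly_predicted'] / x['total_segments'])
    # list classes 0 to 4, with '-' where a class has no segment
    .reindex(range(5), fill_value='-')
    .reset_index()
)
print(res_seg)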
#    class total_segments correctly_predicted accuracy_rate
# 0      0              -                   -             -
# 1      1              3                   1      0.333333
# 2      2              2                   1           0.5
# 3      3              2                   1           0.5
# 4      4              -                   -             -
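To make the "uncomment one by one" advice concrete, here is a minimal sketch that stores the first three steps of the chain in separate variables so they can be printed and inspected. The variable names `hit`, `share` and `seg_ok` are made up for illustration, and including `trip` in the grouping key is an assumption made so that same-numbered segments of different trips stay separate.

```python
import pandas as pd

df = pd.DataFrame({
    'trip':       [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54, 73, 73, 73, 75, 75],
    'segment':    [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 3],
    'class':      [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1, 4, 2, 4, 0, 2],
})

# Step 1: row-level hit/miss (True where the prediction equals the class).
hit = df['class'].eq(df['prediction'])

# Step 2: share of correct rows within each (class, trip, segment) group.
# Assumption: trip is added to the key to keep segments of different trips apart.
share = hit.groupby([df['class'], df['trip'], df['segment']]).mean()

# Step 3: a segment is flagged as correct when at least half of its rows are hits.
seg_ok = share.ge(0.5)

print(share)
print(seg_ok)
```

Printing `share` shows, for example, that segment 0 of trip 25 has 3 correct rows out of 5, which is why it passes the 0.5 threshold.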
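For the trips, group by [df['class'], df['trip']] instead, and adjust the column names in the rename and the assign accordingly. Note that .mean().ge(0.5) tests whether at least half of the rows in a group are correct, which is not quite a majority vote: trip 25 has only 3 correct rows out of 9, so this test marks it as incorrect even though 3 is its most frequent prediction.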
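A share-of-correct-rows threshold and a most-frequent-prediction vote can disagree when the predictions of a group are spread over several classes. The following is a minimal sketch of a vote-based variant, not the method used above: a group is counted as correct when its true class receives as many votes as the most frequent prediction. It assumes that ties count as correct, which is what the expected output in the question implies for trip 54. The helper name `majority_vote_stats` and its parameters are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'trip':       [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54, 73, 73, 73, 75, 75],
    'segment':    [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 3],
    'class':      [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1, 4, 2, 4, 0, 2],
})

def majority_vote_stats(df, keys, n_classes=5):
    """Per-class accuracy of groups identified by `keys` (hypothetical helper).

    A group is correct when its true class is among its most frequent
    predictions (assumption: ties count as correct).
    """
    group_keys = ['class'] + list(keys)
    # Votes received by the true class in each group.
    true_votes = (
        df['prediction'].eq(df['class'])
        .groupby([df[k] for k in group_keys]).sum()
    )
    # Votes received by the most frequent prediction in each group.
    top_votes = (
        df.groupby(group_keys + ['prediction']).size()
        .groupby(level=group_keys).max()
    )
    # Correct when the true class wins or ties for the top spot.
    correct = true_votes.eq(top_votes)
    out = (
        correct.groupby(level='class').agg(['size', 'sum'])
        .rename(columns={'size': 'total', 'sum': 'correctly_predicted'})
    )
    out['accuracy_rate'] = out['correctly_predicted'] / out['total']
    # Classes without any group show up as NaN rows.
    return out.reindex(range(n_classes))

trip_stats = majority_vote_stats(df, ['trip'])
segment_stats = majority_vote_stats(df, ['trip', 'segment'])
print(trip_stats)
print(segment_stats)
```

On the sample df this reproduces both tables from the question: trip 25 is counted as correct at the trip level, and the segment-level counts are unchanged.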
