Does f1 score really depend on which class is given the positive label?
When I use scikit-learn's f1_score metric, it seems to:
>>> from sklearn import metrics as m
>>> m.f1_score([0,0,0,1,1,1],[0,0,0,1,1,0])
0.8
>>> m.f1_score([1,1,1,0,0,0],[1,1,1,0,0,1])
0.8571428571428571
The only difference between the first and second case is that 0 and 1 have been swapped. But I get a different answer.
This seems really bad. It means that if I'm reporting the f1 score for a cat/dog classifier, the value depends on whether cats or dogs get the positive label.
Is this really true, or did I mess something up?
CodePudding user response:
For multiclass classification you should use a 
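The binary F1 score is F1 = tp/(tp + (fn + fp)/2), where tp is the number of true positives, fn the number of false negatives, and fp the number of false positives.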
I will use a prime (') to denote the measures after the labels are swapped.
By swapping labels we have tn'=tp, fn'=fp, fp'=fn, tp'=tn.
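If you want F1' = F1, you need tp/(tp + (fn + fp)/2) = tp'/(tp' + (fn' + fp')/2) = tn/(tn + (fn + fp)/2). As long as there is at least one error (fn + fp > 0), that is satisfied if, and only if, tp = tn. In your example tp = 2 and tn = 3, so the two scores differ.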
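To make this concrete, here is a minimal sketch in plain Python that reproduces your two numbers from the counts. The helper `f1_for_label` is a hypothetical name of mine, not part of scikit-learn. It also computes the unweighted mean of the two per-class scores (macro averaging), which does not change when the labels are swapped.

```python
def f1_for_label(y_true, y_pred, positive):
    """F1 = tp / (tp + (fn + fp) / 2), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = tp + (fn + fp) / 2
    # Returning 0.0 when nothing is positive is a convention chosen for this sketch.
    return tp / denom if denom else 0.0


y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

f1_pos1 = f1_for_label(y_true, y_pred, positive=1)  # tp=2, fn=1, fp=0
f1_pos0 = f1_for_label(y_true, y_pred, positive=0)  # tp=3, fn=0, fp=1

# Averaging over both choices of positive class is symmetric in the labels.
macro_f1 = (f1_pos1 + f1_pos0) / 2

print(f1_pos1, f1_pos0, macro_f1)
```

In scikit-learn itself, `f1_score` takes a `pos_label` argument (default 1) to pick the positive class, and `average='macro'` gives the label-symmetric average. So for the cat/dog case, either state which class is positive or report the macro average.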
