Multi-label
Precision, recall, and F-scores all extend to multi-label classification, although they behave poorly on unbalanced classes.
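For concreteness, a minimal sketch using scikit-learn, where the targets are binary indicator matrices (one column per label); the data here is made up for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Multi-label targets as binary indicator matrices: one row per
# observation, one column per label. These values are illustrative.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

# 'micro' pools decisions across all labels; 'macro' averages the
# per-label scores, which is harsher on rare labels.
for avg in ("micro", "macro"):
    print(avg,
          precision_score(y_true, y_pred, average=avg),
          recall_score(y_true, y_pred, average=avg),
          f1_score(y_true, y_pred, average=avg))
```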
Unbalanced class problems
🏗
Calibration
In the context of binary classification, calibration refers to the process of transforming the output scores from a binary classifier to class probabilities. If we think of the classifier as a “black box” that transforms input data into a score, we can think of calibration as a post-processing step that converts the score into a probability of the observation belonging to class 1.
The scores from some classifiers can already be interpreted as probabilities (e.g. logistic regression), while the scores from some classifiers require an additional calibration step before they can be interpreted as such (e.g. support vector machines).
A recommended starting point is the tutorial of Huang et al. (2020) and its associated GitHub repository.
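One way to do this post-processing in scikit-learn is CalibratedClassifierCV; a minimal sketch on synthetic data (the particular estimator and settings here are illustrative, not a recipe):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap an SVM (whose decision scores are not probabilities) so that its
# outputs are mapped to calibrated class probabilities.
# method='sigmoid' is Platt scaling; 'isotonic' is a nonparametric option.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # calibrated P(class = 1) in column 1
```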
Metric Zoo
One of the less abstruse summaries of these is the scikit-learn classifier loss page, which includes both formulae and verbal descriptions. The Pirate's guide to various scores provides an easy introduction.
Matthews correlation coefficient
Due to Matthews (Matthews 1975). This is a good first choice for seamlessly handling multi-label problems, since it behaves reasonably for two-class or multiclass, balanced or unbalanced problems, and it is computationally cheap. Unless your classes have vastly different importances, it is a good default.
However, it is not differentiable with respect to classification certainties, so you cannot use it directly as, e.g., a training target in neural nets; instead, you optimise a differentiable surrogate measure and use the MCC to track your progress.
2-class case
Take your \(2 \times 2\) confusion matrix of true positives, false positives etc. Then
\[ {\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}} \]
\[ |{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}} \]
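As a sanity check, here is the two-class formula transcribed into NumPy and compared against scikit-learn's matthews_corrcoef; the labels are made up:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# Entries of the 2x2 confusion matrix.
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

mcc = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
assert np.isclose(mcc, matthews_corrcoef(y_true, y_pred))
```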
Multiclass case
Take your \(K \times K\) confusion matrix \(C\), then
\[ {\displaystyle {\text{MCC}}={\frac {\sum _{k}\sum _{l}\sum _{m}C_{kk}C_{lm}-C_{kl}C_{mk}}{{\sqrt {\sum _{k}(\sum _{l}C_{kl})(\sum _{k'|k'\neq k}\sum _{l'}C_{k'l'})}}{\sqrt {\sum _{k}(\sum _{l}C_{lk})(\sum _{k'|k'\neq k}\sum _{l'}C_{l'k'})}}}}} \]
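In practice, scikit-learn's matthews_corrcoef implements this multiclass generalisation directly, so there is no need to code the formula by hand (labels below are illustrative):

```python
from sklearn.metrics import matthews_corrcoef

# Three-class labels; any hashable label values work.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]
print(matthews_corrcoef(y_true, y_pred))
```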
ROC/AUC
Receiver Operating Characteristic/Area Under Curve. The idea supposedly dates back to radar operators in the mid-twentieth century. (Matthews 1975) talks about the AUC for radiology; supposedly (Matthews 1975) introduced it to machine learning, but I haven’t read the article in question. It allows you to trade off the importance of false positives against false negatives.
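A minimal sketch with scikit-learn; note that roc_auc_score expects scores (probabilities or decision values) rather than hard labels, since the curve is traced out by sweeping the decision threshold (numbers invented for illustration):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # e.g. predicted P(class = 1)

print(roc_auc_score(y_true, scores))
# Points on the ROC curve itself, one per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
```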
Cross entropy
I’d better write down the form of this, since most ML toolkits are curiously shy about it.
Let \(x\) be the estimated probability and \(z\) be the supervised class label. Then the binary cross entropy loss is
\[ \ell(x,z) = -z\log(x) - (1-z)\log(1-x) \]
If \(y=\operatorname{logit}(x)\) is not a probability but a logit, then the numerically stable version is
\[ \ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|)) \]
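Transcribed into NumPy (a sketch with made-up inputs; the helper names are mine), checking the stable logit form against the naive probability form on well-behaved inputs:

```python
import numpy as np

def bce_with_logits(y, z):
    """Binary cross entropy from logits y and labels z in {0, 1}.

    Stable form: max(y, 0) - y*z + log(1 + exp(-|y|)).
    """
    return np.maximum(y, 0) - y * z + np.log1p(np.exp(-np.abs(y)))

def bce_naive(x, z):
    """Naive form from probabilities x; overflows for extreme logits."""
    return -z * np.log(x) - (1 - z) * np.log(1 - x)

y = np.array([-2.0, 0.5, 3.0])   # logits
z = np.array([0.0, 1.0, 1.0])    # labels
x = 1 / (1 + np.exp(-y))         # sigmoid recovers the probabilities
assert np.allclose(bce_with_logits(y, z), bce_naive(x, z))
```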
f-measure et al.
🏗
Philosophical connection to semantics
Since semantics is what humans call classifiers.