Distinguishing whether a thing was generated from distribution A or B, by optimisation. This is a learning-theory, loss-minimisation framing; we might also consider probability distributions over categories.
Multi-class
Precision/recall and F-scores all extend to multi-label classification, although this exacerbates their bad behaviour on unbalanced classes.
There are also surprising models here. Read et al. (2021) discuss how to build multi-label classifiers by stacking binary classifiers, feeding each one’s prediction in as a feature for the next, which is an elegant solution IMO; a sketch follows.
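A minimal sketch of that idea using scikit-learn’s `ClassifierChain`; the synthetic data and logistic-regression base estimator are arbitrary choices for illustration, not anything from Read et al.

```python
# Classifier chain sketch: each binary classifier in the chain sees the
# original features plus the predictions of the earlier classifiers.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=500, n_classes=5, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0)
chain.fit(X_train, Y_train)
Y_pred = chain.predict(X_test)

# Multi-label precision/recall-style summaries, per the caveat above.
print("micro-F1:", f1_score(Y_test, Y_pred, average="micro"))
print("macro-F1:", f1_score(Y_test, Y_pred, average="macro"))
```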
Relative distributions
Why characterise a difference in distributions by a summary statistic? Just have an object which is a relative distribution.
Classification loss zoo
Surprisingly subtle. ROC, AUC, precision/recall, confusion…
One of the less abstruse summaries of these is the scikit-learn classifier loss page, which includes both formulae and verbal descriptions. The Pirates guide to various scores provides an easy introduction.
Matthews correlation coefficient
Due to Matthews (1975). This is the first choice for seamlessly handling multi-class problems with a single metric, since its behaviour is reasonable for two-class or multi-class, balanced or unbalanced data, and it is computationally cheap. Unless your classes have vastly different importance, this is a good default.
However, it is not differentiable with respect to the classification certainties, so you can’t use it as, e.g., a training loss in neural nets; instead you optimise a differentiable surrogate and use the MCC to track your progress.
2-class case
Take your \(2 \times 2\) confusion matrix of true positives, false positives, etc. Then
\[ {\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}} \]
\[ |{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}} \]
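A quick sanity check of the two-class formula against scikit-learn’s implementation; the labels here are made up for illustration.

```python
# Two-class MCC straight from the confusion-matrix entries.
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

# sklearn's confusion_matrix lays the 2x2 case out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(mcc, matthews_corrcoef(y_true, y_pred))  # these should agree
```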
Multiclass case
Take your \(K \times K\) confusion matrix \(C\), then
\[ {\text{MCC}}={\frac {\sum _{k}\sum _{l}\sum _{m}\left(C_{kk}C_{lm}-C_{kl}C_{mk}\right)}{{\sqrt {\sum _{k}\left(\sum _{l}C_{kl}\right)\left(\sum _{k'\neq k}\sum _{l'}C_{k'l'}\right)}}\;{\sqrt {\sum _{k}\left(\sum _{l}C_{lk}\right)\left(\sum _{k'\neq k}\sum _{l'}C_{l'k'}\right)}}}} \]
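The same quantity is easier to compute from the trace and the row/column marginals of \(C\). A sketch, checked against scikit-learn’s `matthews_corrcoef`; the labels are invented for illustration.

```python
# Multiclass MCC from the K x K confusion matrix, using the equivalent
# trace-and-marginals form of the formula above.
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1, 0, 0])

C = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
n = C.sum()          # total number of samples
c = np.trace(C)      # correctly classified samples
t = C.sum(axis=1)    # how often each class truly occurs
p = C.sum(axis=0)    # how often each class is predicted

mcc = (c * n - t @ p) / np.sqrt((n**2 - p @ p) * (n**2 - t @ t))
print(mcc, matthews_corrcoef(y_true, y_pred))  # these should agree
```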
ROC/AUC
Receiver Operating Characteristic/Area Under the Curve. Supposedly dates back to radar operators in WWII. It is the graph of the false positive rate versus the true positive rate as the decision criterion changes. Matthews (1975) talks about the AUC for radiology; supposedly Spackman (1989) introduced it to machine learning, but I haven’t read the article in question. It allows you to trade off the importance of false positives against false negatives.
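A sketch of sweeping the decision threshold with scikit-learn; the scores and labels here are arbitrary illustrative values.

```python
# ROC curve and AUC from predicted scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", roc_auc_score(y_true, scores))

# Each (FPR, TPR) pair is one operating point; sweeping the threshold trades
# false positives against false negatives.
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold {thr:.2f}: FPR {f:.2f}, TPR {t:.2f}")
```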
Cross entropy
I’d better write down the form of this, since most ML toolkits are curiously shy about it.
Let \(x\) be the estimated probability and \(z\) be the supervised class label. Then the binary cross entropy loss is
\[ \ell(x,z) = -z\log(x) - (1-z)\log(1-x) \]
If \(y=\operatorname{logit}(x)\) is not a probability but a logit, then the numerically stable version is
\[ \ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|)) \]
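A sketch comparing the naive and stable forms in NumPy; the example logits are deliberately extreme to show why the stable form matters.

```python
import numpy as np

def bce(x, z):
    """Naive binary cross entropy on a probability x in (0, 1)."""
    return -z * np.log(x) - (1 - z) * np.log(1 - x)

def bce_with_logits(y, z):
    """Numerically stable binary cross entropy on a logit y."""
    return np.maximum(y, 0) - y * z + np.log1p(np.exp(-np.abs(y)))

y = np.array([-30.0, -2.0, 0.0, 2.0, 30.0])  # logits, including extreme values
z = np.array([0.0, 1.0, 1.0, 0.0, 1.0])      # labels
x = 1 / (1 + np.exp(-y))                     # sigmoid turns logits into probabilities

print(bce_with_logits(y, z))                  # stays finite even for extreme logits
print(bce(np.clip(x, 1e-12, 1 - 1e-12), z))   # naive version needs clipping to avoid log(0)
```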
f-measure et al
🏗
Gumbel-max
See Gumbel-max tricks.
Pólya-Gamma augmentation
See Pólya-Gamma.
Unbalanced class problems
🏗
Philosophical connection to semantics
Since semantics is what humans call classifiers.
Connection to legibility
Seriously, I do think there is something interesting happening with legibility here. States need to classify, apparently.