# Classification

## Computer says no

Distinguishing whether a thing was generated by distribution A or B. This is a learning-problem framing; we might also consider probability distributions over categories.

## Multi-class

Precision/recall and F-scores all generalise to multi-class (and multi-label) classification, although this exacerbates their bad behaviour under class imbalance.

There are also surprising models here. Read et al. (2021) discuss how to build multi-label classifiers by chaining binary classifiers, feeding each one's output as a feature into the next, which is an elegant solution IMO; see the sketch below.
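A minimal sketch of this classifier-chain idea using scikit-learn's `ClassifierChain`; the synthetic dataset and logistic-regression base learner are illustrative choices of mine, not from the paper.

```python
# Classifier chains: each binary classifier sees the original features plus
# the outputs of the classifiers before it in the chain (Read et al. 2021).
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=500, n_classes=5, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0)
chain.fit(X_train, Y_train)
print(chain.score(X_test, Y_test))  # subset accuracy on held-out data
```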

## Relative distributions

Why characterise a difference between distributions by a summary statistic? Just use an object which is itself a relative distribution.

## Probabilistic classification: Calibration

Kenneth Tay says

> In the context of binary classification, calibration refers to the process of transforming the output scores from a binary classifier to class probabilities. If we think of the classifier as a “black box” that transforms input data into a score, we can think of calibration as a post-processing step that converts the score into a probability of the observation belonging to class 1.
>
> The scores from some classifiers can already be interpreted as probabilities (e.g. logistic regression), while the scores from some classifiers require an additional calibration step before they can be interpreted as such (e.g. support vector machines).

He recommends the tutorial by Huang et al. (2020) and its associated GitHub repository.
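As a concrete example, here is a minimal sketch of post-hoc calibration in scikit-learn, with Platt scaling (`method="sigmoid"`) mapping SVM decision scores to probabilities; the dataset and model choices are mine, not from the tutorial.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LinearSVC emits uncalibrated decision scores; the wrapper fits a logistic
# map from score to P(class = 1) on held-out folds.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # calibrated probabilities
```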

## Classification loss zoo

Surprisingly subtle. ROC, AUC, precision/recall, confusion…

One of the less abstruse summaries of these is the scikit-learn classifier metrics page, which includes both formulae and verbal descriptions. The Pirate's guide to various scores provides an easy introduction.

### Matthews correlation coefficient

Due to Matthews (1975); Gorodkin (2004) extends it to the multi-class case. This is the first choice for seamlessly handling multi-class problems within a single metric, since its behaviour is reasonable for 2-class or multi-class, balanced or unbalanced data, and it is computationally cheap. Unless your classes have vastly different importances, this is a good default.

However, it is not differentiable with respect to classification certainties, so you can't use it as, e.g., a training loss in neural nets; instead you optimise a differentiable surrogate loss and use the MCC to track your progress.

#### 2-class case

Take your $$2 \times 2$$ confusion matrix of true positives, false positives, etc. Then

${\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}}$

$|{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}}$

where $$n$$ is the total number of observations.
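A quick sanity check of the two-class formula against scikit-learn's implementation; the counts here are made up.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

tp, tn, fp, fn = 90, 80, 10, 20
mcc = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Reconstruct labels with the same confusion matrix and compare.
y_true = [1] * (tp + fn) + [0] * (tn + fp)
y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp
assert np.isclose(mcc, matthews_corrcoef(y_true, y_pred))
```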

#### Multiclass case

Take your $$K \times K$$ confusion matrix $$C$$, then

${\displaystyle {\text{MCC}}={\frac {\sum _{k}\sum _{l}\sum _{m}\left(C_{kk}C_{lm}-C_{kl}C_{mk}\right)}{{\sqrt {\sum _{k}\left(\sum _{l}C_{kl}\right)\left(\sum _{k'|k'\neq k}\sum _{l'}C_{k'l'}\right)}}{\sqrt {\sum _{k}\left(\sum _{l}C_{lk}\right)\left(\sum _{k'|k'\neq k}\sum _{l'}C_{l'k'}\right)}}}}}$
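In code it is easier to use the algebraically equivalent covariance form of this statistic (essentially what scikit-learn's `matthews_corrcoef` computes); a sketch, with a made-up confusion matrix:

```python
import numpy as np

def multiclass_mcc(C):
    """MCC from a K x K confusion matrix; C[k, l] counts true class k predicted as l."""
    t = C.sum(axis=1)   # per-class true counts
    p = C.sum(axis=0)   # per-class predicted counts
    s = C.sum()         # total samples
    c = np.trace(C)     # correctly classified samples
    den = np.sqrt(s**2 - p @ p) * np.sqrt(s**2 - t @ t)
    return (c * s - t @ p) / den if den else 0.0

C = np.array([[50, 3, 2],
              [4, 45, 6],
              [1, 7, 40]])
print(multiclass_mcc(C))
```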

### ROC/AUC

Receiver Operating Characteristic/Area Under the Curve. Supposedly dates back to radar operators in WWII. It is the graph of the false positive rate versus the true positive rate as the decision criterion changes. Hanley and McNeil (1983) discuss the AUC for radiology; supposedly Spackman (1989) introduced it to machine learning, but I haven't read the article in question. It allows you to trade off the importance of false positives against false negatives.
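A minimal sketch of tracing the curve with scikit-learn; the labels and scores here are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.5])

# Each point on the ROC curve is (FPR, TPR) at one decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))  # area under that curve
```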

### Cross entropy

I'd better write down the form of this, since most ML toolkits are curiously shy about it.

Let $$x$$ be the estimated probability and $$z$$ be the supervised class label. Then the binary cross entropy loss is

$\ell(x,z) = -z\log(x) - (1-z)\log(1-x)$

If $$y=\operatorname{logit}(x)$$ is not a probability but a logit, then the numerically stable version is

$\ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|))$
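A sketch of that stable form in numpy (it is what, e.g., PyTorch's `BCEWithLogitsLoss` computes); the test values are arbitrary.

```python
import numpy as np

def bce_with_logits(y, z):
    """Binary cross entropy where y is a logit and z is the 0/1 label."""
    return np.maximum(y, 0) - y * z + np.log1p(np.exp(-np.abs(y)))

# Agrees with the naive formula where the latter is well-behaved...
y, z = 2.5, 1.0
x = 1 / (1 + np.exp(-y))  # sigmoid recovers the probability
naive = -z * np.log(x) - (1 - z) * np.log(1 - x)
assert np.isclose(bce_with_logits(y, z), naive)

# ...but does not overflow for extreme logits.
print(bce_with_logits(1000.0, 0.0))  # 1000.0, no inf or nan
```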

🏗

## Philosophical connection to semantics

Semantics is, after all, what humans call their classifiers.

## Pólya-Gamma distribution

An infinite weighted sum of Gamma RVs which is useful in Bayesian binomial regression (Polson, Scott, and Windle 2013), and maybe other things?
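A minimal sketch of the Pólya-Gamma Gibbs sampler for Bayesian logistic regression from Polson, Scott, and Windle (2013), assuming the third-party `polyagamma` package for the PG draws; the synthetic data, prior, and chain length are placeholders of mine.

```python
import numpy as np
from polyagamma import random_polyagamma  # pip install polyagamma

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

B_inv = np.eye(d)    # prior precision: beta ~ N(0, I)
kappa = y - 0.5      # the augmented likelihood is Gaussian in kappa
beta = np.zeros(d)
for _ in range(1000):
    # omega_i | beta ~ PG(1, x_i' beta)
    omega = random_polyagamma(1, X @ beta, random_state=rng)
    # beta | omega, y ~ N(m, V)
    V = np.linalg.inv(X.T @ (omega[:, None] * X) + B_inv)
    m = V @ (X.T @ kappa)
    beta = rng.multivariate_normal(m, V)
print(beta)  # final draw; in practice keep the whole chain
```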

🏗

## References

Flach, Peter, José Hernández-Orallo, and Cesar Ferri. 2011. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 657–64.
Gneiting, Tilmann, and Adrian E Raftery. 2007. Journal of the American Statistical Association 102 (477): 359–78.
Gorodkin, J. 2004. Computational Biology and Chemistry 28 (5-6): 367–74.
Hand, David J. 2009. Machine Learning 77 (1): 103–23.
Hanley, J. A., and B. J. McNeil. 1983. Radiology 148 (3): 839–43.
Huang, Yingxiang, Wentao Li, Fima Macheret, Rodney A Gabriel, and Lucila Ohno-Machado. 2020. Journal of the American Medical Informatics Association : JAMIA 27 (4): 621–33.
Jung, Alexander, Alfred O. Hero III, Alexandru Mara, and Saeed Jahromi. 2016. arXiv:1612.01414 [Cs, Stat], December.
Kim, Ilmun, Aaditya Ramdas, Aarti Singh, and Larry Wasserman. 2021. The Annals of Statistics 49 (1): 411–34.
Lobo, Jorge M., Alberto Jiménez-Valverde, and Raimundo Real. 2008. Global Ecology and Biogeography 17 (2): 145–51.
Matthews, B. W. 1975. Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442–51.
Menon, Aditya Krishna, and Robert C. Williamson. 2016. Journal of Machine Learning Research 17 (195): 1–102.
Nock, Richard, Aditya Krishna Menon, and Cheng Soon Ong. 2016. arXiv:1607.00360 [Cs, Stat], July.
Polson, Nicholas G., James G. Scott, and Jesse Windle. 2013. Journal of the American Statistical Association 108 (504): 1339–49.
Powers, David Martin. 2007.
Read, Jesse, Bernhard Pfahringer, Geoffrey Holmes, and Eibe Frank. 2021. Journal of Artificial Intelligence Research 70 (February): 683–718.
Reid, Mark D., and Robert C. Williamson. 2011. Journal of Machine Learning Research 12 (Mar): 731–817.
Spackman, Kent A. 1989. In Proceedings of the Sixth International Workshop on Machine Learning, edited by Alberto Maria Segre, 160–63. San Francisco (CA): Morgan Kaufmann.
