Classification

Computer says no



Distinguishing whether a thing was generated from distribution A or B. This is a learning-theory, loss-minimisation framing; we might also consider predicting probability distributions over categories.

Mostly this page is a list of target losses with which to assess classifiers.

Classification loss zoo

Surprisingly subtle. ROC, AUC, precision/recall, confusion…

One of the less abstruse summaries of these is the scikit-learn classifier loss page, which includes both formulae and verbal descriptions. The Pirates guide to various scores provides an easy introduction.

Matthews correlation coefficient

Due to Matthews (1975). This is a good first choice of metric, since a single formula behaves reasonably for two-class or multi-class problems, balanced or unbalanced, and it’s computationally cheap. Unless your classes have vastly different importance, this is a good default.

However, it is not differentiable with respect to classification certainties, so it can’t be used as, e.g., a training loss for neural nets; instead one optimises a differentiable surrogate and uses the MCC to track progress.

2-class case

Take your \(2 \times 2\) confusion matrix of true positives, false positives, true negatives and false negatives, then

\[ {\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}} \]

\[ |{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}} \]

where \(\chi^2\) is the chi-squared statistic of the contingency table and \(n\) is the total number of observations.
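
A minimal numpy sketch of the two-class formula, checked against scikit-learn’s `matthews_corrcoef` (the toy labels are made up):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# Tally the 2x2 confusion matrix by hand.
tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

mcc = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
assert np.isclose(mcc, matthews_corrcoef(y_true, y_pred))
```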

Multiclass case

Take your \(K \times K\) confusion matrix \(C\), then

\[ \text{MCC} = \frac{\sum_k \sum_l \sum_m \left(C_{kk}C_{lm} - C_{kl}C_{mk}\right)}{\sqrt{\sum_k \left(\sum_l C_{kl}\right)\left(\sum_{k'\neq k}\sum_{l'} C_{k'l'}\right)}\;\sqrt{\sum_k \left(\sum_l C_{lk}\right)\left(\sum_{k'\neq k}\sum_{l'} C_{l'k'}\right)}} \]
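
In practice one computes the equivalent covariance-style rearrangement from Gorodkin (2004) rather than the triple sum. A sketch, checked against scikit-learn (toy labels again):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])

C = confusion_matrix(y_true, y_pred)
t = C.sum(axis=1)  # how often each class truly occurred (row sums)
p = C.sum(axis=0)  # how often each class was predicted (column sums)
c = np.trace(C)    # correctly classified samples
s = C.sum()        # total samples

mcc = (c * s - t @ p) / np.sqrt((s**2 - p @ p) * (s**2 - t @ t))
assert np.isclose(mcc, matthews_corrcoef(y_true, y_pred))
```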

ROC/AUC

Receiver Operating Characteristic/Area Under the Curve. Supposedly dates back to radar operators in WWII. The ROC curve graphs the true positive rate against the false positive rate as the decision threshold changes; the AUC, the area under that curve, equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. Hanley and McNeil (1983) discuss the AUC for radiology; supposedly Spackman (1989) introduced it to machine learning, but I haven’t read the article in question. It allows us to trade off the importance of false positives against false negatives.
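
For concreteness, a minimal sketch with scikit-learn’s `roc_curve`, using toy scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # e.g. predicted probabilities

# Sweep the decision threshold, recording (false positive rate, true positive rate).
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(roc_auc_score(y_true, scores))  # 0.75
print(auc(fpr, tpr))                  # same number, integrated from the curve
```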

Cross entropy

I’d better write down the form of this, since most ML toolkits are curiously shy about it.

Let \(x\) be the estimated probability and \(z\) be the supervised class label. Then the binary cross entropy loss is

\[ \ell(x,z) = -z\log(x) - (1-z)\log(1-x) \]

If we are given not a probability but a logit \(y=\operatorname{logit}(x)\), then the numerically stable version is

\[ \ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|)) \]
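
A small numpy sketch showing that the two forms agree, and that the logit form survives inputs which overflow the naive one (the helper names are mine):

```python
import numpy as np

def bce_naive(y, z):
    """Binary cross entropy computed via the probability x = sigmoid(y)."""
    x = 1.0 / (1.0 + np.exp(-y))
    return -z * np.log(x) - (1 - z) * np.log(1 - x)

def bce_stable(y, z):
    """Numerically stable binary cross entropy straight from the logit y."""
    return np.maximum(y, 0) - y * z + np.log1p(np.exp(-np.abs(y)))

y = np.array([-3.0, 0.5, 4.0])
z = np.array([0.0, 1.0, 1.0])
assert np.allclose(bce_naive(y, z), bce_stable(y, z))

# The naive form saturates to log(0) = -inf here; the stable form does not.
print(bce_stable(np.array([1000.0]), np.array([0.0])))  # [1000.]
```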

f-measure et al

🏗

Gumbel-max

See Gumbel-max tricks.

Pólya-Gamma augmentation

See Pólya-Gamma.

Unbalanced class problems

🏗

Hierarchical multi-class classifiers

Read et al. (2021) discuss how to build multi-label classifiers by chaining binary classifiers, feeding each one’s prediction in as a feature for the next, which is an elegant solution IMO.
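
scikit-learn ships an implementation of this idea as `ClassifierChain`; a minimal multi-label sketch (dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)

# Each binary classifier in the chain sees X plus the predictions of all
# earlier classifiers in the chain as extra features.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=0)
chain.fit(X, Y)
print(chain.predict(X[:3]))  # one 0/1 column per label
```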

Philosophical connection to semantics

Since semantics is what humans call classifiers.

Connection to legibility

I do think there is something interesting happening with legibility. States need to classify, apparently. Adversarial classification is my point of entry into this.

Relative distributions

Why characterise the difference between two distributions by a summary statistic? We could instead work with an object which is itself a relative distribution.

References

Arya, Gaurav, Moritz Schauer, Frank Schäfer, and Christopher Vincent Rackauckas. 2022. “Automatic Differentiation of Programs with Discrete Randomness.” In.
Flach, Peter, José Hernández-Orallo, and Cesar Ferri. 2011. “A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance.” In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 657–64.
Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.
Goorbergh, Ruben van den, Maarten van Smeden, Dirk Timmerman, and Ben Van Calster. 2022. “The Harm of Class Imbalance Corrections for Risk Prediction Models: Illustration and Simulation Using Logistic Regression.” Journal of the American Medical Informatics Association, June, ocac093.
Gorodkin, J. 2004. “Comparing two K-category assignments by a K-category correlation coefficient.” Computational Biology and Chemistry 28 (5-6): 367–74.
Gozli, Davood. 2023. “Principles of Categorization: A Synthesis.” Seeds of Science.
Grathwohl, Will, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris J. Maddison. 2021. “Oops I Took A Gradient: Scalable Sampling for Discrete Distributions.” arXiv.
Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.” Machine Learning 77 (1): 103–23.
Hanley, J A, and B J McNeil. 1983. “A Method of Comparing the Areas Under Receiver Operating Characteristic Curves Derived from the Same Cases.” Radiology 148 (3): 839–43.
Huang, Yingxiang, Wentao Li, Fima Macheret, Rodney A Gabriel, and Lucila Ohno-Machado. 2020. “A Tutorial on Calibration Measurements and Calibration Models for Clinical Prediction Models.” Journal of the American Medical Informatics Association 27 (4): 621–33.
Jung, Alexander, Alfred O. Hero III, Alexandru Mara, and Saeed Jahromi. 2016. “Semi-Supervised Learning via Sparse Label Propagation.” arXiv:1612.01414 [Cs, Stat], December.
Kim, Ilmun, Aaditya Ramdas, Aarti Singh, and Larry Wasserman. 2021. “Classification Accuracy as a Proxy for Two-Sample Testing.” The Annals of Statistics 49 (1): 411–34.
Lobo, Jorge M., Alberto Jiménez-Valverde, and Raimundo Real. 2008. “AUC: A Misleading Measure of the Performance of Predictive Distribution Models.” Global Ecology and Biogeography 17 (2): 145–51.
Matthews, B. W. 1975. “Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme.” Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442–51.
Menon, Aditya Krishna, and Robert C. Williamson. 2016. “Bipartite Ranking: A Risk-Theoretic Perspective.” Journal of Machine Learning Research 17 (195): 1–102.
Nock, Richard, Aditya Krishna Menon, and Cheng Soon Ong. 2016. “A Scaled Bregman Theorem with Applications.” arXiv:1607.00360 [Cs, Stat], July.
Polson, Nicholas G., James G. Scott, and Jesse Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” Journal of the American Statistical Association 108 (504): 1339–49.
Powers, David Martin. 2007. “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation.”
Read, Jesse, Bernhard Pfahringer, Geoffrey Holmes, and Eibe Frank. 2021. “Classifier Chains: A Review and Perspectives.” Journal of Artificial Intelligence Research 70 (February): 683–718.
Reid, Mark D., and Robert C. Williamson. 2011. “Information, Divergence and Risk for Binary Experiments.” Journal of Machine Learning Research 12 (Mar): 731–817.
Spackman, Kent A. 1989. “Signal Detection Theory: Valuable Tools for Evaluating Inductive Learning.” In Proceedings of the Sixth International Workshop on Machine Learning, edited by Alberto Maria Segre, 160–63. San Francisco (CA): Morgan Kaufmann.
