# Classification

Computer says no

February 20, 2017 — September 13, 2024

classification
metrics
statistics

Distinguishing whether a thing was generated from distribution A or B. This is a learning-theory, loss-minimisation framing; we might also consider probability distributions of categories.

Mostly this page is a list of classification target losses to assess classifiers.

## 1 Classification loss zoo

Surprisingly subtle. ROC, AUC, precision/recall, confusion…

One of the less abstruse summaries of these is the scikit-learn classifier loss page, which includes both formulae and verbal descriptions. The Pirates guide to various scores provides an easy introduction.
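To make a few of these names concrete, here is a minimal sketch that counts a two-class confusion matrix and derives precision, recall, F1 and accuracy from it. The function name `binary_scores` and the toy labels are made up for illustration; in practice the `sklearn.metrics` module covers the same ground.

```python
def binary_scores(y_true, y_pred):
    """Count the 2x2 confusion matrix and derive a few common scores.

    Labels are assumed to be 0 (negative) or 1 (positive).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    nan = float("nan")  # returned when a score is undefined
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else nan,
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy data (hypothetical): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_scores(y_true, y_pred)
print(scores)
```

Note that every score here is a function of the four counts alone, which is why the choice between them is really a choice about which errors matter.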

## 2 But actually just use expected cost

“No Need for Ad-Hoc Substitutes: The Expected Cost Is a Principled All-Purpose Classification Metric” (2024); Ferrer (2023); Dyrland, Lundervold, and Mana (2023). TBC
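As a rough sketch of the idea, assuming the plain (unnormalised) form: pick a cost matrix whose entry `cost[t][d]` is the cost of deciding `d` when the truth is `t`, and average it over the test samples. The function name, the toy data and the cost values below are my own illustration, and the cited papers may normalise the quantity differently.

```python
def expected_cost(y_true, y_decision, cost):
    """Average of cost[true class][decision] over all samples."""
    total = sum(cost[t][d] for t, d in zip(y_true, y_decision))
    return total / len(y_true)


# Toy data (hypothetical): one false negative and one false positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_decision = [1, 1, 0, 1, 0, 0, 0, 0]

# With 0-1 costs the expected cost is just the error rate.
zero_one = [[0, 1],
            [1, 0]]
error_rate = expected_cost(y_true, y_decision, zero_one)

# Hypothetical asymmetric costs: a miss (true 1, decide 0) costs 5,
# a false alarm (true 0, decide 1) costs 1.
asymmetric = [[0, 1],
              [5, 0]]
weighted = expected_cost(y_true, y_decision, asymmetric)

print(error_rate, weighted)
```

The same two mistakes cost 0.25 per sample under 0-1 costs and 0.75 under the asymmetric costs, which is the whole point: the metric changes with the application, not with the classifier.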

### 2.1 Matthews correlation coefficient

Due to Matthews (1975). This is the first choice for seamlessly handling multi-class problems within a single algorithm, since its behaviour is reasonable for 2-class or multi-class, balanced or unbalanced problems, and it’s computationally cheap. Unless you have a vastly different importance for your classes, this is a good default.

However, it is not differentiable with respect to classification certainties, so you can’t directly use it as, e.g., a target loss in neural nets. Therefore you use surrogate measures which are differentiable, and intermittently check that they actually help the MCC.

I tell ya what, though, it looks like it could be made differentiable via a relaxation, and variationally distributional if we interpreted it in a likelihood context. Hmm.
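One possible relaxation, offered purely as a speculative sketch and not taken from any of the references: replace the hard counts in the two-class confusion matrix with sums of predicted probabilities, so that the score becomes a smooth function of the classifier outputs. The name `soft_mcc` is hypothetical.

```python
import math


def soft_mcc(probs, labels):
    """Two-class MCC with 'soft' confusion counts.

    probs  : predicted probabilities of the positive class, in [0, 1]
    labels : true labels, 0 or 1
    With hard 0/1 'probabilities' this reduces to the ordinary MCC.
    """
    tp = sum(p * z for p, z in zip(probs, labels))
    fp = sum(p * (1 - z) for p, z in zip(probs, labels))
    fn = sum((1 - p) * z for p, z in zip(probs, labels))
    tn = sum((1 - p) * (1 - z) for p, z in zip(probs, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (hypothetical).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.1, 0.2, 0.1, 0.3]
hard = [1 if p > 0.5 else 0 for p in probs]

print(soft_mcc(probs, labels))  # smooth in probs
print(soft_mcc(hard, labels))   # equals the usual MCC of the thresholded predictions
```

Whether optimising this surrogate actually improves the thresholded MCC is an empirical question; the sketch only shows that the relaxation is well defined and agrees with the hard score at the corners.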

#### 2.1.1 2-class case

Take your $$2 \times 2$$ confusion matrix of true positives, false positives, etc.

${\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}}$

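Its magnitude is related to the chi-square statistic $$\chi^{2}$$ of the $$2 \times 2$$ contingency table over $$n$$ observations:

$|{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}}$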
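A minimal sketch of the two-class formula, with a numerical check of the chi-square relation on a toy table. The function names and counts are illustrative; the chi-square here is the plain Pearson statistic without continuity correction, and returning 0 for an empty margin is a convention I have assumed.

```python
import math


def mcc_binary(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when a margin is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def chi_square_2x2(tp, tn, fp, fn):
    """Pearson chi-square statistic of the 2x2 table (no continuity correction)."""
    n = tp + tn + fp + fn
    observed = [[tp, fn], [fp, tn]]   # rows: true class, columns: predicted class
    row_sums = [tp + fn, fp + tn]
    col_sums = [tp + fp, fn + tn]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Toy counts (hypothetical).
tp, tn, fp, fn = 2, 4, 1, 1
n = tp + tn + fp + fn
mcc = mcc_binary(tp, tn, fp, fn)
chi2 = chi_square_2x2(tp, tn, fp, fn)
print(mcc, math.sqrt(chi2 / n))
```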

#### 2.1.2 Multiclass case

Take your $$K \times K$$ confusion matrix $$C$$, then

${\displaystyle {\text{MCC}}={\frac {\sum _{k}\sum _{l}\sum _{m}\left(C_{kk}C_{lm}-C_{kl}C_{mk}\right)}{{\sqrt {\sum _{k}(\sum _{l}C_{kl})(\sum _{k'|k'\neq k}\sum _{l'}C_{k'l'})}}{\sqrt {\sum _{k}(\sum _{l}C_{lk})(\sum _{k'|k'\neq k}\sum _{l'}C_{l'k'})}}}}}$
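The triple sum collapses to a few row and column totals, which is what makes the score cheap. The sketch below assumes rows of the confusion matrix index the true class and columns the predicted class; the function name and example matrices are illustrative. Writing `s` for the total count, `c` for the trace, `t` for the row sums and `p` for the column sums, the numerator is `c*s - sum(p[k]*t[k])` and the two denominator factors are `s**2 - sum(t[k]**2)` and `s**2 - sum(p[k]**2)`.

```python
import math


def mcc_multiclass(C):
    """Multiclass MCC from a K x K confusion matrix given as a list of lists."""
    K = len(C)
    s = sum(sum(row) for row in C)                           # total samples
    c = sum(C[k][k] for k in range(K))                       # correctly classified
    t = [sum(C[k]) for k in range(K)]                        # row sums
    p = [sum(C[l][k] for l in range(K)) for k in range(K)]   # column sums
    numerator = c * s - sum(p[k] * t[k] for k in range(K))
    denom = math.sqrt(s * s - sum(x * x for x in t)) * math.sqrt(
        s * s - sum(x * x for x in p)
    )
    # Assumed convention: return 0 when every sample falls in a single row or column.
    return numerator / denom if denom else 0.0


# Hypothetical examples.
two_class = [[4, 1],
             [1, 2]]
perfect = [[3, 0, 0],
           [0, 2, 0],
           [0, 0, 4]]
print(mcc_multiclass(two_class), mcc_multiclass(perfect))
```

For $$K=2$$ this agrees with the two-class formula, as the first example shows (it is the table with TN=4, FP=1, FN=1, TP=2).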

### 2.2 ROC/AUC

Receiver Operating Characteristic/Area Under Curve. Supposedly dates back to radar operators in WWII. The ROC curve is the graph of the true positive rate against the false positive rate as the decision criterion changes; the AUC is the area under that curve. Matthews (1975) talks about the AUC for radiology; supposedly Spackman (1989) introduced it to machine learning, but I haven’t read the article in question. Allows us to trade off the importance of false positives and false negatives.
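A minimal sketch of how the curve and its area can be computed from raw scores, assuming higher scores mean "more positive". It sweeps the threshold from high to low, then checks the trapezoid area against the equivalent rank interpretation (the probability that a random positive outscores a random negative, ties counting one half). All names and data are illustrative.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs as the threshold decreases."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Consume every sample tied at this score before emitting a point.
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_rank(scores, labels):
    """P(score of random positive > score of random negative), ties count 1/2."""
    pos = [s for s, z in zip(scores, labels) if z == 1]
    neg = [s for s, z in zip(scores, labels) if z == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.1, 0.2, 0.1, 0.3]
points = roc_points(scores, labels)
print(auc_trapezoid(points), auc_rank(scores, labels))
```

Picking an operating point on the curve is where the false positive/false negative trade-off enters; the AUC itself averages over all of them.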

### 2.3 Cross entropy

I’d better write down an explicit form for this, since most ML toolkits are curiously shy about giving it even though it’s the default.

Let $$x$$ be the estimated probability and $$z$$ be the supervised class label. Then the binary cross entropy loss is

$\ell(x,z) = -z\log(x) - (1-z)\log(1-x)$

If instead we are given not the probability $$x$$ but its logit $$y=\operatorname{logit}(x)$$, then the numerically stable version is

$\ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|))$
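A small sketch comparing the two forms. The logit version below uses `max(y, 0) - y*z + log(1 + exp(-|y|))`, which follows from substituting the sigmoid of `y` for the probability and simplifying; it never exponentiates a large positive number. Function names are illustrative.

```python
import math


def sigmoid(y):
    """Logistic function; fine for moderate y, overflows for very negative y."""
    return 1.0 / (1.0 + math.exp(-y))


def bce_from_probability(x, z):
    """Binary cross entropy from a probability x in (0, 1) and a label z in {0, 1}."""
    return -z * math.log(x) - (1 - z) * math.log(1 - x)


def bce_from_logit(y, z):
    """Binary cross entropy from a logit y, in the numerically stable form."""
    return max(y, 0.0) - y * z + math.log1p(math.exp(-abs(y)))


# The two forms agree where the naive one is well behaved.
y, z = 0.7, 1
print(bce_from_probability(sigmoid(y), z), bce_from_logit(y, z))

# For an extreme logit the naive route fails: sigmoid(1000.0) rounds to 1.0,
# so log(1 - x) is log(0). The logit form stays finite.
extreme = bce_from_logit(1000.0, 0)
print(extreme)
```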

🏗

TBC.

See Pólya-Gamma.

🏗

## 7 Analog Bits

Fuchi et al. (2023):

> The one-hot vector has long been widely used in machine learning as a simple and generic method for representing discrete data. However, this method increases the number of dimensions linearly with the categorical data to be represented, which is problematic from the viewpoint of spatial computational complexity in deep learning, which requires a large amount of data. Recently, Analog Bits, a method for representing discrete data as a sequence of bits, was proposed on the basis of the high expressiveness of diffusion models. However, since the number of category types to be represented in a generation task is not necessarily at a power of two, there is a discrepancy between the range that Analog Bits can represent and the range represented as category data. If such a value is generated, the problem is that the original category value cannot be restored. To address this issue, we propose Residual Bit Vector (ResBit), which is a hierarchical bit representation. Although it is a general-purpose representation method, in this paper, we treat it as numerical data and show that it can be used as an extension of Analog Bits using Table Residual Bit Diffusion (TRBD), which is incorporated into TabDDPM, a tabular data generation method. We experimentally confirmed that TRBD can generate diverse and high-quality data from small-scale table data to table data containing diverse category values faster than TabDDPM. Furthermore, we show that ResBit can also serve as an alternative to the one-hot vector by utilizing ResBit for conditioning in GANs and as a label expression in image classification.

## 8 Hierarchical Multi-class classifiers

Read et al. (2021) discuss how to create multi-class classifiers by stacking layers of binary classifiers and using each as a feature input to the next, which is an elegant solution IMO.

## 9 Philosophical connection to semantics

Since semantics is what humans call classifiers.

## 10 Connection to legibility

I do think there is something interesting happening with legibility. States need to classify, apparently. Adversarial classification is my point of entry into this.

## 11 References

Arya, Schauer, Schäfer, et al. 2022. In.
Baldi, Brunak, Chauvin, et al. 2000. Bioinformatics.
Brodersen, Ong, Stephan, et al. 2010. In Proceedings of the 2010 20th International Conference on Pattern Recognition. ICPR ’10.
Chen, Zhang, and Hinton. 2022. In.
Che, Zhang, Sohl-Dickstein, et al. 2020. arXiv:2003.06060 [Cs, Stat].
Dyrland, Lundervold, and Mana. 2023.
Ferrer. 2023.
Flach, Hernández-Orallo, and Ferri. 2011. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
Fuchi, Zanashir, Minami, et al. 2023.
Gneiting, and Raftery. 2007. Journal of the American Statistical Association.
Gorodkin. 2004. Computational Biology and Chemistry.
Gozli. 2023. Seeds of Science.
Grathwohl, Swersky, Hashemi, et al. 2021.
Hand. 2009. Machine Learning.
Huang, Li, Macheret, et al. 2020. Journal of the American Medical Informatics Association : JAMIA.
Jung, Hero III, Mara, et al. 2016. arXiv:1612.01414 [Cs, Stat].
Kim, Ramdas, Singh, et al. 2021. The Annals of Statistics.
Lobo, Jiménez-Valverde, and Real. 2008. Global Ecology and Biogeography.
Matthews. 1975. Biochimica Et Biophysica Acta (BBA) - Protein Structure.
Menon, and Williamson. 2016. Journal of Machine Learning Research.
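“No Need for Ad-Hoc Substitutes: The Expected Cost Is a Principled All-Purpose Classification Metric.” 2024. Transactions on Machine Learning Research.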
Nock, Menon, and Ong. 2016. arXiv:1607.00360 [Cs, Stat].
Polson, Scott, and Windle. 2013. Journal of the American Statistical Association.
Provost, and Fawcett. 2001. Machine Learning.
Read, Pfahringer, Holmes, et al. 2021. Journal of Artificial Intelligence Research.
Reid, and Williamson. 2011. Journal of Machine Learning Research.
Spackman. 1989. In Proceedings of the Sixth International Workshop on Machine Learning.
Tiao, Bonilla, and Ramos. 2018.
Tiao, Klein, Seeger, et al. 2021. In Proceedings of the 38th International Conference on Machine Learning.
van den Goorbergh, van Smeden, Timmerman, et al. 2022. Journal of the American Medical Informatics Association.