Labelling losses, fitting classifiers etc


Precision, recall and F-scores all work for multi-label classification, although they behave badly when the classes are unbalanced.
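A minimal sketch of one such bad quality, as I read it: the F-score ignores true negatives, so it depends on which class you call "positive". The counts are made up for illustration, and the helper `precision_recall_f1` is hypothetical, not taken from any library.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts.

    Undefined ratios (0/0) are reported as 0.0, which is a convention
    chosen for this sketch, not a universal rule.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up data: 95 items of the common class, 5 of the rare class,
# and a classifier that always predicts the common class.

# Common class labelled "positive": TP=95, FP=5, FN=0.
p, r, f1 = precision_recall_f1(tp=95, fp=5, fn=0)

# Same predictions, rare class labelled "positive": TP=0, FP=0, FN=5.
p_swapped, r_swapped, f1_swapped = precision_recall_f1(tp=0, fp=0, fn=5)

print(f"common class positive: F1 = {f1:.3f}")  # about 0.97
print(f"rare class positive:   F1 = {f1_swapped:.3f}")
```

The classifier is equally uninformative in both cases, yet the score swings from near-perfect to zero depending on the labelling.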

Unbalanced class problems


Metric Zoo

One of the less abstruse summaries of these is the scikit-learn classifier loss page, which includes both formulae and verbal descriptions; such a summary is surprisingly hard to find in, e.g., the documentation for deep learning toolkits, in keeping with the field’s general taste for magical black boxes. The Pirates guide to various scores provides an easy introduction.

Matthews correlation coefficient

Due to Matthews (Matthews 1975). This is the first choice for seamlessly handling multiclass problems, since its behaviour is reasonable for 2-class or multiclass, balanced or unbalanced problems, and it is computationally cheap. Unless your classes differ vastly in importance, this is a good default.

However, it is not differentiable with respect to classification certainties, so you can’t use it as, e.g., a training objective in neural nets. Instead you train on surrogate measures which are differentiable, and use MCC to track your progress.

2-class case

Take your \(2 \times 2\) confusion matrix of true positives \(TP\), false positives \(FP\), true negatives \(TN\) and false negatives \(FN\). Then

\[ {\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}} \]

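Its magnitude is related to the \(\chi^2\) statistic of the same \(2 \times 2\) table, with \(n\) the total number of observations, by

\[ |{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}} \]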
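A small sketch of the 2-class formula in plain Python, together with a numerical check of the \(\chi^2\) relation. The function names and the example counts are my own assumptions for illustration. Returning 0 when a margin of the table is empty is a common convention rather than part of the definition.

```python
import math


def mcc_binary(tp, tn, fp, fn):
    """Matthews correlation coefficient from 2x2 confusion counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention for this sketch: report 0 if any margin is empty.
    return num / den if den else 0.0


def chi2_2x2(tp, tn, fp, fn):
    """Pearson chi-square statistic of the 2x2 table, no continuity correction."""
    n = tp + tn + fp + fn
    observed = [[tp, fn], [fp, tn]]   # rows: true class, columns: predicted class
    rows = [tp + fn, fp + tn]
    cols = [tp + fp, fn + tn]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Illustrative counts, n = 12.
m = mcc_binary(tp=6, tn=3, fp=1, fn=2)
chi2 = chi2_2x2(tp=6, tn=3, fp=1, fn=2)
print(m, math.sqrt(chi2 / 12))  # the two numbers should agree
```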

Multiclass case

Take your \(K \times K\) confusion matrix \(C\), then

\[ \text{MCC}=\frac{\sum_{k}\sum_{l}\sum_{m}\left(C_{kk}C_{lm}-C_{kl}C_{mk}\right)}{\sqrt{\sum_{k}\left(\sum_{l}C_{kl}\right)\left(\sum_{k'|k'\neq k}\sum_{l'}C_{k'l'}\right)}\sqrt{\sum_{k}\left(\sum_{l}C_{lk}\right)\left(\sum_{k'|k'\neq k}\sum_{l'}C_{l'k'}\right)}} \]
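The multiclass formula looks worse than it is. Here is a sketch that evaluates the triple sum literally, assuming \(C\) is given as a list of lists of counts; the name `mcc_multiclass` is hypothetical. The inner sums over \(k' \neq k\) are just "total minus row \(k\)" or "total minus column \(k\)", which is how the denominator is computed below.

```python
import math


def mcc_multiclass(C):
    """MCC from a K x K confusion matrix C (list of lists of counts)."""
    K = len(C)
    total = sum(sum(row) for row in C)
    row_sums = [sum(C[k]) for k in range(K)]
    col_sums = [sum(C[l][k] for l in range(K)) for k in range(K)]

    # Numerator: the triple sum over k, l, m, written out as in the formula.
    num = sum(
        C[k][k] * C[l][m] - C[k][l] * C[m][k]
        for k in range(K)
        for l in range(K)
        for m in range(K)
    )

    # Denominator: summing over all rows (columns) except k gives total minus
    # the k-th row (column) sum.
    den_rows = sum(row_sums[k] * (total - row_sums[k]) for k in range(K))
    den_cols = sum(col_sums[k] * (total - col_sums[k]) for k in range(K))
    den = math.sqrt(den_rows) * math.sqrt(den_cols)

    # Convention for this sketch: report 0 if the denominator vanishes.
    return num / den if den else 0.0


# With K = 2 this should reduce to the 2-class formula.
two_class = mcc_multiclass([[6, 2], [1, 3]])

# A perfect 3-class classifier has a diagonal confusion matrix.
perfect = mcc_multiclass([[5, 0, 0], [0, 3, 0], [0, 0, 2]])
print(two_class, perfect)
```

The result does not change if you transpose \(C\), so it does not matter whether rows hold the true classes or the predicted ones.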


Receiver Operating Characteristic/Area Under Curve

Supposedly dates back to radar operators in the mid-20th century. (Hanley and McNeil 1983) talk about the AUC for radiology; supposedly (Spackman 1989) introduced it to machine learning, but I haven’t read the article in question. Allows you to trade off the importance of false positives against false negatives.
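To make the trade-off concrete, here is a rough sketch in plain Python. It sweeps a decision threshold to get (false positive rate, true positive rate) points, integrates them with the trapezoid rule, and compares the result with the rank interpretation of the AUC: the probability that a random positive scores higher than a random negative, with ties counted as half. The function names and the toy scores are assumptions for illustration, and the code assumes labels in \(\{0,1\}\) with at least one example of each class.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by thresholding at each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve through the given points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: four examples, two of each class.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

curve = roc_points(scores, labels)
auc_curve = trapezoid_area(curve)
auc_rank = auc_pairwise(scores, labels)
print(curve, auc_curve, auc_rank)
```

Each point on the curve corresponds to one choice of threshold, which is where the relative cost of false positives and false negatives comes in.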

Cross entropy

I’d better write down the form for this, since most ML toolkits are curiously shy about it.

Let \(x \in (0,1)\) be the estimated probability that the label is 1, and let \(z \in \{0,1\}\) be the supervised class label. Then the binary cross entropy loss is

\[ \ell(x,z) = -z\log(x) - (1-z)\log(1-x) \]

If what you have is not a probability but a logit, \(y=\operatorname{logit}(x)\), then the numerically stable version is

\[ \ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|)) \]
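A quick sanity check in plain Python, as a sketch rather than anyone's reference implementation. It evaluates the logit form \(\max\{y,0\} - yz + \log(1+\exp(-|y|))\) and compares it with the probability form applied to the sigmoid of \(y\). The function names are hypothetical.

```python
import math


def sigmoid(y):
    """Inverse of the logit; fine for moderate y, overflows for very negative y."""
    return 1.0 / (1.0 + math.exp(-y))


def bce_from_probability(x, z):
    """Binary cross entropy for a probability x in (0, 1) and a label z in {0, 1}."""
    return -z * math.log(x) - (1 - z) * math.log(1 - x)


def bce_from_logit(y, z):
    """Binary cross entropy computed directly from the logit y.

    exp is only ever called on a non-positive argument, so it cannot overflow.
    """
    return max(y, 0.0) - y * z + math.log1p(math.exp(-abs(y)))


# For moderate logits, the two forms agree up to rounding.
for y in (-3.0, -0.5, 0.0, 2.0):
    for z in (0, 1):
        print(y, z, bce_from_logit(y, z), bce_from_probability(sigmoid(y), z))

# For extreme logits the logit form still returns a finite loss.
extreme = bce_from_logit(1000.0, 0)
print(extreme)
```

Going through the probability at \(y = 1000\) would instead require \(\log(1 - x)\) with \(x\) rounded to exactly 1, which is why the logit form is preferred.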

f-measure et al


Philosophical connection to semantics

Since semantics is what humans call classifiers.

Brehmer, Johann, Kyle Cranmer, Siddharth Mishra-Sharma, Felix Kling, and Gilles Louppe. 2019. “Mining Gold: Improving Simulation-Based Inference with Latent Information.” In, 7.

Cranmer, Miles D, Rui Xu, Peter Battaglia, and Shirley Ho. 2019. “Learning Symbolic Physics with Graph Networks.” In Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS), 6.

Flach, Peter, José Hernández-Orallo, and Cesar Ferri. 2011. “A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance.” In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 657–64.

Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.

Gorodkin, J. 2004. “Comparing Two K-Category Assignments by a K-Category Correlation Coefficient.” Computational Biology and Chemistry 28 (5-6): 367–74.

Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.” Machine Learning 77 (1): 103–23.

Hanley, J A, and B J McNeil. 1983. “A Method of Comparing the Areas Under Receiver Operating Characteristic Curves Derived from the Same Cases.” Radiology 148 (3): 839–43.

Jung, Alexander, Alfred O. Hero III, Alexandru Mara, and Saeed Jahromi. 2016. “Semi-Supervised Learning via Sparse Label Propagation,” December.

Kasim, Muhammad, J Topp-Mugglestone, P Hatfield, D H Froula, G Gregori, M Jarvis, E Viezzer, and Sam Vinko. 2019. “A Million Times Speed up in Parameters Retrieval with Deep Learning.” In, 5.

Lobo, Jorge M., Alberto Jiménez-Valverde, and Raimundo Real. 2008. “AUC: A Misleading Measure of the Performance of Predictive Distribution Models.” Global Ecology and Biogeography 17 (2): 145–51.

Lu, Lu, Zhiping Mao, and Xuhui Meng. 2019. “DeepXDE: A Deep Learning Library for Solving Differential Equations.” In, 6.

Matthews, B. W. 1975. “Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme.” Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442–51.

Menon, Aditya Krishna, and Robert C. Williamson. 2016. “Bipartite Ranking: A Risk-Theoretic Perspective.” Journal of Machine Learning Research 17 (195): 1–102.

Nock, Richard, Aditya Krishna Menon, and Cheng Soon Ong. 2016. “A Scaled Bregman Theorem with Applications,” July.

Powers, David Martin. 2007. “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation.”

Reid, Mark D., and Robert C. Williamson. 2011. “Information, Divergence and Risk for Binary Experiments.” Journal of Machine Learning Research 12 (Mar): 731–817.

Rezende, Danilo J, Sébastien Racanière, Irina Higgins, and Peter Toth. 2019. “Equivariant Hamiltonian Flows.” In Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS), 6.

Sarkar, Soumalya, and Michael Joly. 2019. “Multi-Fidelity Learning with Heterogeneous Domains.” In, 5.

Spackman, Kent A. 1989. “Signal Detection Theory: Valuable Tools for Evaluating Inductive Learning.” In Proceedings of the Sixth International Workshop on Machine Learning, 160–63. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.