Maximum Mean Discrepancy

An integral probability metric at the intersection of reproducing-kernel methods, dependence tests and probability metrics, where we use a kernel embedding — typically an RKHS embedding — to cleverly measure differences between probability distributions.
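In symbols: MMD is the integral probability metric generated by the unit ball of the RKHS \(\mathcal{H}\) of a kernel \(k\) (cf. Gretton, Borgwardt, et al. 2012):

```latex
\operatorname{MMD}(P, Q)
  = \sup_{\|f\|_{\mathcal{H}} \le 1}
    \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right|
  = \left\| \mu_P - \mu_Q \right\|_{\mathcal{H}},
```

where \(\mu_P = \mathbb{E}_{X \sim P}\, k(X, \cdot)\) is the kernel mean embedding of \(P\). For a characteristic kernel, \(\operatorname{MMD}(P, Q) = 0\) iff \(P = Q\).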

It can be estimated from samples alone, which is neat.
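For instance, here is a minimal numpy sketch of the unbiased estimator of squared MMD from Gretton, Borgwardt, et al. (2012), using a Gaussian kernel with a hand-picked bandwidth (both of those choices are mine, for illustration, not canonical):

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between the rows of x and y.
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth**2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    # Unbiased estimator of squared MMD: the diagonal (self-similarity)
    # terms are excluded from the within-sample averages.
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    term_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_x + term_y - 2 * k_xy.mean()
```

On samples from the same distribution the estimate hovers around zero (it can go slightly negative, being unbiased); on samples from well-separated distributions it is clearly positive.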

A mere placeholder. For a thorough treatment see the canonical references (Gretton et al. 2008; Gretton, Borgwardt, et al. 2012).


Arthur Gretton, Dougal Sutherland and Wittawat Jitkrittum have a presentation: Interpretable Comparison of Distributions and Models.

Danica Sutherland’s explanation is IMO magnificent.

Pierre Alquier’s post Universal estimation with Maximum Mean Discrepancy (MMD) shows how to use MMD in a robust nonparametric estimator.

Gaël Varoquaux’s introduction, Comparing distributions: Kernels estimate good representations, l1 distances give good tests, is friendly and illustrated; it is based on (Scetbon and Varoquaux 2019).

Connection to Optimal transport losses

Husain (2020)’s results connect IPMs to transport metrics, regularisation theory and classification.

Feydy et al. (2019) connect MMD to optimal transport losses.

Arbel et al. (2019) also looks pertinent and has some connections to Wasserstein gradient flows.

Choice of kernel

Hmm. See Gretton, Sriperumbudur, et al. (2012).
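One common default, widespread in the two-sample-testing literature (though not an optimal choice in the sense of Gretton, Sriperumbudur, et al. 2012), is the median heuristic: set the Gaussian-kernel bandwidth to the median pairwise distance in the pooled sample. A sketch (the function name is mine):

```python
import numpy as np


def median_heuristic_bandwidth(x, y):
    # Median heuristic: bandwidth = median pairwise Euclidean distance
    # over the pooled sample, taken over distinct pairs only.
    z = np.concatenate([x, y])
    dists = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))
    return np.median(dists[np.triu_indices(len(z), k=1)])
```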


MMD is included in the ITE toolbox (estimators).


The GeomLoss library provides efficient GPU implementations of several distances between sampled measures, including kernel (MMD) norms, Hausdorff divergences and debiased Sinkhorn (optimal-transport) divergences.

It is hosted on GitHub and distributed under the permissive MIT license.

GeomLoss functions are available through the custom PyTorch layers SamplesLoss, ImagesLoss and VolumesLoss, which allow you to work with weighted point clouds (of any dimension), density maps and volumetric segmentation masks.


Arbel, Michael, Anna Korba, Adil Salim, and Arthur Gretton. 2019. “Maximum Mean Discrepancy Gradient Flow.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 32:6484–94. Red Hook, NY, USA: Curran Associates Inc.
Arras, Benjamin, Ehsan Azmoodeh, Guillaume Poly, and Yvik Swan. 2017. “A Bound on the 2-Wasserstein Distance Between Linear Combinations of Independent Random Variables.” arXiv:1704.01376 [Math], April.
Blanchet, Jose, Lin Chen, and Xun Yu Zhou. 2018. “Distributionally Robust Mean-Variance Portfolio Selection with Wasserstein Distances.” arXiv:1802.04885 [Stat], February.
Dellaporta, Charita, Jeremias Knoblauch, Theodoros Damoulas, and François-Xavier Briol. 2022. “Robust Bayesian Inference for Simulator-Based Models via the MMD Posterior Bootstrap.” arXiv:2202.04744 [Cs, Stat], February.
Feydy, Jean, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. 2019. “Interpolating Between Optimal Transport and MMD Using Sinkhorn Divergences.” In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 2681–90. PMLR.
Gretton, Arthur, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. “A Kernel Two-Sample Test.” The Journal of Machine Learning Research 13 (1): 723–73.
Gretton, Arthur, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J. Smola. 2008. “A Kernel Statistical Test of Independence.” In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference. Cambridge, MA: MIT Press.
Gretton, Arthur, Bharath Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, and Kenji Fukumizu. 2012. “Optimal Kernel Choice for Large-Scale Two-Sample Tests.” In Proceedings of the 25th International Conference on Neural Information Processing Systems, 1205–13. NIPS’12. Red Hook, NY, USA: Curran Associates Inc.
Hamzi, Boumediene, and Houman Owhadi. 2021. “Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part I: Parametric Kernel Flows.” Physica D: Nonlinear Phenomena 421 (July): 132817.
Husain, Hisham. 2020. “Distributional Robustness with IPMs and Links to Regularization and GANs.” arXiv:2006.04349 [Cs, Stat], June.
Huszár, Ferenc, and David Duvenaud. 2016. “Optimally-Weighted Herding Is Bayesian Quadrature.” arXiv.
Jitkrittum, Wittawat, Wenkai Xu, Zoltan Szabo, Kenji Fukumizu, and Arthur Gretton. 2017. “A Linear-Time Kernel Goodness-of-Fit Test.” In Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc.
Long, Mingsheng, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. “Learning Transferable Features with Deep Adaptation Networks.” In Proceedings of the 32nd International Conference on Machine Learning, 97–105. PMLR.
Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, and Bernhard Schölkopf. 2014. “Kernel Mean Shrinkage Estimators.” arXiv:1405.5505 [Cs, Stat], May.
Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. 2017. “Kernel Mean Embedding of Distributions: A Review and Beyond.” Foundations and Trends® in Machine Learning 10 (1-2): 1–141.
Nishiyama, Yu, and Kenji Fukumizu. 2016. “Characteristic Kernels and Infinitely Divisible Distributions.” The Journal of Machine Learning Research 17 (1): 6240–67.
Pfister, Niklas, Peter Bühlmann, Bernhard Schölkopf, and Jonas Peters. 2018. “Kernel-Based Tests for Joint Independence.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (1): 5–31.
Rustamov, Raif M. 2021. “Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.” Stat 10 (1): e329.
Scetbon, Meyer, and Gael Varoquaux. 2019. “Comparing Distributions: \(\ell_1\) Geometry Improves Kernel Two-Sample Testing.” In Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, 12306–16. Curran Associates, Inc.
Schölkopf, Bernhard, Krikamol Muandet, Kenji Fukumizu, and Jonas Peters. 2015. “Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations.” arXiv:1501.06794 [Cs, Stat], January.
Sejdinovic, Dino, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. 2012. “Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing.” The Annals of Statistics 41 (5): 2263–91.
Smola, Alex, Arthur Gretton, Le Song, and Bernhard Schölkopf. 2007. “A Hilbert Space Embedding for Distributions.” In Algorithmic Learning Theory, edited by Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, 13–31. Lecture Notes in Computer Science 4754. Springer Berlin Heidelberg.
Song, Le, Kenji Fukumizu, and Arthur Gretton. 2013. “Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models.” IEEE Signal Processing Magazine 30 (4): 98–111.
Song, Le, Jonathan Huang, Alex Smola, and Kenji Fukumizu. 2009. “Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems.” In Proceedings of the 26th Annual International Conference on Machine Learning, 961–68. ICML ’09. New York, NY, USA: ACM.
Sriperumbudur, B. K., A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. 2008. “Injective Hilbert Space Embeddings of Probability Measures.” In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
Sriperumbudur, Bharath K., Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. 2012. “On the Empirical Estimation of Integral Probability Metrics.” Electronic Journal of Statistics 6: 1550–99.
Sriperumbudur, Bharath K., Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. 2010. “Hilbert Space Embeddings and Metrics on Probability Measures.” Journal of Machine Learning Research 11 (April): 1517–61.
Strobl, Eric V., Kun Zhang, and Shyam Visweswaran. 2017. “Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery.” arXiv:1702.03877 [Stat], February.
Sutherland, Danica J., Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. 2021. “Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy.” arXiv.
Szabo, Zoltan, and Bharath K. Sriperumbudur. 2017. “Characteristic and Universal Tensor Product Kernels.” arXiv:1708.08157 [Cs, Math, Stat], August.
Tolstikhin, Ilya O., Bharath K. Sriperumbudur, and Bernhard Schölkopf. 2016. “Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1930–38. Curran Associates, Inc.
Zhang, Kun, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2012. “Kernel-Based Conditional Independence Test and Application in Causal Discovery.” arXiv:1202.3775 [Cs, Stat], February.
Zhang, Qinyi, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. 2016. “Large-Scale Kernel Methods for Independence Testing.” arXiv:1606.07892 [Stat], June.
