# Hypothesis tests, statistical

August 23, 2014 — July 18, 2023

algebra
Bayes
functional analysis
Hilbert space
how do science
linear algebra
machine learning
model selection
nonparametric
statistics
uncertainty

Informally, statistical tests play two essential roles:

1. To confirm the existence of patterns in data that are too faint to see
2. To discourage us from accepting patterns that seem obvious to our monkey minds, but which are not supported by the data

When we hope to get good data from our tests we take on two important responsibilities:

1. invoking some mathematical machinery to make all this precise and quantifiable and as objective as possible given certain assumptions
2. making promises to use tests in a particular way which matches the assumptions of the tests

Usually the wheels fall off at #2.

## 1 Teaching

When do we want to teach testing and why?

Daniel Lakens is running an A/B test on teaching A/B tests.

What kind of hypothesis testing do we want? I guess we need to design some kind of tests so we can get real numbers.

In general we are worried about the abuse of this particular tool in experimental practice, which is notoriously fraught. How many degrees of freedom do you give yourself by accident with bad data hygiene?
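To make the accidental-degrees-of-freedom worry concrete, here is a minimal simulation sketch. The setup is my own assumption, not taken from any of the linked sources: two groups with no true difference, unit-variance Gaussian noise, a plain z-test, and an analyst who measures several outcomes and reports whichever looks best. All function names are hypothetical.

```python
import random
from statistics import NormalDist

def two_sided_p(sample_a, sample_b):
    """Two-sided z-test p-value for a difference in means, assuming unit variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / (1 / n_a + 1 / n_b) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_outcomes, n_experiments=2000, n_per_group=30,
                        alpha=0.05, seed=0):
    """Share of null experiments where the *best* of n_outcomes p-values beats alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: the null is true.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(a, b))
        if min(p_values) < alpha:
            hits += 1
    return hits / n_experiments

honest = false_positive_rate(n_outcomes=1)
forked = false_positive_rate(n_outcomes=10)
print(f"one outcome chosen in advance: {honest:.3f}")
print(f"best of ten outcomes:          {forked:.3f}")
```

With independent outcomes the second rate should land near 1 − 0.95^10 ≈ 0.40 rather than the nominal 0.05, which is the whole problem in one number.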

• Cassie Kozyrkov frames testing in a decision-theoretic context which I am increasingly convinced is the only sane one.

As for which tests to teach… Wilcoxon Mann-Whitney and Kruskal-Wallis tests are neat. Are they simpler than t-testing?
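One argument that the Wilcoxon-Mann-Whitney test might be the simpler thing to teach is that its statistic is just pair counting. A rough sketch follows, with made-up data and a normal approximation that ignores tie and continuity corrections; the function names are mine and this is not how any particular library implements it.

```python
from statistics import NormalDist

def mann_whitney_u(xs, ys):
    """Count pairs (x, y) with x > y, scoring ties as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def mann_whitney_p(xs, ys):
    """Two-sided p-value via the large-sample normal approximation (no corrections)."""
    n, m = len(xs), len(ys)
    u = mann_whitney_u(xs, ys)
    mu = n * m / 2                              # mean of U under the null
    sigma = (n * m * (n + m + 1) / 12) ** 0.5   # sd of U under the null, no ties
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up illustrative measurements.
treated = [7.1, 8.3, 9.0, 6.8, 8.8, 9.4]
control = [5.2, 6.1, 7.0, 5.9, 6.5, 6.0]

u = mann_whitney_u(treated, control)
print(f"U = {u} out of {len(treated) * len(control)} pairs")
print(f"approximate p = {mann_whitney_p(treated, control):.4f}")
```

U divided by the number of pairs is an estimate of the probability that a random treated unit beats a random control unit, which is arguably easier to explain than a t statistic. For samples this small an exact or permutation p-value would be preferable to the normal approximation.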

## 2 As decision tool

It is possibly the least sexy method in statistics and, as such, is usually taught by the least interesting professor in the department, or at least one who couldn’t find an interesting enough excuse to get out of it, which is a strong correlate. Said professor will then teach it to you as if you were in turn the least interesting student in the school, and they will teach it as a mathematical object without connecting it to the process of decision making.

In the classical framing, you think about designing and running experiments and deciding if the data are strong enough to influence your actions. In many statistical courses the tests are taught in spooky isolation from actions (certainly they were in my hypothesis testing class, although I am indebted to Sara van de Geer for correcting that).

For some introductions to the purpose of statistical tests which do not forget to consider them in light of the actions they will inform, see the essays by Cassie Kozyrkov, who explains them better than I will.

Quote which gives a concrete example of the decision theory/action context:

If you’re interested in analytics (and not statistics), p-values can be a useful way to summarize your data and iterate on your search. Please don’t interpret them as a statistician would. They don’t mean anything except there’s a pattern in these data. Statisticians and analysts may come to blows if they don’t realize that analytics is about what’s in the data (only!) while statistics is about what’s beyond the data.

## 3 Mechanics of null-hypothesis tests

There are many different hypothesis tests and framings for them. Let us consider classical null tests for now, since they are very common.

There are many elaborations of this approach in the modern world. For example, we examine large numbers of hypotheses at once under multiple testing. Testing can be considered as part of the model selection question, or maybe even made particularly nifty using sparse model selection. Probably the most interesting family of tests are tests of conditional independence, especially multiple versions of those.
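For a feel of what examining many hypotheses at once involves, here is a small sketch of two standard corrections, Bonferroni and Benjamini-Hochberg. I am picking these two myself as the textbook examples; the p-values are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Step-up procedure controlling the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value sits under the BH line q * rank / m.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

# Invented p-values from ten hypothetical tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("uncorrected rejections:", sum(p <= 0.05 for p in p_values))
print("Bonferroni rejections: ", sum(bonferroni(p_values)))
print("BH rejections:         ", sum(benjamini_hochberg(p_values)))
```

The design difference is in what gets controlled: Bonferroni bounds the chance of any false rejection, while Benjamini-Hochberg bounds the expected share of false rejections among those made, so it is typically less conservative.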

tl;dr classic statistical tests are linear models where your goal is to decide if a coefficient should be regarded as non-zero or not. Jonas Kristoffer Lindeløv explains this perspective: Common statistical tests are linear models. FWIW I found that perspective to be a real 💡 moment. Alternate/generalized (?) take: Most statistical tests are canonical correlation analysis.
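The simplest instance of the tests-are-linear-models claim can be checked by hand: Student's equal-variance two-sample t statistic is the t statistic of the slope when you regress the outcome on a 0/1 group indicator. A minimal sketch, with made-up data and hypothetical function names:

```python
def pooled_t(a, b):
    """Student's two-sample t statistic with pooled (equal) variance."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = sum(a) / n_a, sum(b) / n_b
    ss = sum((x - m_a) ** 2 for x in a) + sum((x - m_b) ** 2 for x in b)
    s2 = ss / (n_a + n_b - 2)
    return (m_b - m_a) / (s2 * (1 / n_a + 1 / n_b)) ** 0.5

def ols_slope_and_t(x, y):
    """Slope of simple least-squares regression y ~ 1 + x, and its t statistic."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    return slope, slope / std_err

# Made-up illustrative data for two groups.
group_a = [4.1, 5.0, 5.6, 4.8, 5.2]
group_b = [5.9, 6.3, 5.5, 6.8, 6.1, 6.4]

# Stack the outcomes and code group membership as a dummy regressor.
x = [0.0] * len(group_a) + [1.0] * len(group_b)
y = group_a + group_b

slope, t_regression = ols_slope_and_t(x, y)
t_classic = pooled_t(group_a, group_b)
print(f"difference in means = slope = {slope:.4f}")
print(f"t (two-sample) = {t_classic:.6f}, t (regression slope) = {t_regression:.6f}")
```

The two t statistics agree up to floating-point error, and the slope is the difference in group means. Testing whether the groups differ is testing whether that coefficient is non-zero.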

Daniel Lakens asks Do You Really Want to Test a Hypothesis?

Is that all too measured? Want more invective? See Everything Wrong with P-Values Under One Roof. AFAICT this is more about cargo-cult usage of P-Values.

Lucile Lu, Robert Chang and Dmitriy Ryaboy of Twitter have a practical guide to risky testing at scale: Power, minimal detectable effect, and bucket size estimation in A/B tests.
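To give a flavour of the quantities involved, here is a sketch of the textbook normal-approximation relation between bucket size, power and minimal detectable effect for a conversion rate. This is the standard formula n ≈ 2σ²(z₁₋α/₂ + z_power)²/δ² with σ² = p(1 − p). I am not claiming it is the exact recipe in the Twitter guide, and the function names and example numbers are my own.

```python
from math import ceil
from statistics import NormalDist

def bucket_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Users needed per bucket to detect an absolute lift `mde` (two-sided test).

    Assumes both buckets have roughly the baseline variance p * (1 - p).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    return ceil(2 * variance * (z_alpha + z_power) ** 2 / mde ** 2)

def minimal_detectable_effect(baseline_rate, n_per_bucket, alpha=0.05, power=0.8):
    """Smallest absolute lift detectable with the given bucket size (same assumptions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    return (z_alpha + z_power) * (2 * variance / n_per_bucket) ** 0.5

# Hypothetical scenario: 10% baseline conversion, want to detect a 1 point lift.
n = bucket_size(baseline_rate=0.10, mde=0.01)
print(f"users per bucket: {n}")
print(f"MDE at that size: {minimal_detectable_effect(0.10, n):.4f}")
print(f"MDE with a quarter of the users: {minimal_detectable_effect(0.10, n // 4):.4f}")
```

The inverse-square relationship is the practical point: halving the effect you want to detect quadruples the bucket size you need.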

Bob Sturm recommends Bailey (2008) for discussion of hypothesis testing in terms of linear subspaces.

(side note: the proportional odds model generalises K-W/WMW. Huh.)

## 4 Bayesian

Everything so far has been in a frequentist framing. The entire question of hypothesis testing is more likely to be vacuous in Bayesian settings (although Bayes model selection is a thing). See also Thomas Lumley on a Bayesian t-test which ends up being a kind of bootstrap in an interesting way. Also, for something actionable, see Yanir Seroussi on Making Bayesian A/B testing more accessible, and Will Kurt (Count Bayesie), Bayesian A/B Testing: A Hypothesis Test that Makes Sense.
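For orientation, the most common toy version of Bayesian A/B testing is a Beta-Binomial model: put a Beta prior on each conversion rate, update with the observed counts, and ask how probable it is that B beats A. The sketch below assumes uniform Beta(1, 1) priors and invented counts; it is not a reproduction of either linked post.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

# Invented data: 120/1000 conversions for A, 150/1000 for B.
p_better = prob_b_beats_a(120, 1000, 150, 1000)
p_tied = prob_b_beats_a(100, 1000, 100, 1000)
print(f"P(B > A | data)           = {p_better:.3f}")
print(f"P(B > A | identical data) = {p_tied:.3f}")
```

The output is a posterior probability about the thing you care about, which plugs directly into a decision (for example, weighting it by the cost of shipping the wrong variant), rather than a statement about data under a null.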

## 5 Tooling

I cannot decide if tea-lang is a passive-aggressive joke or not. It is a compiler for statistical tests.

Tea is a domain specific programming language that automates statistical test selection and execution… Users provide 5 pieces of information:

• the dataset of interest,
• the variables in the dataset they want to analyze,
• the study design (e.g., independent, dependent variables),
• the assumptions they make about the data based on domain knowledge (e.g., a variable is normally distributed), and
• a hypothesis.

Tea then “compiles” these into logical constraints to select valid statistical tests. Tests are considered valid if and only if all the assumptions they make about the data (e.g., normal distribution, equal variance between groups, etc.) hold. Tea then finally executes the valid tests.

But in general, this is all baked into R.

## 6 Goodness-of-fit tests

Also a useful thing to have; the hypothesis here is kind-of more interesting, along the lines of it-is-unlikely-that-the-model-you-propose-contains-this-data. Possibly the same as…
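As a minimal illustration, here is Pearson's chi-squared goodness-of-fit statistic for categorical counts, with the p-value obtained by simulating from the proposed model rather than from the chi-squared table (a choice I made to keep the sketch dependency-free). The die-roll counts are invented.

```python
import random

def pearson_chi2(observed, probs):
    """Sum of (observed - expected)^2 / expected over categories."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

def monte_carlo_p(observed, probs, n_sims=5000, seed=0):
    """Share of datasets simulated from the model that fit at least as badly."""
    rng = random.Random(seed)
    n = sum(observed)
    stat = pearson_chi2(observed, probs)
    categories = range(len(probs))
    exceed = 0
    for _ in range(n_sims):
        counts = [0] * len(probs)
        for k in rng.choices(categories, weights=probs, k=n):
            counts[k] += 1
        # Small tolerance so float rounding does not break exact ties.
        if pearson_chi2(counts, probs) >= stat - 1e-9:
            exceed += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (exceed + 1) / (n_sims + 1)

fair_die = [1 / 6] * 6
plausible_rolls = [8, 9, 11, 10, 7, 15]   # invented counts from 60 rolls
suspicious_rolls = [5, 6, 8, 9, 7, 25]    # invented counts from 60 rolls

for name, rolls in [("plausible", plausible_rolls), ("suspicious", suspicious_rolls)]:
    stat = pearson_chi2(rolls, fair_die)
    print(f"{name}: chi2 = {stat:.1f}, p ~ {monte_carlo_p(rolls, fair_die):.4f}")
```

Note the asymmetry in what the test can tell you: a small p-value says the proposed model would rarely produce data like this, while a large one only says the data do not rule the model out.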

## 7 Distribution testing

…the challenge of big data is that the sizes of the domains of the distributions are immense, resulting in a very large number of samples. Thus, we are left with an unacceptably slow algorithm. The good news is that there has been exciting progress in the development of sublinear, sample algorithmic tools for such problems. In this article we describe two recent results that highlight the main ideas contributing to this progress: The first on testing the similarity of distributions, and the second on estimating the entropy of a distribution. We assume that all of our probability distributions are over a finite domain D of size n, but (unless otherwise noted) we do not assume anything else about the distribution.
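To see why far fewer samples than the domain size can be enough, consider the classic collision idea for testing uniformity: the probability that two independent draws coincide is Σ pᵢ², which equals 1/n for the uniform distribution and is strictly larger for anything else. The sketch below is my own toy illustration, not the algorithm from the quoted article; the skewed distribution and the sample sizes are arbitrary choices.

```python
import random

def collision_rate(samples):
    """Fraction of unordered sample pairs that landed on the same element."""
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    m = len(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (m * (m - 1) // 2)

def sample_uniform(rng, n, m):
    """m draws from the uniform distribution on {0, ..., n - 1}."""
    return [rng.randrange(n) for _ in range(m)]

def sample_skewed(rng, n, m):
    """m draws where half of the mass sits on the first tenth of the domain."""
    heavy = n // 10
    return [rng.randrange(heavy) if rng.random() < 0.5 else rng.randrange(heavy, n)
            for _ in range(m)]

n, m = 10_000, 2_000          # domain size, and a sample much smaller than it
rng = random.Random(0)

# Collision rate relative to the uniform baseline 1 / n.
uniform_ratio = collision_rate(sample_uniform(rng, n, m)) * n
skewed_ratio = collision_rate(sample_skewed(rng, n, m)) * n
print(f"uniform source: collision rate is {uniform_ratio:.2f} x (1/n)")
print(f"skewed source:  collision rate is {skewed_ratio:.2f} x (1/n)")
```

With only a fifth as many samples as domain elements, most elements are never seen, so a histogram is useless, yet the collision count still separates the two sources.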

To quote and paraphrase the first chapter:

This survey [is meant] as an introduction and detailed overview of some topics in distribution testing, an area of theoretical computer science which falls under the general umbrella of property testing, and sits at the intersection of computational learning, statistical learning and hypothesis testing, information theory, and (depending on whom one asks) the theory of machine learning.

There are several other resources you may want to read about this topic, starting with this short introductory survey by or this other survey by, well, myself. This book differs from the previous ones in that it is (1) more recent, (2) more specific, focusing on a subset of questions and using them as guiding examples, instead of depicting as broad a landscape as possible (but from afar), (3) more detailed, including proofs and derivations, and (4) written with in mind the objective of putting the theoretical computer science, statistics, and information theory viewpoints together. Of course, I cannot promise I succeeded; but that was the intent, and you’ll be the judge of the result.

## 9 References

Bailey. 2008. Design of Comparative Experiments. Cambridge series on statistical and probabilistic mathematics.
Briggs. 2019. In Beyond Traditional Probabilistic Methods in Economics. Studies in Computational Intelligence.
———. 2022. Foundations and Trends® in Communications and Information Theory.
Colbourn, and Dinitz. 2010. Handbook of Combinatorial Designs, Second Edition.
Efron. 2008. The Annals of Applied Statistics.
Gelman. 2018. Personality and Social Psychology Bulletin.
Good, and Good. 1999. Resampling Methods: A Practical Guide to Data Analysis.
Greenland. 1995a. Epidemiology.
———. 1995b. Epidemiology.
Knapp. 1978. Psychological Bulletin.
Kohavi, Longbotham, Sommerfield, et al. 2009. Data Mining and Knowledge Discovery.
Korattikara, Chen, and Welling. 2015. Neural Computation.
Kreinovich, Thach, Trung, et al., eds. 2019. Beyond Traditional Probabilistic Methods in Economics. Studies in Computational Intelligence.
Lavergne, Maistre, and Patilea. 2015. Electronic Journal of Statistics.
Lehmann, and Romano. 2010. Testing statistical hypotheses. Springer texts in statistics.
Lumley, Diehr, Emerson, et al. 2002. Annual Review of Public Health.
Maesono, Moriyama, and Lu. 2016. arXiv:1610.02145 [Math, Stat].
Malevergne, and Sornette. 2003. Quantitative Finance.
McShane, Gal, Gelman, et al. 2019. The American Statistician.
Msaouel. 2022. Cancer Investigation.
Na. 2009. Journal of the Korean Statistical Society.
Ormerod, Stewart, Yu, et al. 2017. arXiv:1710.09146 [Math, Stat].
Paparoditis, and Sapatinas. 2014. arXiv:1409.4317 [Math, Stat].
Rubinfeld. 2012. XRDS: Crossroads, The ACM Magazine for Students.
Sejdinovic, Sriperumbudur, Gretton, et al. 2012. The Annals of Statistics.
Tang, Athreya, Sussman, et al. 2014. arXiv:1409.2344 [Math, Stat].
van de Geer. 2016. Estimation and Testing Under Sparsity. Lecture Notes in Mathematics.