## As decision tool

It is possibly the least sexy method in statistics and as such, usually taught by the least interesting professor in the department, or at least one who couldn’t find an interesting enough excuse to get out of it, which is a strong correlate. Said professor will then teach it to you as if you were in turn the least interesting student in the school, and they will teach it as a mathematical object without connecting it to the process of decision making.

In the classical framing, you think about designing and running experiments and deciding if the data is strong enough to influence your actions. In many statistics courses the tests are taught in spooky isolation from actions (certainly they were in my hypothesis testing class, although I am indebted to Sara van de Geer for correcting that).

For some introductions to the purpose of statistical tests that do not forget to consider them in light of the actions they will inform, see the following essays by Cassie Kozyrkov, who explains this better than I will.

- Never start with a hypothesis.
- A trick question for data science buffs
- Why are p-values like needles? It’s dangerous to share them!

Quote which gives a concrete example of the decision theory/action context:

If you're interested in analytics (and not statistics), p-values can be a useful way to summarize your data and iterate on your search. Please don’t interpret them as a statistician would. They don’t mean anything except there’s a pattern in these data. Statisticians and analysts may come to blows if they don’t realize that analytics is about what’s in the data (only!) while statistics is about what’s beyond the data.

## Mechanics of null-hypothesis tests

There are many different hypothesis tests and framings for them. Let us consider classical null tests for now, since they are very common.

There are many elaborations of this approach in the modern world. For example, we examine large numbers of hypotheses at once under multiple testing. Testing can be considered part of a model selection question, or maybe even made particularly nifty using sparse model selection. Probably the most interesting family of tests is tests of conditional independence, especially multiple versions of those.

**tl;dr** classic statistical tests are linear models where your goal is to decide if a coefficient should be regarded as non-zero or not.
Jonas Kristoffer Lindeløv explains this perspective:
Common statistical tests are linear models.
FWIW I found that perspective to be a real 💡 moment.
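
To make that perspective concrete, here is a minimal sketch of my own (not taken from Lindeløv's write-up) in plain Python. It assumes the equal-variance two-sample *t*-test and simulated Gaussian data; the function names and the data are hypothetical. It computes the classical *t* statistic, then the *t* statistic for the slope of a regression on a 0/1 group indicator, and the two should agree up to floating-point error.

```python
import math
import random


def pooled_t_statistic(a, b):
    """Equal-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def ols_slope_t_statistic(x, y):
    """t statistic for b1 in the regression y = b0 + b1 * x + noise."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    residual_var = rss / (n - 2)  # two fitted coefficients
    return slope / math.sqrt(residual_var / sxx)


# Simulated data (an assumption for illustration only).
rng = random.Random(0)
group_a = [rng.gauss(0.0, 1.0) for _ in range(30)]
group_b = [rng.gauss(0.5, 1.0) for _ in range(40)]

# Encode group membership as a 0/1 regressor.
x = [0.0] * len(group_a) + [1.0] * len(group_b)
y = group_a + group_b

t_classic = pooled_t_statistic(group_a, group_b)
t_linear = ols_slope_t_statistic(x, y)
print(t_classic, t_linear)
```

The agreement is algebraic rather than a coincidence of the seed: with a 0/1 regressor the fitted slope is the difference in group means, and the residual variance is the pooled variance.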

Daniel Lakens asks Do You Really Want to Test a Hypothesis?:

The lecture “Do You Really Want to Test a Hypothesis?” aims to explain which question a hypothesis test asks, and discusses when a hypothesis test answers a question you are interested in. It is very easy to say what not to do, or to point out what is wrong with statistical tools. Statistical tools are very limited, even under ideal circumstances. It’s more difficult to say what you can do. If you follow my work, you know that this latter question is what I spend my time on. Instead of telling you optional stopping can’t be done because it is p-hacking, I explain how you can do it correctly through sequential analysis. Instead of telling you it is wrong to conclude the absence of an effect from p > 0.05, I explain how to use equivalence testing. Instead of telling you p-values are the devil, I explain how they answer a question you might be interested in when used well. Instead of saying preregistration is redundant, I explain from which philosophy of science preregistration has value. And instead of saying we should abandon hypothesis tests, I try to explain in this video how to use them wisely. This is all part of my ongoing #JustifyEverything educational tour. I think it is a reasonable expectation that researchers should be able to answer at least a simple ‘why’ question if you ask why they use a specific tool, or use a tool in a specific manner.

Is that all too measured? Want more invective?
See *Everything Wrong with P-Values Under One Roof* (Briggs 2019).
AFAICT this is more about conventional usage of P-Values.

Lucile Lu, Robert Chang and Dmitriy Ryaboy of Twitter have a practical guide to risky testing at scale: Power, minimal detectable effect, and bucket size estimation in A/B tests

Bob Sturm recommends Bailey (2008) for a discussion of hypothesis testing in terms of linear subspaces.

(side note: the proportional odds model generalises K-W/WMW. Huh.)

- Multiplicitous – Put A Number On It!
- Experiment power calculator tells you how many data points you need to have and thus whether it is likely you can (dis)prove the thing with the budget you have.
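
For a feel of what such a power calculator does, here is a rough sketch using the textbook normal approximation for a two-sided, two-sample comparison of means with equal group sizes and known standard deviation. This is my own illustration, not the method of either linked tool; the function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect a mean difference `effect`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


def minimal_detectable_effect(n_per_group, sd, alpha=0.05, power=0.8):
    """Smallest mean difference detectable with the given per-group budget."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_total * sd * math.sqrt(2 / n_per_group)


# Hypothetical scenario: detect a shift of half a standard deviation.
n_needed = samples_per_group(effect=0.5, sd=1.0)
mde = minimal_detectable_effect(n_needed, sd=1.0)
print(n_needed, mde)
```

The two functions are inverses of each other up to rounding, which is the useful part in practice: fix the budget and read off the smallest effect you could plausibly detect, or fix the effect and read off the budget. An exact *t*-based calculation would ask for slightly more data in small samples.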

## Bayesian

Everything so far has been in a frequentist framing.
The entire question of hypothesis testing is more likely to be vacuous in Bayesian settings (although Bayes model selection is a thing).
See also Thomas Lumley on a Bayesian *t*-test which ends up being a kind of bootstrap in an interesting way.
For something more actionable, see Yanir Seroussi on Making Bayesian A/B testing more accessible.
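
As a toy illustration of the Bayesian A/B flavour, here is a minimal Beta-Binomial sketch. It is not Seroussi's method: it assumes independent uniform Beta(1, 1) priors on the two conversion rates and estimates the posterior probability that variant B beats variant A by Monte Carlo. The function name and the conversion counts are invented for the example.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Posterior of each rate is Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / n_draws


# Hypothetical data: 100/1000 conversions for A, 150/1000 for B.
p_better = prob_b_beats_a(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(p_better)
```

Note that the output is a posterior probability about the quantity of interest, not a *p*-value, which is part of why the testing question can dissolve in this setting: you can plug that probability straight into a decision rule.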

## Tooling

I cannot decide if tea-lang is a passive-aggressive joke or not. It is a compiler for statistical tests.

Tea is a domain specific programming language that automates statistical test selection and execution… Users provide 5 pieces of information:

- the dataset of interest,
- the variables in the dataset they want to analyze,
- the study design (e.g., independent, dependent variables),
- the assumptions they make about the data based on domain knowledge (e.g., a variable is normally distributed), and
- a hypothesis.

Tea then “compiles” these into logical constraints to select valid statistical tests. Tests are considered valid if and only if all the assumptions they make about the data (e.g., normal distribution, equal variance between groups, etc.) hold. Tea then finally executes the valid tests.

But in general, this is all baked into R.

## Goodness-of-fit tests

Also a useful thing to have; the hypothesis here is kind-of more interesting, along the lines of it-is-unlikely-that-the-model-you-propose-contains-this-data.
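
A minimal sketch of one such test, assuming categorical data and a fully specified model: compute Pearson's chi-squared statistic, then get a *p*-value by simulating datasets from the proposed model instead of looking up the chi-squared distribution. The function names and the die-roll counts are hypothetical.

```python
import random


def pearson_statistic(counts, probs):
    """Pearson chi-squared discrepancy between observed and expected counts."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


def monte_carlo_gof_pvalue(counts, probs, n_sim=2000, seed=0):
    """Share of datasets simulated from the model that fit at least as badly."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = pearson_statistic(counts, probs)
    categories = range(len(probs))
    at_least_as_extreme = 0
    for _ in range(n_sim):
        sim_counts = [0] * len(probs)
        for category in rng.choices(categories, weights=probs, k=n):
            sim_counts[category] += 1
        if pearson_statistic(sim_counts, probs) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_sim + 1)


fair_die = [1 / 6] * 6
plausible_rolls = [18, 22, 19, 21, 20, 20]  # made-up counts from 120 rolls
suspicious_rolls = [5, 8, 9, 8, 10, 80]     # made-up counts from 120 rolls

p_plausible = monte_carlo_gof_pvalue(plausible_rolls, fair_die)
p_suspicious = monte_carlo_gof_pvalue(suspicious_rolls, fair_die)
print(p_plausible, p_suspicious)
```

A small *p*-value here says the proposed model would rarely produce data this far from its expectations; a large one says only that this particular discrepancy measure found nothing to complain about.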

## References

*Design of Comparative Experiments*. 1 edition. Cambridge series on statistical and probabilistic mathematics. Cambridge; New York: Cambridge University Press.

*Beyond Traditional Probabilistic Methods in Economics*, edited by Vladik Kreinovich, Nguyen Ngoc Thach, Nguyen Duc Trung, and Dang Van Thanh. Vol. 809. Studies in Computational Intelligence. Cham: Springer International Publishing.

*Handbook of Combinatorial Designs, Second Edition*. CRC Press.

*The Annals of Applied Statistics* 2 (1): 197–223.

*Estimation and Testing Under Sparsity*. Vol. 2159. Lecture Notes in Mathematics. Cham: Springer International Publishing.

*Resampling Methods: A Practical Guide to Data Analysis*. Birkhäuser Basel.

*Epidemiology* 6 (4): 356–65.

*Epidemiology* 6 (5): 563–65.

*Data Mining and Knowledge Discovery* 18 (1): 140–81.

*Neural Computation* 28 (1): 45–70.

*Beyond Traditional Probabilistic Methods in Economics*. Vol. 809. Studies in Computational Intelligence. Cham: Springer International Publishing.

*Electronic Journal of Statistics* 9: 643–78.

*Testing statistical hypotheses*. 3. ed. Springer texts in statistics. New York, NY: Springer.

*Annual Review of Public Health* 23 (1): 151–69.

*arXiv:1610.02145 [Math, Stat]*, October.

*Quantitative Finance* 3 (4): 231–50.

*The American Statistician* 73 (sup1): 235–45.

*Journal of the Korean Statistical Society* 38 (3): 287–95.

*arXiv:1710.09146 [Math, Stat]*, October.

*arXiv:1409.4317 [Math, Stat]*, September.

*The Annals of Statistics* 41 (5): 2263–91.

*arXiv:1409.2344 [Math, Stat]*, September.
