Tests, statistical

Maybe also design of experiments while we are here?

The mathematics of the last century's worth of experiment design. This is about the classical framing, where you think about designing and running experiments and deciding if a thing is true or not, then go home. There are many elaborations and complications to this approach in the modern world. For example, we examine large numbers of hypotheses at once under multiple testing. It can be considered as part of the model selection question, or maybe even made particularly nifty using sparse model selection. Probably the most interesting family of tests are tests of conditional independence.

But first, let us examine the hoary old history of the small-scale version of this. Probably the least sexy thing in statistics and as such, usually taught by the least interesting professor in the department, or at least one who couldn’t find an interesting enough excuse to get out of it, which is a fair indication. Said professor will then teach it to you as if you were in turn the least interesting student in the school, and so it goes on.

Anyhow, it turns out that this idea can be an elegant and powerful tool if you can move past block- and combinatorial-design stamp collecting. Which few classes do, because stamp collecting is the easiest way to fill in those long lecture hours. Even if they do move beyond that, they get caught up in a loooong discussion about what P-values mean and whether they prove things.

tl;dr classic statistical tests are linear regressions where your goal is to decide if a coefficient should be regarded as non-zero or not. Jonas Kristoffer Lindeløv explains this perspective: Common statistical tests are linear models.
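To make the tests-are-regressions claim concrete, here is a minimal sketch of my own (not taken from Lindeløv's page) using only the Python standard library. It assumes the equal-variance Student two-sample t-test and simulated Gaussian data; the function names and the group sizes are made up for illustration. The t statistic for the difference in group means should coincide with the t statistic for the slope when we regress the outcome on a 0/1 group indicator.

```python
import math
import random
import statistics


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb)
    )


def slope_t_statistic(x, y):
    """t statistic for the slope b1 in the least-squares fit y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares, then the usual standard error of the slope.
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1 / se_b1


# Hypothetical simulated data: two groups with a shifted mean.
rng = random.Random(0)
control = [rng.gauss(0.0, 1.0) for _ in range(30)]
treated = [rng.gauss(0.5, 1.0) for _ in range(40)]

# Dummy-code group membership: 0 = control, 1 = treated.
x = [0.0] * len(control) + [1.0] * len(treated)
y = control + treated

t_classic = pooled_t_statistic(control, treated)
t_regression = slope_t_statistic(x, y)
print(t_classic, t_regression)  # the two agree up to floating-point error
```

The agreement is exact algebra, not a coincidence of this sample: with a 0/1 regressor the fitted slope is the difference in group means and the residual variance is the pooled variance. Welch's unequal-variance test does not reduce to this plain regression.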

Daniel Lakens asks: Do You Really Want to Test a Hypothesis?

The lecture “Do You Really Want to Test a Hypothesis?” aims to explain which question a hypothesis tests asks, and discusses when a hypothesis tests answers a question you are interested in. It is very easy to say what not to do, or to point out what is wrong with statistical tools. Statistical tools are very limited, even under ideal circumstances. It’s more difficult to say what you can do. If you follow my work, you know that this latter question is what I spend my time on. Instead of telling you optional stopping can’t be done because it is p-hacking, I explain how you can do it correctly through sequential analysis. Instead of telling you it is wrong to conclude the absence of an effect from p > 0.05, I explain how to use equivalence testing. Instead of telling you p-values are the devil, I explain how they answer a question you might be interested in when used well. Instead of saying preregistration is redundant, I explain from which philosophy of science preregistration has value. And instead of saying we should abandon hypothesis tests, I try to explain in this video how to use them wisely. This is all part of my ongoing #JustifyEverything educational tour. I think it is a reasonable expectation that researchers should be able to answer at least a simple ‘why’ question if you ask why they use a specific tool, or use a tool in a specific manner.

Lucile Lu, Robert Chang and Dmitriy Ryaboy of Twitter have a practical guide to risky testing at scale: Power, minimal detectable effect, and bucket size estimation in A/B tests
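For orientation, the textbook relationship between power, minimal detectable effect and bucket size can be sketched in a few lines. This is not the method from the Twitter post; it is the standard normal-approximation formula for a two-sided, two-sample comparison of means with equal bucket sizes and a known common standard deviation, and the function names and default values are my own assumptions.

```python
import math
from statistics import NormalDist


def bucket_size(mde, sigma, alpha=0.05, power=0.8):
    """Observations needed per bucket to detect a mean difference `mde`.

    Uses n = 2 * (sigma * (z_{1 - alpha/2} + z_{power}) / mde) ** 2,
    the normal approximation for a two-sided two-sample test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_power) / mde) ** 2)


def minimal_detectable_effect(n, sigma, alpha=0.05, power=0.8):
    """Smallest mean difference detectable with `n` observations per bucket."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sigma * math.sqrt(2 / n)


# Hypothetical example: detect a shift of 0.1 standard deviations.
n_per_bucket = bucket_size(mde=0.1, sigma=1.0)
print(n_per_bucket)
print(minimal_detectable_effect(n_per_bucket, sigma=1.0))
```

The thing to notice is the inverse-square scaling: halving the effect you want to detect roughly quadruples the bucket size, which is why small effects are expensive even at web scale.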

Bob Sturm recommends Bailey (2008) for a discussion of hypothesis testing in terms of linear subspaces.

Everything so far has been in a frequentist framing. The entire question of hypothesis testing is much more likely to be vacuous in Bayesian settings (although Bayes model selection is a thing). See also Thomas Lumley on a Bayesian t-test, which ends up being a kind of bootstrap in a very elegant way.

I cannot decide if tea-lang is a passive-aggressive joke or not. It is a compiler for statistical tests.

Tea is a domain specific programming language that automates statistical test selection and execution… Users provide 5 pieces of information:

  • the dataset of interest,
  • the variables in the dataset they want to analyze,
  • the study design (e.g., independent, dependent variables),
  • the assumptions they make about the data based on domain knowledge (e.g., a variable is normally distributed), and
  • a hypothesis.

Tea then "compiles" these into logical constraints to select valid statistical tests. Tests are considered valid if and only if all the assumptions they make about the data (e.g., normal distribution, equal variance between groups, etc.) hold. Tea then finally executes the valid tests.

Goodness-of-fit tests

Also a useful thing to have; the hypothesis here is kind-of more interesting, along the lines of it-is-unlikely-that-the-model-you-propose-contains-this-data.
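A minimal sketch of what such a test can look like, assuming the simplest possible case: categorical counts compared against a fully specified multinomial model using Pearson's chi-square statistic. To stay within the standard library I estimate the p-value by simulating from the proposed model rather than consulting the chi-square distribution; that Monte Carlo step is my choice for illustration, and the die-roll counts are made up.

```python
import random


def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def monte_carlo_gof_pvalue(observed, probs, n_sim=2000, seed=0):
    """Return the chi-square statistic and a simulated p-value under `probs`."""
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = chi_square_statistic(observed, expected)
    categories = range(len(probs))
    exceed = 0
    for _ in range(n_sim):
        # Draw a data set of the same size from the proposed model.
        simulated = [0] * len(probs)
        for category in rng.choices(categories, weights=probs, k=n):
            simulated[category] += 1
        if chi_square_statistic(simulated, expected) >= stat:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return stat, (exceed + 1) / (n_sim + 1)


fair_die = [1 / 6] * 6
plausible_counts = [18, 22, 20, 19, 21, 20]  # hypothetical 120 rolls
suspicious_counts = [10, 12, 14, 20, 24, 40]  # hypothetical 120 rolls

print(monte_carlo_gof_pvalue(plausible_counts, fair_die))
print(monte_carlo_gof_pvalue(suspicious_counts, fair_die))
```

A small p-value says that data this far from the model's expectations would rarely arise if the model were true. A large one says only that this particular statistic found no evidence against the model.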

Design of experiments


References

Bailey, R. A. 2008. Design of Comparative Experiments. 1st ed. Cambridge Series on Statistical and Probabilistic Mathematics. Cambridge; New York: Cambridge University Press.

Colbourn, Charles J., and Jeffrey H. Dinitz. 2010. Handbook of Combinatorial Designs, Second Edition. CRC Press.

Efron, Bradley. 2008. “Simultaneous Inference: When Should Hypothesis Testing Problems Be Combined?” The Annals of Applied Statistics 2 (1): 197–223. https://doi.org/10.1214/07-AOAS141.

Geer, Sara van de. 2016. Estimation and Testing Under Sparsity. Vol. 2159. Lecture Notes in Mathematics. Cham: Springer International Publishing. http://link.springer.com/10.1007/978-3-319-32774-7.

Kohavi, Ron, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. 2009. “Controlled Experiments on the Web: Survey and Practical Guide.” Data Mining and Knowledge Discovery 18 (1): 140–81. https://doi.org/10.1007/s10618-008-0114-1.

Korattikara, Anoop, Yutian Chen, and Max Welling. 2015. “Sequential Tests for Large-Scale Learning.” Neural Computation 28 (1): 45–70. https://doi.org/10.1162/NECO_a_00796.

Lavergne, Pascal, Samuel Maistre, and Valentin Patilea. 2015. “A Significance Test for Covariates in Nonparametric Regression.” Electronic Journal of Statistics 9: 643–78. https://doi.org/10.1214/15-EJS1005.

Lehmann, Erich L., and Joseph P. Romano. 2010. Testing Statistical Hypotheses. 3rd ed. Springer Texts in Statistics. New York, NY: Springer.

Maesono, Yoshihiko, Taku Moriyama, and Mengxin Lu. 2016. “Smoothed Nonparametric Tests and Their Properties,” October. http://arxiv.org/abs/1610.02145.

Malevergne, Yannick, and Didier Sornette. 2003. “Testing the Gaussian Copula Hypothesis for Financial Assets Dependences.” Quantitative Finance 3 (4): 231–50. https://doi.org/10.1088/1469-7688/3/4/301.

McShane, Blakeley B., David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett. 2019. “Abandon Statistical Significance.” The American Statistician 73 (sup1): 235–45. https://doi.org/10.1080/00031305.2018.1527253.

Na, Seongryong. 2009. “Goodness-of-Fit Test Using Residuals in Infinite-Order Autoregressive Models.” Journal of the Korean Statistical Society 38 (3): 287–95. https://doi.org/10.1016/j.jkss.2008.12.002.

Ormerod, John T., Michael Stewart, Weichang Yu, and Sarah E. Romanes. 2017. “Bayesian Hypothesis Tests with Diffuse Priors: Can We Have Our Cake and Eat It Too?” October. http://arxiv.org/abs/1710.09146.

Paparoditis, Efstathios, and Theofanis Sapatinas. 2014. “Bootstrap-Based Testing for Functional Data,” September. http://arxiv.org/abs/1409.4317.

Sejdinovic, Dino, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. 2012. “Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing.” The Annals of Statistics 41 (5): 2263–91. https://doi.org/10.1214/13-AOS1140.

Tang, Minh, Avanti Athreya, Daniel L. Sussman, Vince Lyzinski, and Carey E. Priebe. 2014. “A Nonparametric Two-Sample Hypothesis Testing Problem for Random Dot Product Graphs,” September. http://arxiv.org/abs/1409.2344.