# Maximum likelihood inference

M-estimation based on maximising the likelihood of the observed data with respect to the parameters of the model.

See also expectation maximisation, information criteria, robust statistics, decision theory, all of machine learning, optimisation etc.

One intuitively natural way of choosing the “best” parameter values for a model based on the data you have. It is prized for various nice properties, especially in the asymptotic limit, and especially, especially for exponential families. It produces, as side-products, some good asymptotic hypothesis tests and some model comparison statistics, most notably the Akaike Information Criterion.

It has rather fewer nice properties for small sample sizes, but is still regarded as a respectable default choice.

This is an extremum estimator with objective (i.e. negative loss) function

${\hat \ell }(\theta |x)={\frac 1n}\sum _{{i=1}}^{n}\ln f(x_{i}|\theta ),$

which is motivated as being the sample estimate of the expected log-likelihood

$\ell (\theta )=\operatorname {E}_{\theta_0}[\,\ln f(x_{i}|\theta )\,]$

for true and unknown parameter value $$\theta_0$$.

Why we choose this particular loss function is a whole other question, or, rather, a whole other field of research. Others are possible, but this one is a nice start.
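
A minimal numerical sketch of the objective above, assuming an exponential model $$f(x|\theta)=\theta e^{-\theta x}$$, which is a choice made purely for illustration. The helper names (`avg_log_likelihood`, `maximise`) and the simulated data are hypothetical; the point is only that maximising the sample average log-likelihood recovers the closed-form estimator $$1/\bar{x}$$.

```python
import math
import random

def avg_log_likelihood(rate, xs):
    """Sample average of ln f(x_i | rate) for the density rate * exp(-rate * x)."""
    return sum(math.log(rate) - rate * x for x in xs) / len(xs)

def maximise(objective, lo, hi, tol=1e-10):
    """Golden-section search for the maximiser of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if objective(c) > objective(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2

random.seed(0)
true_rate = 2.0  # theta_0, known here only because the data are simulated
xs = [random.expovariate(true_rate) for _ in range(5000)]

rate_hat = maximise(lambda r: avg_log_likelihood(r, xs), 1e-6, 50.0)
closed_form = len(xs) / sum(xs)  # analytic maximiser: 1 / sample mean
print(rate_hat, closed_form)
```

In realistic models there is no closed form and the search is multivariate, so one would reach for a proper optimiser; the one-dimensional search here just keeps the sketch dependency-free.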

## Fisher Information

Used in ML theory and kinda-sorta in robust estimation. A matrix that tells you how much a new datum affects your parameter estimates. See large sample theory.
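
For reference, a sketch of the standard definitions, assuming the usual regularity conditions (the density is smooth in $$\theta$$ and differentiation can be swapped with expectation). The symbols $$\mathcal{I}$$, $$\mathcal{N}$$ and $$\hat\theta_n$$ are notation introduced for this sketch. The Fisher information is

$\mathcal{I}(\theta)=\operatorname{E}_{\theta}\left[\left(\frac{\partial}{\partial\theta}\ln f(x|\theta)\right)\left(\frac{\partial}{\partial\theta}\ln f(x|\theta)\right)^{\prime}\right]=-\operatorname{E}_{\theta}\left[\frac{\partial^{2}}{\partial\theta\,\partial\theta^{\prime}}\ln f(x|\theta)\right],$

and the maximum likelihood estimate $$\hat\theta_n$$ from $$n$$ iid observations then satisfies

$\sqrt{n}\,(\hat\theta_n-\theta_0)\xrightarrow{d}\mathcal{N}\left(0,\mathcal{I}(\theta_0)^{-1}\right).$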

## Fun features with exponential families

🏗

### Conditional transformation models

What are they? (Hothorn, Kneib, and Bühlmann 2014; Hothorn, Möst, and Bühlmann 2015).

### The method of sieves

Nonparametrics and maximum likelihood? (Geman and Hwang 1982):

Maximum likelihood estimation often fails when the parameter takes values in an infinite dimensional space. For example, the maximum likelihood method cannot be applied to the completely nonparametric estimation of a density function from an iid sample; the maximum of the likelihood is not attained by any density. In this example, as in many other examples, the parameter space (positive functions with area one) is too big. But the likelihood method can often be salvaged if we first maximize over a constrained subspace of the parameter space and then relax the constraint as the sample size grows. This is Grenander’s “method of sieves.” Application of the method sometimes leads to new estimators for familiar problems, or to a new motivation for an already well-studied technique.
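
One way to make the quoted idea concrete is the histogram: restrict the density to be piecewise constant on a fixed set of bins, maximise the likelihood within that class (the maximiser is the ordinary histogram), and let the number of bins grow with the sample size. The sketch below assumes data on a known interval; the function names and the $$n^{1/3}$$ growth rule are illustrative assumptions, not taken from Geman and Hwang.

```python
import random

def histogram_sieve_mle(xs, lo, hi, n_bins):
    """MLE over densities on [lo, hi] that are constant on n_bins equal-width bins.

    Within this restricted class the likelihood is maximised by the bin
    proportions divided by the bin width, i.e. the ordinary histogram.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        k = min(int((x - lo) / width), n_bins - 1)  # x == hi goes in the last bin
        counts[k] += 1
    return [c / (len(xs) * width) for c in counts]

def sieve_size(n):
    """Hypothetical growth rule for the sieve: roughly n^(1/3) bins."""
    return max(1, round(n ** (1 / 3)))

random.seed(1)
for n in (100, 10_000):
    xs = [random.betavariate(2, 5) for _ in range(n)]  # made-up data on [0, 1]
    density = histogram_sieve_mle(xs, 0.0, 1.0, sieve_size(n))
    print(n, len(density), [round(h, 2) for h in density])
```

With the constraint removed (arbitrarily many bins for fixed $$n$$) the estimate degenerates into spikes at the data points, which is the failure the quote describes.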

### Variants

Wherein we resolve lexical confusion using brute-force clarity.

What is the difference between a partial likelihood, profile likelihood and marginal likelihood?

## Conditional likelihood

You have incidental nuisance parameters? If you can find a sufficient statistic for them and then condition upon it, they vanish.
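
A sketch in symbols, with notation introduced here: write $$\theta$$ for the parameter of interest, $$\eta$$ for the nuisance parameter, and suppose $$t=T(x)$$ is sufficient for $$\eta$$ at each fixed $$\theta$$. Then the density factorises as

$f(x|\theta,\eta)=f(x|t,\theta)\,f(t|\theta,\eta),$

and the conditional likelihood $$L_c(\theta)=f(x|t,\theta)$$ does not involve $$\eta$$ at all.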

## Marginal likelihood

“the marginal probability of the data given the model, with marginalization performed over unobserved variables”

The version that crops up in Bayesian inference. And elsewhere? Need to make this bit precise.
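
In the Bayesian usage, at least, the quoted definition can be written out as follows, where $$M$$ denotes the model and $$p(\theta|M)$$ a prior over its unobserved parameters (symbols introduced for this sketch):

$p(x|M)=\int p(x|\theta,M)\,p(\theta|M)\,\mathrm{d}\theta.$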

TBD

## Partial likelihood

What’s that? I will start by mangling an introduction from the internet (Where?)

Let $$Y_i$$ denote the observed time (either censoring time or event time) for subject $$i$$, and let $$C_i$$ be the indicator that the time corresponds to an event (i.e. if $$C_i=1$$ the event occurred and if $$C_i=0$$ the time is a censoring time). The hazard function for the Cox proportional hazard model has the form

$\lambda(t|X) = \lambda_0(t)\exp(\beta_1 X_1 + \cdots + \beta_p X_p) = \lambda_0(t)\exp(X^\prime \beta).$

This expression gives the hazard at time $$t$$ for an individual with covariate vector (explanatory variables) $$X$$. Based on this hazard function, a partial likelihood can be constructed from the dataset as

$L(\beta) = \prod_{i:C_i=1}\frac{\theta_i}{\sum_{j:Y_j\ge Y_i}\theta_j},$

where $$\theta_j=\exp(X_j^\prime\beta)$$ and $$X_1, \dots, X_n$$ are the covariate vectors for the $$n$$ independently sampled individuals in the dataset (treated here as column vectors).

The corresponding log partial likelihood is

$\ell(\beta) = \sum_{i:C_i=1} \left(X_i^\prime \beta - \log \sum_{j:Y_j\ge Y_i}\theta_j\right).$

This function can be maximized over $$\beta$$ to produce maximum partial likelihood estimates of the model parameters.

The partial score is

$\ell^\prime(\beta) = \sum_{i:C_i=1} \left(X_i - \frac{\sum_{j:Y_j\ge Y_i}\theta_j X_j}{\sum_{j:Y_j\ge Y_i}\theta_j}\right),$

and the Hessian of the partial log likelihood is

$\ell^{\prime\prime}(\beta) = -\sum_{i:C_i=1} \left(\frac{\sum_{j:Y_j\ge Y_i}\theta_jX_jX_j^\prime}{\sum_{j:Y_j\ge Y_i}\theta_j} - \frac{\sum_{j:Y_j\ge Y_i}\theta_jX_j\times \sum_{j:Y_j\ge Y_i}\theta_jX_j^\prime}{[\sum_{j:Y_j\ge Y_i}\theta_j]^2}\right).$

Using this score function and Hessian matrix, the partial likelihood can be maximized in the usual fashion, for example by Newton-Raphson iteration. The inverse of the negative Hessian matrix, evaluated at the estimate of $$\beta$$, can be used as an approximate variance-covariance matrix for the estimate, also in the usual fashion.
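
The recipe above can be sketched in a few lines for the simplest case of a single scalar covariate, so that the score and Hessian are scalars. This is a naive illustration under stated assumptions: the function names are hypothetical, tied event times get no special treatment, the Newton iteration has no step-size control, and the data are simulated with a constant baseline hazard.

```python
import math
import random

def score_and_hessian(beta, times, events, xs):
    """Partial score and Hessian at beta for one scalar covariate."""
    score, hess = 0.0, 0.0
    for i, (y_i, c_i) in enumerate(zip(times, events)):
        if c_i != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set {j : Y_j >= Y_i}, with weights theta_j = exp(beta * X_j)
        risk = [(math.exp(beta * x_j), x_j)
                for y_j, x_j in zip(times, xs) if y_j >= y_i]
        s0 = sum(th for th, _ in risk)
        s1 = sum(th * x for th, x in risk)
        s2 = sum(th * x * x for th, x in risk)
        score += xs[i] - s1 / s0
        hess -= s2 / s0 - (s1 / s0) ** 2
    return score, hess

def fit_cox(times, events, xs, n_iter=20):
    """Newton-Raphson on the log partial likelihood, starting from beta = 0."""
    beta = 0.0
    for _ in range(n_iter):
        score, hess = score_and_hessian(beta, times, events, xs)
        beta -= score / hess
    _, hess = score_and_hessian(beta, times, events, xs)
    return beta, -1.0 / hess  # estimate and its approximate variance

random.seed(2)
true_beta = 0.7  # known only because the data are simulated
xs = [random.gauss(0, 1) for _ in range(200)]
event_times = [random.expovariate(math.exp(true_beta * x)) for x in xs]
censor_times = [random.expovariate(0.5) for _ in xs]
times = [min(t, c) for t, c in zip(event_times, censor_times)]
events = [1 if t <= c else 0 for t, c in zip(event_times, censor_times)]

beta_hat, var_hat = fit_cox(times, events, xs)
print(beta_hat, math.sqrt(var_hat))
```

Note that the baseline hazard $$\lambda_0$$ never appears in the code, which is the whole appeal of the partial likelihood. The double loop over subjects costs $$O(n^2)$$ per iteration; sorting by time and accumulating the risk-set sums would be the obvious refinement.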

## Pseudo-likelihood

Dunno. As seen in spatial point processes and other undirected random fields.

Originally Besag (1975, 1977) defined the pseudolikelihood of a finite set of random variables $$X_1, \dots, X_n$$ as the product of the conditional likelihoods of each $$X_i$$ given the other variables $$\{X_j, j \neq i\}$$. This was extended (Besag, 1977; Besag et al., 1982) to point processes, for which it can be viewed as an infinite product of infinitesimal conditional probabilities.
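
In symbols, the finite-variable definition reads as follows, where $$\theta$$ is the model parameter and $$\mathrm{PL}$$ is notation introduced for this sketch:

$\mathrm{PL}(\theta;x)=\prod_{i=1}^{n}f(x_i\,|\,x_j, j\neq i;\theta).$

The attraction is that each conditional density typically avoids the intractable normalising constant of the joint density.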

## Quasi-likelihood

The casual explanation I got was that this is somewhat like maximum likelihood inference, but based solely upon the means and variances of the observations rather than a full distributional model. Also, if you have over-dispersed data for a Poisson regression, this will help you.

AFAICT this is exclusively relevant to generalised linear models.
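
A toy sketch of the over-dispersed Poisson case, assuming an intercept-only model with variance function $$\operatorname{Var}(y)=\phi\mu$$. The counts are made up, and the function name and the Pearson-based dispersion estimate are illustrative choices rather than anything specified above.

```python
import math

def quasi_poisson_mean(ys):
    """Intercept-only quasi-Poisson fit with Var(y) = phi * mu.

    The quasi-score equation sum((y - mu) / (phi * mu)) = 0 gives mu_hat = mean(y),
    exactly as for Poisson maximum likelihood; only the standard error changes.
    """
    n = len(ys)
    mu_hat = sum(ys) / n
    pearson = sum((y - mu_hat) ** 2 / mu_hat for y in ys)
    phi_hat = pearson / (n - 1)  # Pearson chi-square over residual degrees of freedom
    se_poisson = math.sqrt(mu_hat / n)            # assumes phi = 1
    se_quasi = math.sqrt(phi_hat * mu_hat / n)    # inflated by sqrt(phi_hat)
    return mu_hat, phi_hat, se_poisson, se_quasi

# Made-up counts with far more spread than a Poisson with this mean would show.
ys = [0, 1, 0, 7, 2, 12, 0, 3, 9, 1, 0, 15]
mu_hat, phi_hat, se_poisson, se_quasi = quasi_poisson_mean(ys)
print(mu_hat, phi_hat, se_poisson, se_quasi)
```

The point estimate is untouched; the dispersion estimate only rescales the uncertainty, which is why this is a cheap fix for over-dispersion.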

## H-likelihood

…is some kind of extension to quasi-likelihood, for hierarchical generalised linear models.

## References

Arnold, Barry C., and David Strauss. 1991. “Pseudolikelihood Estimation: Some Examples.” Sankhyā: The Indian Journal of Statistics, Series B (1960-2002) 53 (2): 233–43. http://www.jstor.org/stable/25052695.

Baddeley, Adrian, and Rolf Turner. 2000. “Practical Maximum Pseudolikelihood for Spatial Point Patterns.” Australian & New Zealand Journal of Statistics 42 (3): 283–322. https://doi.org/10.1111/1467-842X.00128.

Berman, Mark, and T. Rolf Turner. 1992. “Approximating Point Process Likelihoods with GLIM.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 41 (1): 31–38. https://doi.org/10.2307/2347614.

Bertl, Johanna, Gregory Ewing, Carolin Kosiol, and Andreas Futschik. 2015. “Approximate Maximum Likelihood Estimation,” July. http://arxiv.org/abs/1507.04553.

Besag, Julian. 1974. “Spatial Interaction and the Statistical Analysis of Lattice Systems.” Journal of the Royal Statistical Society. Series B (Methodological) 36 (2): 192–236. https://doi.org/10.1111/j.2517-6161.1974.tb00999.x.

———. 1975. “Statistical Analysis of Non-Lattice Data.” Journal of the Royal Statistical Society. Series D (the Statistician) 24 (3): 179–95. https://doi.org/10.2307/2987782.

———. 1977. “Efficiency of Pseudolikelihood Estimation for Simple Gaussian Fields.” Biometrika 64 (3): 616–18. https://doi.org/10.2307/2345341.

Cox, D. R. 1975. “Partial Likelihood.” Biometrika 62 (2): 269–76. https://doi.org/10.1093/biomet/62.2.269.

Cox, D. R., and N. Reid. 2004. “A Note on Pseudolikelihood Constructed from Marginal Densities.” Biometrika 91 (3): 729–37. https://doi.org/10.1093/biomet/91.3.729.

Efron, Bradley. 1986. “How Biased Is the Apparent Error Rate of a Prediction Rule?” Journal of the American Statistical Association 81 (394): 461–70. https://doi.org/10.1080/01621459.1986.10478291.

Efron, Bradley, and David V. Hinkley. 1978. “Assessing the Accuracy of the Maximum Likelihood Estimator: Observed Versus Expected Fisher Information.” Biometrika 65 (3): 457–83. https://doi.org/10.1093/biomet/65.3.457.

Flammia, Steven T., David Gross, Yi-Kai Liu, and Jens Eisert. 2012. “Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators.” New Journal of Physics 14 (9): 095022. https://doi.org/10.1088/1367-2630/14/9/095022.

Geman, Stuart, and Chii-Ruey Hwang. 1982. “Nonparametric Maximum Likelihood Estimation by the Method of Sieves.” The Annals of Statistics 10 (2): 401–14. https://doi.org/10.1214/aos/1176345782.

Geyer, Charles J. 1991. “Markov Chain Monte Carlo Maximum Likelihood.” http://conservancy.umn.edu/handle/11299/58440.

Gong, Gail, and Francisco J. Samaniego. 1981. “Pseudo Maximum Likelihood Estimation: Theory and Applications.” The Annals of Statistics 9 (4): 861–69. http://www.jstor.org/stable/2240854.

Goulard, Michel, Aila Särkkä, and Pavel Grabarnik. 1996. “Parameter Estimation for Marked Gibbs Point Processes Through the Maximum Pseudo-Likelihood Method.” Scandinavian Journal of Statistics, 365–79. http://www.jstor.org/stable/4616410.

Heyde, C. C. 1997. Quasi-Likelihood and Its Application: A General Approach to Optimal Parameter Estimation. New York: Springer. http://site.ebrary.com/id/10015678.

Hothorn, Torsten, Thomas Kneib, and Peter Bühlmann. 2014. “Conditional Transformation Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 3–27. https://doi.org/10.1111/rssb.12017.

Hothorn, Torsten, Lisa Möst, and Peter Bühlmann. 2015. “Most Likely Transformations,” August. http://arxiv.org/abs/1508.06749.

Hu, Feifang, and James V. Zidek. 2002. “The Weighted Likelihood.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 30 (3): 347–71. https://doi.org/10.2307/3316141.

Huang, Fuchun, and Yosihiko Ogata. 1999. “Improvements of the Maximum Pseudo-Likelihood Estimators in Various Spatial Statistical Models.” Journal of Computational and Graphical Statistics 8 (3): 510–30. https://doi.org/10.1080/10618600.1999.10474829.

Introduction to Variance Estimation. 2007. Statistics for Social and Behavioral Sciences. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-35099-8.

Janková, Jana, and Sara van de Geer. 2015. “Honest Confidence Regions and Optimality in High-Dimensional Precision Matrix Estimation,” July. http://arxiv.org/abs/1507.02061.

Jensen, Jens Ledet, and Hans R. Künsch. 1994. “On Asymptotic Normality of Pseudo Likelihood Estimates for Pairwise Interaction Processes.” Annals of the Institute of Statistical Mathematics 46 (3): 475–86. https://doi.org/10.1007/BF00773511.

Jensen, Jens Ledet, and Jesper Møller. 1991. “Pseudolikelihood for Exponential Family Models of Spatial Point Processes.” The Annals of Applied Probability 1 (3): 445–61. https://doi.org/10.1214/aoap/1177005877.

Kasy, Maximilian. 2015. “Uniformity and the Delta Method,” July. http://arxiv.org/abs/1507.05731.

Millar, Russell B. 2011. Maximum Likelihood Estimation and Inference: With Examples in R, SAS and ADMB. Statistics in Practice. Chichester, UK: John Wiley & Sons, Ltd. http://doi.wiley.com/10.1002/9780470094846.

Ollinger, J. M. 1990. “Iterative Reconstruction-Reprojection and the Expectation-Maximization Algorithm.” IEEE Transactions on Medical Imaging 9 (1): 94–98. https://doi.org/10.1109/42.52986.

Raue, A., C. Kreutz, T. Maiwald, J. Bachmann, M. Schilling, U. Klingmüller, and J. Timmer. 2009. “Structural and Practical Identifiability Analysis of Partially Observed Dynamical Models by Exploiting the Profile Likelihood.” Bioinformatics 25 (15): 1923–9. https://doi.org/10.1093/bioinformatics/btp358.

Strauss, David, and Michael Ikeda. 1990. “Pseudolikelihood Estimation for Social Networks.” Journal of the American Statistical Association 85 (409): 204–12. http://www.stat.cmu.edu/~fienberg/Stat36-835/FrankIkeda-JASA-1980.pdf.

Sundberg, Rolf. 1976. “An Iterative Method for Solution of the Likelihood Equations for Incomplete Data from Exponential Families.” Communications in Statistics - Simulation and Computation 5 (1): 55–64. https://doi.org/10.1080/03610917608812007.

Tibshirani, Ryan J., Alessandro Rinaldo, Robert Tibshirani, and Larry Wasserman. 2015. “Uniform Asymptotic Inference and the Bootstrap After Model Selection,” June. http://arxiv.org/abs/1506.06266.

Vanlier, J., C. A. Tiemann, P. a. J. Hilbers, and N. A. W. van Riel. 2012. “An Integrated Strategy for Prediction Uncertainty Analysis.” Bioinformatics 28 (8): 1130–5. https://doi.org/10.1093/bioinformatics/bts088.

Varin, Cristiano. 2008. “On Composite Marginal Likelihoods.” Advances in Statistical Analysis 92 (1): 1–28. https://doi.org/10.1007/s10182-008-0060-7.

Varin, Cristiano, Nancy Reid, and David Firth. 2011. “An Overview of Composite Likelihood Methods.” Statistica Sinica 21 (1): 5–42. http://www3.stat.sinica.edu.tw/statistica/J21N1/j21n11/j21n11.html.

Wang, Steven Xiaogang. 2001. “Maximum Weighted Likelihood Estimation.” Retrospective Theses and Dissertations, 1919-2007. https://doi.org/10.14288/1.0090880.

Wedderburn, R. W. M. 1974. “Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss—Newton Method.” Biometrika 61 (3): 439–47. https://doi.org/10.1093/biomet/61.3.439.

Wolter, Kirk M. 2007. “Taylor Series Methods.” In Introduction to Variance Estimation, edited by Kirk M. Wolter, 226–71. Statistics for Social and Behavioral Sciences. New York, NY: Springer. https://doi.org/10.1007/978-0-387-35099-8_6.