nonparametric on Dan MacKinlay
https://danmackinlay.name/tags/nonparametric.html
Recent content in nonparametric on Dan MacKinlay.

Reparameterization tricks in inference
https://danmackinlay.name/notebook/reparameterization_trick.html — Tue, 21 Dec 2021
Contents: Tutorials; Normalizing flows; “Normalized” flows; For density estimation; Representational power of; References
Approximating the desired distribution by perturbation of the available distribution.
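A minimal numpy sketch of the trick this entry describes: push the parameter out of the sampling distribution so the Monte Carlo gradient passes inside the expectation. The function name and the test function are my own illustrative choices.

```python
import numpy as np

def grad_mu_reparam(mu, sigma, f_grad, n=100_000, seed=0):
    """Estimate d/dmu E[f(x)], x ~ N(mu, sigma^2), by reparameterization.

    Writing x = mu + sigma * eps with eps ~ N(0, 1) moves the parameter
    out of the sampling distribution, so the derivative passes inside
    the expectation: d/dmu E[f(x)] = E[f'(mu + sigma * eps)].
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)       # noise, parameter-free
    x = mu + sigma * eps               # differentiable in mu
    return f_grad(x).mean()

# f(x) = x^2, so E[f(x)] = mu^2 + sigma^2 and the exact gradient is 2*mu.
g = grad_mu_reparam(mu=1.5, sigma=0.7, f_grad=lambda x: 2 * x)
```

With \(x=\mu+\sigma\varepsilon\) the samples depend on \(\mu\) deterministically, so an autodiff framework could differentiate straight through them; here the inner derivative is taken by hand.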
A trick in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “a fancy change of variables such that I can differentiate through the parameters of a distribution”, usually by Monte Carlo. Storchastic credits this to Glasserman and Ho (1991) as perturbation analysis.

t-processes
https://danmackinlay.name/notebook/t_process.html — Wed, 24 Nov 2021
Contents: t-processes regression; Markov t-process; References
Stochastic processes with Student-\(t\) marginals. Much as Student-\(t\) distributions generalise Gaussian distributions, \(t\)-processes generalise Gaussian processes.
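One way to see the generalisation: a multivariate Student-\(t\) draw is a Gaussian draw rescaled by a chi-square mixing variable, as in this hypothetical numpy sketch (function name my own):

```python
import numpy as np

def sample_mvt(mean, cov, df, n, rng):
    """Draw n samples from a multivariate Student-t by scaling Gaussians.

    If z ~ N(0, cov) and w ~ chi2(df), then mean + z * sqrt(df / w)
    has a multivariate t distribution with df degrees of freedom --
    heavier-tailed than the Gaussian, recovering it as df -> infinity.
    """
    z = rng.multivariate_normal(np.zeros_like(mean), cov, size=n)
    w = rng.chisquare(df, size=n)
    return mean + z * np.sqrt(df / w)[:, None]

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
draws = sample_mvt(np.zeros(2), cov, df=4.0, n=50_000, rng=rng)
```

The single shared scaling variable per draw is what couples the marginals; rescaling each coordinate independently would give a different (non-elliptical) process.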
t-processes regression: There are a couple of classic cases in ML where \(t\)-processes arise, e.g. in Bayes NNs (Neal 1996) or the GP literature (Rasmussen and Williams 2006, §9.9). Recently there has been an uptick in actual applications of these processes in regression (Chen, Wang, and Gorban 2020; Shah, Wilson, and Ghahramani 2014; Tang et al.).

Probabilistic neural nets
https://danmackinlay.name/notebook/nn_probabilistic.html — Wed, 03 Nov 2021
Contents: Backgrounders; MC sampling of weights by low-rank Matheron updates; Mixture density networks; Variational autoencoders; Sampling via Monte Carlo; Stochastic Gradient Descent as MC inference; Laplace approximation; Via random projections; In Gaussian process regression; Via measure transport; Via infinite-width random nets; Via NTK; Ensemble methods; Practicalities; References
Inferring densities and distributions in a massively parameterised deep learning setting.
This is not intrinsically a Bayesian thing to do, but in practice much of the demand for probabilistic nets comes from the demand for Bayesian posterior inference in neural nets.

Polynomial bases
https://danmackinlay.name/notebook/polynomial_bases.html — Wed, 06 Oct 2021
Contents: Fun things; Well known facts; Zoo; tools; References
Placeholder.
Fun things: Terry Tao on Conversions between standard polynomial bases.
Well known facts: Xiu and Karniadakis (2002) mention the following “well known facts”:
All orthogonal polynomials \(\left\{Q_{n}(x)\right\}\) satisfy a three-term recurrence relation \[ -x Q_{n}(x)=A_{n} Q_{n+1}(x)-\left(A_{n}+C_{n}\right) Q_{n}(x)+C_{n} Q_{n-1}(x), \quad n \geq 1 \] where \(A_{n}, C_{n} \neq 0\) and \(C_{n} / A_{n-1}>0\). Together with \(Q_{-1}(x)=0\) and \(Q_{0}(x)=1\), all \(Q_{n}(x)\) can be determined by the recurrence relation.

Bootstrap
https://danmackinlay.name/notebook/bootstrap.html — Sat, 18 Sep 2021
Contents: Bootstrap bias correction; Bootstrap for dependent data; Causal bootstrap; As a Bayesian method; Pedagogic; References
Resampling your own data to estimate how good your point estimator is, and to reduce its bias. In general an intuitive technique; however, it gets tricky for e.g. dependent data. For a handy crib sheet of bootstrap failure modes, see Thomas Lumley, When the bootstrap doesn’t work.
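In the simplest i.i.d. setting, the bootstrap recipe is: resample with replacement, recompute the statistic, read off the spread. A hypothetical numpy sketch (names my own):

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error of a statistic: resample the data with
    replacement, recompute the statistic on each resample, and take the
    spread of the replicates as an estimate of sampling variability."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.array([
        statistic(data[rng.integers(0, n, size=n)])  # one resample
        for _ in range(n_boot)
    ])
    return reps.std(ddof=1)

rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=2.0, size=200)
se = bootstrap_se(x, np.mean)  # theory: roughly 2 / sqrt(200) ~ 0.14
```

For the sample mean this lands near the textbook \(\sigma/\sqrt{n}\); for dependent data, blindly resampling individual observations like this is exactly what fails.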
In the classical mode, this is a frequentist technique without an immediate Bayesian interpretation.

Karhunen-Loève expansions
https://danmackinlay.name/notebook/karhunen_loeve.html — Fri, 27 Aug 2021
Contents: Gaussian; References
Suppose we have a collection \(\{\varphi_n\}\) of real-valued functions on our index space \(T\), and a collection \(\{\xi_n\}\) of uncorrelated random variables. Now we define the random process \[ f(t)=\sum_{n=1}^{\infty} \xi_{n} \varphi_{n}(t). \] We might care about the first two moments of \(f\), i.e. the covariance \[ \mathbb{E}\{f(s) f(t)\}=\sum_{n=1}^{\infty} \sigma_{n}^{2} \varphi_{n}(s) \varphi_{n}(t) \] and the variance function \[ \mathbb{E}\left\{f^{2}(t)\right\}=\sum_{n=1}^{\infty} \sigma_{n}^{2} \varphi_{n}^{2}(t). \]
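These moment formulas are easy to check numerically for a truncated expansion; a sketch with an illustrative (not canonical) cosine basis and spectrum \(\sigma_n^2 = n^{-2}\):

```python
import numpy as np

# A truncated Karhunen-Loeve-style expansion on T = [0, 1]: cosine basis
# functions phi_n and independent xi_n ~ N(0, sigma_n^2) with decaying
# variances. Basis and spectrum are illustrative choices.
N_TERMS, N_PATHS = 8, 20_000
t = np.linspace(0.0, 1.0, 50)
idx = np.arange(1, N_TERMS + 1)
phi = np.cos(np.outer(idx, np.pi * t))         # (N_TERMS, len(t))
sigma2 = 1.0 / idx**2                          # spectrum sigma_n^2

rng = np.random.default_rng(1)
xi = rng.normal(scale=np.sqrt(sigma2), size=(N_PATHS, N_TERMS))
f = xi @ phi                                   # sample paths f(t)

var_mc = (f**2).mean(axis=0)                   # Monte Carlo E[f^2(t)]
var_exact = sigma2 @ phi**2                    # sum_n sigma_n^2 phi_n^2(t)
```

The Monte Carlo variance of the sample paths should track the closed-form \(\sum_n \sigma_n^2 \varphi_n^2(t)\) pointwise.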
Now suppose that we have a stochastic process where the index set \(T\) is a compact domain in \(\mathbb{R}^{N}\).

Gaussian processes
https://danmackinlay.name/notebook/gp.html — Wed, 23 Jun 2021
Contents: Derivatives and integrals; Integral of a Gaussian process; Derivative of a Gaussian process; References
“Gaussian processes” are stochastic processes/fields with jointly Gaussian distributions over all finite sets of observation points. The most familiar of these to finance and physics people is usually the Gauss-Markov process, a.k.a. the Wiener process, but there are many others. These processes are convenient due to certain useful properties of the multivariate Gaussian distribution…

Random-forest-like methods
https://danmackinlay.name/notebook/boosting_bagging.html — Thu, 17 Jun 2021
Contents: Random trees, forests, jungles; Self-regularising properties; Gradient boosting; Bayes; Implementations (LightGBM, xgboost, catboost, surfin, bartmachine); References
Doubling down on ensemble methods: mixing predictions from many weak learners (in this case decision trees) to get strong learners. Boosting, bagging and other weak-learner ensembles.
There are many flavours of random-forest-like learning systems. The rule of thumb seems to be “fast to train, fast to use”…

Neural nets with basis decomposition layers
https://danmackinlay.name/notebook/nn_basis.html — Tue, 09 Mar 2021
Contents: Unrolling: implementing sparse coding using neural nets; Convolutional neural networks as sparse coding; Continuous basis functions; References
Neural networks incorporating basis decompositions.
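The simplest such layer is a linear map whose weights are a fixed orthonormal basis, so the “analysis” and “synthesis” passes are transposes of each other. A hypothetical numpy sketch with a DCT-II cosine basis:

```python
import numpy as np

def cosine_basis(n):
    """Orthonormal DCT-II-style basis matrix B, rows = basis functions."""
    k = np.arange(n)
    B = np.cos(np.pi * np.outer(k, k + 0.5) / n)
    B[0] *= np.sqrt(1.0 / n)
    B[1:] *= np.sqrt(2.0 / n)
    return B

# A "basis decomposition layer" as a linear layer with fixed weights:
# analysis (signal -> coefficients), synthesis (coefficients -> signal).
B = cosine_basis(64)
x = np.sin(np.linspace(0, 3, 64))   # toy input signal
coeffs = B @ x                      # analysis pass
x_hat = B.T @ coeffs                # synthesis: exact, since B is orthonormal
```

In a real net one would typically make only some of the layers fixed-basis, or make the basis parameters (e.g. continuous frequencies) learnable.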
Why might you want to do this? For one, it is a different lens through which to analyse neural nets’ mysterious success. For another, it gives you interpolation for free. There are possibly other reasons: perhaps the right basis gives you better priors for understanding a partial differential equation?

Functional regression
https://danmackinlay.name/notebook/functional_data.html — Thu, 28 May 2020
Contents: Regression using curves; Functional autoregression; References
Statistics where the samples are not just data points but whole curves and manifolds, or subsamples from them. Function approximation meets statistics, especially via the Karhunen-Loève expansion.
Regression using curves: To quote Jim Ramsay:
Functional data analysis, […] is about the analysis of information on curves or functions. For example, these twenty traces of the writing of “fda” are curves in two ways: first, as static traces on the page that you see after the writing is finished, and second, as two sets of functions of time, one for the horizontal “X” coordinate, and the other for the vertical “Y” coordinate.

Empirical estimation of information
https://danmackinlay.name/notebook/information_estimating.html — Tue, 28 Apr 2020
Contents: Histogram estimator; Parametric; Monte Carlo parametric; References
This is an empirical probability metric estimation problem, with especially cruel error properties. There are a few different versions of this problem corresponding to various different information quantities: mutual information between two variables, KL divergence between two distributions, information of one variable; discrete variables, continuous variables… In the mutual information case this is an independence test.
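To fix ideas, the crudest version is the plug-in histogram estimator of mutual information; a hypothetical sketch (its small-sample/fine-bin bias is exactly the cruelty alluded to above):

```python
import numpy as np

def mutual_info_hist(x, y, bins=16):
    """Naive plug-in estimate of I(X; Y) in nats from a 2-d histogram.

    Infamously biased for small samples or fine bins; shown here purely
    as the baseline that fancier estimators improve on.
    """
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)          # marginal of X
    py = pxy.sum(axis=0, keepdims=True)          # marginal of Y
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)
y_indep = rng.standard_normal(n)           # independent of x
y_dep = x + 0.5 * rng.standard_normal(n)   # strongly dependent on x
mi_indep = mutual_info_hist(x, y_indep)    # should be near zero
mi_dep = mutual_info_hist(x, y_dep)        # should be clearly positive
```

Used as an independence test, one thresholds the estimate (or a permutation-calibrated version of it) rather than trusting the raw number.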
Say I would like to know the mutual information of the laws of the processes generating two streams \(X,Y\) of observations, with weak assumptions on the laws of the generating process.

Mixture models for density estimation
https://danmackinlay.name/notebook/mixture_models.html — Fri, 24 Apr 2020
Contents: Moments of a mixture; Mixture zoo; “Classic mixtures”; Continuous mixtures; Bayesian Dirichlet mixtures; Non-affine mixtures; In Bayesian variational inference; Estimation/selection methods; (Local) maximum likelihood; Method of moments; Minimum distance; Regression smoothing formulation; Convergence and model selection; Large sample results for mixtures; Finite sample results for mixtures; Sieve method; Akaike Information Criterion; Quantization and coding theory; Minimum description length/BIC; Unsatisfactory thing: scale parameter selection theory; Connection to Mercer kernel methods; Miscellany; References
pyxelate uses mixture models to create pixel art colour palettes.

Learning Gamelan
https://danmackinlay.name/notebook/learning_gamelan.html — Mon, 06 Apr 2020
Contents: References
Attention conservation notice: crib notes for a 2-year-long project, ultimately abandoned in late 2018, about approximating convnets with recurrent neural networks for analysing time series. This project currently exists purely as LaTeX files on my hard drive, which need to be imported for future reference. I did learn some useful tricks along the way about controlling the poles of IIR filters for learning by gradient descent, and those will be actually interesting.

Bias reduction
https://danmackinlay.name/notebook/bias_reduction.html — Wed, 26 Feb 2020
Contents: References
Trying to reduce bias in point estimators by, e.g., the bootstrap. In, e.g., AIC we try to compensate for bias in model selection.
In bias reduction we try to eliminate it from our estimates.
This looks interesting: Kosmidis and Lunardon (2020)
The current work develops a novel method for the reduction of the asymptotic bias of M-estimators from general, unbiased estimating functions. We call the new estimation method reduced-bias M-estimation, or RBM-estimation in short.

(Reproducing) kernel tricks
https://danmackinlay.name/notebook/kernel_methods.html — Mon, 20 Jan 2020
Contents: Introductions; Kernel approximation; RKHS distribution embedding; Specific kernels; Non-scalar-valued “kernels”; References
WARNING: This is very old. If I were to write it now, I would write it differently. I might break kernel tricks apart from kernels, and I might wonder when we need a countable Mercer-style kernel decomposition and when we can do without.
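The trick in one line: any algorithm that touches data only through inner products can swap in a kernel. Kernel ridge regression is the standard toy example; a hypothetical numpy sketch (lengthscale and regulariser are illustrative choices):

```python
import numpy as np

def rbf_gram(a, b, lengthscale=0.2):
    """RBF kernel Gram matrix k(a_i, b_j) for 1-d inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def kernel_ridge_fit(x, y, lam=1e-3):
    """The kernel trick in its plainest form: ridge regression in an
    implicit feature space, using only the Gram matrix:
    solve (K + lam * I) alpha = y."""
    K = rbf_gram(x, x)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(60)
alpha = kernel_ridge_fit(x, y)
x_test = np.linspace(0.05, 0.95, 20)
y_pred = rbf_gram(x_test, x) @ alpha   # prediction = K_* @ alpha
```

Nothing here ever materialises the (here infinite-dimensional) feature map; the Gram matrix carries all of it.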
Kernel in the sense of the “kernel trick”, not to be confused with smoothing-type convolution kernels, nor with the dozens of related-but-slightly-different clashing definitions of kernel; those can have their own respective pages.

Sparse coding
https://danmackinlay.name/notebook/sparse_coding.html — Tue, 05 Nov 2019
Contents: Resources; Wavelet bases; Matching pursuits; Learnable codings; Codings with desired invariances; Misc; Implementations; References
Linear expansion with dictionaries of basis functions, with respect to which you wish your representation to be sparse; i.e. in the statistical case, basis-sparse regression. But even outside statistics, you may wish simply to approximate some data compactly. My focus here is on the noisy-observation case, although the same results are recycled enough throughout the field.

Discrete time Fourier and related transforms
https://danmackinlay.name/notebook/dtft.html — Thu, 17 Oct 2019
Contents: Chirp z-transform; Windowing the DTFT; Chromatic derivatives; References
Care and feeding of discrete Fourier transforms (DTFT), especially Fast Fourier Transforms, and other operators on discrete time series: complexity results, timings, algorithms, properties. These are useful in a vast number of applications, such as filter design, time series analysis, and various nifty optimisations of other algorithms.
Chirp z-transform: chirplets, a one-sided discrete Laplace transform related to damped-sinusoid representation.

Density estimation
https://danmackinlay.name/notebook/density_estimation.html — Wed, 16 Oct 2019
Contents: Divergence measures/contrasts; Minimising expected (or whatever) MISE; Connection to point processes; Spline/wavelet estimations; Mixture models; Gaussian processes; Renormalizing flow models; k-NN estimates; Kernel density estimators; Fancy ones; References
A statistical estimation problem where you are not trying to estimate a function of a distribution of random observations, but the distribution itself. In a sense, all of statistics implicitly does density estimation, but it is often instrumental in the course of discovering some actual parameter of interest.

The interpretation of densities as intensities and vice versa
https://danmackinlay.name/notebook/densities_and_intensities.html — Mon, 23 Sep 2019
Contents: Basis function method for density; Intensities; Basis function method for intensity; Count regression; Probability over boxes; References
Estimating densities by considering the observations as drawn from a point process. In one dimension this gives us the particularly lovely trick of survival analysis, but the method is much more general, if not quite as nifty.
Consider the problem of estimating the common density \(f(x)dx=dF(x)\) of indexed i…

Wacky regression
https://danmackinlay.name/notebook/wacky_regression.html — Thu, 02 May 2019
Contents: References
I used to maintain a list of regression methods that were almost nonparametric, but as fun as that category was, I was not actually using it, so I broke it apart into more conventional categories.
See bagging and boosting methods, neural networks, functional data analysis, Gaussian process regression and randomised regression.
Survival analysis and reliability
https://danmackinlay.name/notebook/survival_analysis.html — Tue, 12 Mar 2019
Contents: Estimating survival rates; Life table method; Nelson-Aalen estimates; Other reliability stuff; tools; References
Estimating survival rates. Here’s the set-up: looking at a data set of individuals’ lifespans, you would like to infer their distribution, i.e. analyse when people die, or things break, etc. The statistical problem of estimating how long people’s lives are is complicated somewhat by the particular structure of the data (loosely, “every person dies at most one time”), and there are certain characteristic difficulties that arise, such as right-censorship.

Inner product spaces
https://danmackinlay.name/notebook/hilbert_space.html — Mon, 11 Feb 2019
Contents: Normed spaces; Operators; Functionals; Inner product space; Classic inner product spaces (\(\ell_2\), \(L_2\)); Reproducing kernel Hilbert spaces; Projection operators
\(\renewcommand{\var}{\operatorname{Var}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\pd}{\partial} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mm}[1]{\mathrm{#1}} \renewcommand{\mmm}[1]{\mathrm{#1}} \renewcommand{\cc}[1]{\mathcal{#1}} \renewcommand{\oo}[1]{\operatorname{#1}} \renewcommand{\gvn}{\mid} \renewcommand{\II}[1]{\mathbb{I}\{#1\}} \renewcommand{\inner}[2]{\langle #1,#2\rangle} \renewcommand{\Inner}[2]{\left\langle #1,#2\right\rangle} \renewcommand{\norm}[1]{\| #1\|} \renewcommand{\Norm}[1]{\left\| #1\right\|} \renewcommand{\argmax}{\mathop{\mathrm{argmax}}} \renewcommand{\argmin}{\mathop{\mathrm{argmin}}} \renewcommand{\omp}{\mathop{\mathrm{OMP}}}\)
The most well-worn tool in the functional analysis kit. Let’s walk through the classic setup, as a refresher for my dusty brains.

Normed spaces
https://danmackinlay.name/notebook/banach_space.html — Fri, 04 Jan 2019
Contents: Vector space; Operators; Linear integral operators; Normed space
\(\renewcommand{\var}{\operatorname{Var}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\pd}{\partial} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mm}[1]{\mathrm{#1}} \renewcommand{\mmm}[1]{\mathrm{#1}} \renewcommand{\cc}[1]{\mathcal{#1}} \renewcommand{\oo}[1]{\operatorname{#1}} \renewcommand{\gvn}{\mid} \renewcommand{\II}[1]{\mathbb{I}\{#1\}} \renewcommand{\inner}[2]{\langle #1,#2\rangle} \renewcommand{\Inner}[2]{\left\langle #1,#2\right\rangle} \renewcommand{\norm}[1]{\| #1\|} \renewcommand{\Norm}[1]{\left\| #1\right\|} \renewcommand{\argmax}{\mathop{\mathrm{argmax}}} \renewcommand{\argmin}{\mathop{\mathrm{argmin}}} \renewcommand{\omp}{\mathop{\mathrm{OMP}}}\)
Vector space An vector \(V\) space over \(\bb{F}\in\{\bb{C},\bb{R}\}\) is a set of objects which satisfy the rules of vector arithmetic, e.g. for all vectors \(x,y\in V\), and all scalars \(\alpha,\beta\in\bb{F}\), we have \(\alpha x + \beta y\in V.Integral probability metricshttps://danmackinlay.name/notebook/integral_probability_metrics.htmlTue, 31 Oct 2017 00:22:11 +1100https://danmackinlay.name/notebook/integral_probability_metrics.htmlReferences The intersection of reproducing kernel methods, dependence tests and probability metrics; where you use a clever RKHS embedding to measure differences between probability distributions.
A mere placeholder for now.
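Still, the flagship example is easy to state: the (biased) empirical maximum mean discrepancy with an RBF kernel, sketched hypothetically in numpy:

```python
import numpy as np

def mmd2_biased(x, y, lengthscale=1.0):
    """Biased empirical squared MMD between 1-d samples x, y, RBF kernel.

    MMD^2 = E k(x,x') + E k(y,y') - 2 E k(x,y): zero iff the kernel mean
    embeddings coincide (for a characteristic kernel, iff the laws do).
    """
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-0.5 * d2 / lengthscale**2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2_biased(rng.normal(size=1000), rng.normal(size=1000))
diff = mmd2_biased(rng.normal(size=1000), rng.normal(loc=1.0, size=1000))
```

Samples from the same law give an MMD² near zero; a mean shift is picked up clearly. Calibration (what counts as “significantly” nonzero) is usually done by permutation.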
This abstract by Zoltán Szabó might serve to highlight some keywords.
Maximum mean discrepancy (MMD) and the Hilbert-Schmidt independence criterion (HSIC) are among the most popular and successful approaches in applied mathematics to measure the difference and the independence of random variables, respectively.

Kernel density estimators
https://danmackinlay.name/notebook/kde.html — Thu, 18 Aug 2016
Contents: Bandwidth/kernel selection in density estimation; Mixture models; Does this work with uncertain point locations?; Does this work with asymmetric kernels?; Fast Gauss Transform and fast multipole methods; References
A nonparametric method of approximating something from data by assuming that it’s close to the data distribution convolved with some kernel.
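That convolution view translates directly into code; a hypothetical numpy sketch of a fixed-bandwidth Gaussian KDE (the bandwidth is an arbitrary choice here, which is of course the whole problem):

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth=0.3):
    """Evaluate a Gaussian kernel density estimate on a grid: the
    empirical distribution convolved with a N(0, bandwidth^2) kernel."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    norm = len(data) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

rng = np.random.default_rng(0)
data = rng.standard_normal(5000)
grid = np.linspace(-4, 4, 81)   # spacing 0.1
dens = gaussian_kde(data, grid)
```

The estimate integrates to one by construction, and the convolution slightly inflates the variance of the target (by the squared bandwidth), which is the classic KDE bias.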
This is especially popular when the target is a probability density function; then you are working with a kernel density estimator.

Kernel approximation
https://danmackinlay.name/notebook/kernel_approximation_inversion.html — Wed, 27 Jul 2016
Contents: Do you need to even calculate the Gram matrix?; Stationary kernels; Inner product kernels; Lectures; Implementations; Connections; References
InkedIcon, on inversions in Hilbert space, after Charles Stross.
A page where I document what I don’t know about kernel approximation. A page about what I do know would be empty.
What I mean is: approximating implicit Mercer kernel feature maps with explicit features; equivalently, approximating the Gram matrix, which is also related to mixture model inference and clustering.

Function approximation and interpolation
https://danmackinlay.name/notebook/function_approximation.html — Thu, 09 Jun 2016
Contents: Choosing the best approximation; Polynomial spline smoothing of observations; Polynomial bases; Fourier bases; Radial basis function approximation; Rational approximation; References
On constructing an approximation of some arbitrary function, and measuring the badness thereof.
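Back to kernel approximation for a moment: the classic construction is random Fourier features, explicit random cosine features whose inner products approximate a stationary kernel in expectation. A hypothetical numpy sketch for 1-d inputs:

```python
import numpy as np

def rff_features(x, n_features=1000, lengthscale=1.0, seed=0):
    """Random Fourier features: z(x) with z(x).z(y) approximating the
    RBF kernel exp(-|x-y|^2 / (2 lengthscale^2)), Rahimi-Recht style."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / lengthscale, size=n_features)  # spectral draws
    b = rng.uniform(0, 2 * np.pi, size=n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(np.outer(x, w) + b)

x = np.linspace(-2, 2, 40)
Z = rff_features(x)                      # explicit (n_points, n_features)
K_approx = Z @ Z.T                       # approximates the Gram matrix
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
```

The payoff is that downstream algebra can run on the explicit \(n \times D\) feature matrix instead of the \(n \times n\) Gram matrix, at error roughly \(O(D^{-1/2})\).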
THIS IS CHAOS RIGHT NOW. I need to break out the sampling/interpolation problem for regular data, for one thing.
Choosing the best approximation: In what sense? Most compact? Easiest to code?

Deconvolution
https://danmackinlay.name/notebook/deconvolution.html — Mon, 11 Apr 2016
Contents: Vanilla deconvolution; Deconvolution method in statistics; References
I wish, for a project of my own, to know how to deconvolve with:
- high-dimensional data
- irregularly sampled data
- inhomogeneous (although known) convolution kernels

This is in a signal-processing setting; for the (closely related) kernel density estimation in a statistical setting, see kernel approximation. If you don’t know your noise spectrum, see blind deconvolution.
Vanilla deconvolution: Wiener filtering!
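For the vanilla case (known, homogeneous kernel, known noise level) the Wiener filter is a one-liner in the frequency domain. A hypothetical numpy sketch recovering spikes from a known Gaussian blur:

```python
import numpy as np

def wiener_deconvolve(y, h, noise_power, signal_power):
    """Deconvolve y = h * x + noise with the Wiener filter
    G = conj(H) / (|H|^2 + noise_power / signal_power),
    which trades inversion of H against noise amplification."""
    n = len(y)
    H = np.fft.fft(h, n)
    G = np.conj(H) / (np.abs(H) ** 2 + noise_power / signal_power)
    return np.real(np.fft.ifft(np.fft.fft(y) * G))

rng = np.random.default_rng(0)
n = 256
x = np.zeros(n)
x[60], x[150] = 1.0, -0.7                        # spikes to recover
j = np.arange(n)
h = np.exp(-0.5 * ((j - 5) / 2.0) ** 2)          # known Gaussian blur kernel
h /= h.sum()
y = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))  # circular blur
y = y + 0.002 * rng.standard_normal(n)           # additive observation noise
x_hat = wiener_deconvolve(y, h, noise_power=0.002**2, signal_power=1.0)
```

Where \(|H|\) is large the filter is essentially \(1/H\); where \(|H|\) sinks below the noise floor the filter rolls off instead of amplifying noise, so the spikes come back slightly band-limited rather than exact.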