optimization on Dan MacKinlay
https://danmackinlay.name/tags/optimization.html
Recent content in optimization on Dan MacKinlay.

Differentiable model selection
https://danmackinlay.name/notebook/model_selection_diff.html
Tue, 13 Apr 2021 15:56:46 +0800

References Maclaurin, Duvenaud, and Adams (2015):
Hyperparameter optimization by gradient descent
Each meta-iteration runs an entire training run of stochastic gradient descent to optimize elementary parameters (weights 1 and 2). Gradients of the validation loss with respect to hyperparameters are then computed by propagating gradients back through the elementary training iterations. Hyperparameters (in this case, learning rate and momentum schedules) are then updated in the direction of this hypergradient.

Probability divergences
https://danmackinlay.name/notebook/probability_metrics.html
Fri, 26 Mar 2021 08:39:15 +1100

Overview Norms with respect to Lebesgue measure on the state space Relative distributions \(\phi\)-divergences Kullback-Leibler divergence Total variation distance Hellinger divergence \(\alpha\)-divergence \(\chi^2\) divergence Hellinger inequalities Pinsker inequalities Integral probability metrics Wasserstein distance(s) Bounded Lipschitz distance Fisher distances Others Induced topologies To read References Allison Chaney
Quantifying the difference between probability measures; measuring the distribution itself, e.g. the badness of approximation of a statistical fit.

Generically approximating probability distributions
https://danmackinlay.name/notebook/approximating_dists.html
Mon, 22 Mar 2021 14:20:29 +1100

Stein’s method References There are various approximations we might use for a probability distribution: empirical CDFs, kernel density estimates, variational approximations, Edgeworth expansions, Laplace approximations…
From each of these we might get close in some metric to the desired target.
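To make “close in some metric” concrete, here is a pure-Python sketch (function names are mine) of the first of those approximations, an empirical CDF, closing in on a uniform target in sup-norm distance as the sample grows:

```python
import bisect
import random

def ecdf(samples):
    """Return the empirical CDF of a sample as a right-continuous step function."""
    xs = sorted(samples)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n

def sup_distance(F, G, grid):
    """Kolmogorov-Smirnov-style sup distance between two CDFs, approximated on a grid."""
    return max(abs(F(t) - G(t)) for t in grid)

# Target: Uniform(0, 1), whose CDF is t on [0, 1].
random.seed(0)
true_cdf = lambda t: min(max(t, 0.0), 1.0)
grid = [i / 200 for i in range(201)]

# The ECDF gets closer to the target (in sup norm) as the sample grows.
d_small = sup_distance(ecdf([random.random() for _ in range(50)]), true_cdf, grid)
d_large = sup_distance(ecdf([random.random() for _ in range(5000)]), true_cdf, grid)
print(d_small, d_large)
```

The same experiment works for any of the approximation families above once you can evaluate their CDFs.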
This is a broad topic which I cannot hope to cover in full generality. Special cases of interest include
statements about where the probability mass is with high probability (concentration theorems), and statements about the asymptotic distributions of variables, eventually approaching some distribution as some parameter goes to infinity (limit theorems).

Stein’s method
https://danmackinlay.name/notebook/steins_method.html
Mon, 22 Mar 2021 14:20:29 +1100

Non-Gaussian Stein Multivariate Stein References A famous generic method for approximating distributions is Stein’s method of exchangeable pairs (Stein 1986, 1972). Wikipedia is good on this.
Meckes (2009) summarises.
Heuristically, the univariate method of exchangeable pairs goes as follows. Let \(W\) be a random variable conjectured to be approximately Gaussian; assume that \(\mathbb{E} W=0\) and \(\mathbb{E} W^{2}=1\). From \(W\), construct a new random variable \(W^{\prime}\) such that the pair \(\left(W, W^{\prime}\right)\) has the same distribution as \(\left(W^{\prime}, W\right)\).

Optimal transport inference
https://danmackinlay.name/notebook/optimal_transport_inference.html
Tue, 16 Mar 2021 11:04:49 +1100

Tools References Doing inference where the probability metric is an optimal-transport one. Usually intractable, but desirable when we can get it.
Wasserstein GANs are argued to approximate this.
See e.g. (J. H. Huggins et al. 2018b, 2018a) for a particular Bayes posterior approximation to this.
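For intuition about why the one-dimensional case at least is tractable: between equal-size empirical measures on \(\mathbb{R}\), the optimal coupling simply matches order statistics. A pure-Python sketch (function name is mine, not from any of the tools below):

```python
import random

def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size empirical measures on R.

    In 1-D the optimal coupling sorts both samples and matches them in order,
    so the distance is an average over matched order statistics.
    """
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    n = len(xs)
    return (sum(abs(x - y) ** p for x, y in zip(xs, ys)) / n) ** (1 / p)

# The distance between a sample and a shifted copy of itself is the shift.
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [x + 2.0 for x in xs]
print(wasserstein_1d(xs, ys))
```

In higher dimensions no such sorting trick exists, which is why the solvers below exist.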
Tools POT: Python Optimal Transport Optimal Transport Tools (accelerated jax-based sibling to POT) References Agueh, Martial, and Guillaume Carlier.

Optimal transport metrics
https://danmackinlay.name/notebook/optimal_transport_metrics.html
Tue, 16 Mar 2021 10:53:23 +1100

Analytic expressions Gaussian Kantorovich-Rubinstein duality “Neural Net distance” Fisher distance Sinkhorn divergence Awaiting filing Recommended introductions Use in inference References I presume there are other uses for optimal transport distances apart from as probability metrics, but so far I only care about them in that context, so this will be skewed that way.
Encyclopedia of Math says:
Let \((M,d)\) be a metric space for which every probability measure on \(M\) is a Radon measure.

ML scaling laws in the massive model limit
https://danmackinlay.name/notebook/ml_scaling_large_models.html
Mon, 15 Mar 2021 17:36:40 +1100

Big transformers Bitter lesson Misc References Brief links on the theme of scaling in the extremely large model/large data limit.
Big transformers One fun result comes from Transformer language models. Possibly a fruitful new front in the complexity of statistics. An interesting observation way back in 2020 was that there seemed to be an unexpected trade-off whereby you can go faster by training a bigger network.

Neural nets with implicit layers
https://danmackinlay.name/notebook/nn_implicit.html
Mon, 15 Mar 2021 12:16:50 +1100

References A unifying framework for various networks, including neural ODEs, in which layers are not simple forward operations but are instead evaluated by solving some optimisation problem.
For some info see the NeurIPS 2020 tutorial, Deep Implicit Layers - Neural ODEs, Deep Equilibrium Models, and Beyond, by Zico Kolter, David Duvenaud, and Matt Johnson.
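A minimal scalar sketch of the idea (the toy fixed-point equation and all names are mine): the layer's output is defined implicitly by an equation, and its gradient comes from the implicit function theorem rather than from backpropagating through the solver's iterations.

```python
import math

def implicit_layer(x, w=0.5, tol=1e-12):
    """A toy implicit layer: the output z solves the fixed-point equation
    z = tanh(w * z + x), found here by plain fixed-point iteration."""
    z = 0.0
    for _ in range(200):
        z_next = math.tanh(w * z + x)
        if abs(z_next - z) < tol:
            break
        z = z_next
    return z

def implicit_grad(x, w=0.5):
    """dz/dx via the implicit function theorem: differentiating the fixed-point
    condition gives dz = s * (w * dz + dx) with s = sech^2(w*z + x), hence
    dz/dx = s / (1 - w * s). No backprop through the iterations needed."""
    z = implicit_layer(x, w)
    s = 1.0 - math.tanh(w * z + x) ** 2
    return s / (1.0 - w * s)

# Check against a central finite difference.
x0, h = 0.3, 1e-6
fd = (implicit_layer(x0 + h) - implicit_layer(x0 - h)) / (2 * h)
print(implicit_grad(x0), fd)
```

The memory saving is the point: the backward pass needs only the converged \(z\), not the whole solver trajectory.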
NB: This is different to the implicit representation method. Since implicit layers and implicit representation layers also occur in the same problems (such as ML PDEs), this terminological confusion will haunt us.

Neural nets with basis decomposition layers
https://danmackinlay.name/notebook/nn_basis.html
Tue, 09 Mar 2021 12:06:42 +1100

Neural networks with continuous basis functions Convolutional neural networks as sparse coding References Neural networks incorporating basis decompositions.
Why might you want to do this? For one, it is a different lens through which to analyze neural nets’ mysterious success. For another, it gives you interpolation for free. There are possibly other reasons: perhaps the right basis gives you better priors for understanding a partial differential equation?

Reparameterization tricks in inference
https://danmackinlay.name/notebook/reparameterization_trick.html
Mon, 08 Mar 2021 18:07:53 +1100

For variational autoencoders “Normalized” flows For density estimation Representational power of Tutorials References Approximating the desired distribution by perturbation of the available distribution
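A minimal sketch of that perturbation idea for a Gaussian (pure Python, toy objective of my own choosing; assuming only the standard location-scale form \(x = \mu + \sigma\varepsilon\)): writing samples as deterministic transforms of parameter-free noise lets gradients pass through the distribution's parameters.

```python
import random

def grad_mu_estimate(mu, sigma=1.0, n=100_000):
    """Monte Carlo estimate of d/dmu E[f(X)] for X ~ N(mu, sigma^2) and
    f(x) = x^2, using the reparameterization X = mu + sigma * eps with
    eps ~ N(0, 1), so d/dmu f(mu + sigma*eps) = f'(mu + sigma*eps) = 2x."""
    total = 0.0
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)      # parameter-free noise
        x = mu + sigma * eps              # the "change of variables"
        total += 2.0 * x                  # pathwise derivative w.r.t. mu
    return total / n

# Since E[X^2] = mu^2 + sigma^2, the true gradient is 2 * mu.
random.seed(0)
est = grad_mu_estimate(1.5)
print(est)
```

The same pattern, with a richer transform, is what variational autoencoders differentiate through.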
A trick in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “fancy change of variables so that I can differentiate through the parameters of a distribution”. Connections to optimal transport and likelihood-free inference, in that this trick can enable some clever approximate-likelihood approaches.

Automatic differentiation
https://danmackinlay.name/notebook/autodiff.html
Mon, 08 Mar 2021 11:54:00 +1100

Application to backpropagation Computational complexity Forward- versus reverse-mode Symbolic differentiation Misc Software jax Tensorflow Pytorch Julia Aesara taichi Classic python autograd Micrograd Enzyme Theano Casadi ADOL ad ceres solver audi algopy References Gradient field in python
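The forward-mode flavour fits in a few lines; a toy dual-number sketch in pure Python (class and function names are mine, nothing to do with the libraries listed above):

```python
class Dual:
    """A number a + b*eps with eps^2 = 0; the eps coefficient carries
    the derivative through ordinary arithmetic automatically."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # the product rule, encoded once and reused by every expression
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f at x + eps and read off the derivative coefficient."""
    return f(Dual(x, 1.0)).dot

# d/dx (3x^2 + 2x) at x = 2 is 6x + 2 = 14 -- no finite differences involved.
print(derivative(lambda x: 3 * x * x + 2 * x, 2.0))
```

Reverse mode, which the deep learning libraries favour, instead records the computation and replays it backwards, which is cheaper when there are many inputs and few outputs.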
Getting your computer to tell you the gradient of a function, without resorting to finite difference approximation, or coding an analytic derivative by hand. We usually mean this in the sense of automatic forward- or reverse-mode differentiation, which is not, as such, a symbolic technique, but symbolic differentiation gets an incidental look-in, and these ideas do of course relate.

Hyperparameter optimization in ML
https://danmackinlay.name/notebook/hyperparam_opt.html
Mon, 01 Mar 2021 09:10:44 +1100

Bayesian/surrogate optimisation Differentiable hyperparameter optimisation Random search Adaptive random search Implementations Determined Ray Optuna hyperopt auto-sklearn skopt spearmint SMAC AutoML References Split off from autoML.
The art of choosing the best hyperparameters for your ML model’s algorithms, of which there may be many.
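Random search, the usual baseline, is almost trivial to implement; a pure-Python sketch against a made-up two-hyperparameter validation loss (the objective and the ranges here are purely illustrative):

```python
import random

def validation_loss(lr, reg):
    """A stand-in for an expensive train-and-validate run; purely illustrative."""
    return (lr - 0.01) ** 2 * 1e4 + (reg - 0.1) ** 2 * 1e2

def random_search(n_trials=200, seed=0):
    """Sample hyperparameters log-uniformly from plausible ranges, keep the best."""
    rng = random.Random(seed)
    best = (float("inf"), None)
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)     # learning rate in [1e-4, 1]
        reg = 10 ** rng.uniform(-3, 1)    # regularisation strength in [1e-3, 10]
        loss = validation_loss(lr, reg)
        if loss < best[0]:
            best = (loss, (lr, reg))
    return best

loss, (lr, reg) = random_search()
print(loss, lr, reg)
```

Every trial is independent, so this parallelises embarrassingly well, which is a large part of its practical appeal.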
Should you bother getting fancy about this? Ben Recht argues no: random search is competitive with highly tuned Bayesian methods in hyperparameter tuning.

Jax
https://danmackinlay.name/notebook/jax.html
Thu, 18 Feb 2021 07:58:37 +1100

Idioms Deep learning frameworks Haiku Flax Probabilistic programming frameworks Numpyro Stheno graph networks jax (python) is a successor to classic python/numpy autograd. It includes various code transformations: JIT compilation, differentiation and vectorization.
So, a numerical library with certain high performance machine-learning affordances. Note, it is not a deep learning framework per se, but rather the producer species at the lowest trophic level of a deep learning ecosystem.

Random-forest-like methods
https://danmackinlay.name/notebook/boosting_bagging.html
Thu, 11 Feb 2021 08:29:09 +1100

Random trees, forests, jungles Self-regularising properties Gradient boosting Bayes Implementations surfin xgboost catboost bartmachine References Doubling down on ensemble methods; mixing predictions from many weak learners (in this case decision trees) to get strong learners. “A selection of randomly stopped clocks is never far from wrong.”
There are many flavours of random-forest-like learning systems. The rule of thumb seems to be “Fast to train, fast to use.”

Wiener-Khintchine representation
https://danmackinlay.name/notebook/wiener_khintchine.html
Sun, 03 Jan 2021 15:47:35 +1100

Wiener theorem: Deterministic case Wiener-Khinchine theorem: Spectral density of covariance kernels Bochner’s Theorem: stationary spectral kernels Yaglom’s theorem References \[ \renewcommand{\lt}{<} \renewcommand{\gt}{>} \renewcommand{\var}{\operatorname{Var}} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\pd}{\partial} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mmm}[1]{\mathrm{#1}} \renewcommand{\cc}[1]{\mathcal{#1}} \renewcommand{\ff}[1]{\mathfrak{#1}} \renewcommand{\oo}[1]{\operatorname{#1}} \renewcommand{\gvn}{\mid} \]
Consider a real-valued stochastic process \(\{X_{\vv{t}}\}_{\vv{t}\in\mathcal{T}}\) over an index (metric) space \(\mathcal{T}\), i.e. a realisation of such a process is a function \(\mathcal{T}\to\mathbb{R}\). For the sake of concreteness we will take \(\mathcal{T}=\mathbb{R}^{d}\) here.

Distribution regression
https://danmackinlay.name/notebook/distribution_regression.html
Tue, 01 Dec 2020 08:31:58 +1100

References Poczos et al. (2013):
‘Distribution regression’ refers to the situation where a response \(Y\) depends on a covariate \(P\) where \(P\) is a probability distribution. The model is \(Y=f(P)+\mu\) where \(f\) is an unknown regression function and \(\mu\) is a random error. Typically, we do not observe \(P\) directly, but rather, we observe a sample from \(P\).
References Bachoc, F., F.

Recommender systems
https://danmackinlay.name/notebook/recommender_systems.html
Mon, 30 Nov 2020 14:55:18 +1100

References Not my area, but I need a landing page to refer to for some non-specialist contacts of mine.
I am most familiar with the matrix factorization approaches (e.g. factorization machines, NNMF) but there are many, e.g. variational autoencoder approaches are en vogue.
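The matrix-factorization idea in miniature (a pure-Python sketch on made-up data, no claim about any production system): learn low-dimensional user and item vectors whose dot products reproduce the observed ratings.

```python
import random

# (user, item, rating) triples; everything here is toy data.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 2.0)]
n_users, n_items, k = 3, 3, 2        # k = latent dimension

rng = random.Random(0)
U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    return sum(U[u][f] * V[i][f] for f in range(k))

def sse():
    return sum((r - predict(u, i)) ** 2 for u, i, r in ratings)

before = sse()
lr, reg = 0.05, 0.01
for _ in range(500):                 # plain SGD on regularised squared error
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(k):
            U[u][f], V[i][f] = (U[u][f] + lr * (err * V[i][f] - reg * U[u][f]),
                                V[i][f] + lr * (err * U[u][f] - reg * V[i][f]))
after = sse()
print(before, after)
```

Unobserved (user, item) pairs then get predictions from the same dot products, which is the whole point.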
An overview by Javier lists many approaches.
Most Popular recommendations (the baseline) Item-User similarity based recommendations kNN Collaborative Filtering recommendations GBM based recommendations Non-Negative Matrix Factorization recommendations Factorization Machines (Steffen Rendle 2010) Field Aware Factorization Machines (Yuchin Juan, et al, 2016) Deep Learning based recommendations (Wide and Deep, Heng-Tze Cheng, et al, 2016) Neural Collaborative Filtering (Xiangnan He et al.

Variational inference by message-passing in graphical models
https://danmackinlay.name/notebook/message_passing.html
Wed, 25 Nov 2020 17:42:32 +1100

References Variational inference where the model factorizes over some graphical independence structure, which means we get cheap and distributed inference. I am currently particularly interested in this for latent GP models. Many things can be expressed as message-passing algorithms. The grandparent idea in this unification seems to be “Belief propagation”, a.k.a. “sum-product message-passing”, credited to Pearl (1982) for DAGs and then generalised to MRFs, PGMs, factor graphs etc.

Tensorflow
https://danmackinlay.name/notebook/tensorflow.html
Tue, 10 Nov 2020 14:20:46 +1100

Abstractions Tutorials Debugging Tensorboard Getting data in (Non-recurrent) convolutional networks Recurrent networks Official documentation Community guides Keras: The recommended way of using tensorflow Getting models out Training in the cloud because you don’t have NVIDIA sponsorship Extending Misc HOWTOs Nightly builds Dynamic graphs GPU selection Silencing tensorflow Hessians and higher order optimisation Manage tensorflow environments Optimisation tricks Probabilistic networks A C++/Python/etc neural network toolkit by Google.

ELBO
https://danmackinlay.name/notebook/elbo.html
Wed, 28 Oct 2020 10:59:07 +1100

References \(\renewcommand{\Ex}{\mathbb{E}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\kl}{\operatorname{KL}} \renewcommand{\H}{\mathbb{H}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\pd}{\partial}\)
On using the most convenient probability metric (i.e. KL divergence) to do variational inference.
There is nothing novel here. But everyone who is doing variational inference has to work through this just once, and I’m doing so here.
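For reference, the standard decomposition in question (in the macros above): for any density \(q(\vv{z})\),

\[
\log p(\vv{x}) = \Ex_{q(\vv{z})}\left[\log \frac{p(\vv{x}, \vv{z})}{q(\vv{z})}\right] + \kl\left[q(\vv{z}) \,\|\, p(\vv{z}\mid\vv{x})\right].
\]

Since the KL term is non-negative, the first term (the ELBO) lower-bounds the log evidence, and maximising it over \(q\) is the same as minimising the KL divergence from \(q\) to the true posterior.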
Yuge Shi’s introduction is the best short intro that gets to state-of-the-art. The canonical intro is de Garis Matthews (2017) who did a thesis on it.

Gradient descent
https://danmackinlay.name/notebook/gd_1st_order.html
Tue, 27 Oct 2020 07:06:36 +1100

Coordinate descent Accelerated Continuous approximations of iterations Online versus stochastic Conditional Gradient References Gradient descent, a classic first-order optimisation method, with many variants, and many things one might wish to understand.
There are only a few things I wish to understand for the moment:
Coordinate descent Descend each coordinate individually.
Small clever hack for certain domains: log gradient descent.
Accelerated How and when does it work?

Efficient factoring of GP likelihoods
https://danmackinlay.name/notebook/gp_factoring.html
Mon, 26 Oct 2020 12:46:34 +1100

Basic sparsity via inducing variables SVI for Gaussian processes Latent Gaussian Process models References There are many ways to cleverly slice up GP likelihoods so that inference is cheap.
This page is about some of them, especially the union of sparse and variational tricks. Scalable Gaussian process regressions choose cunning factorisations such that the model collapses down to a lower-dimensional thing than it might have seemed to need, at least approximately.

Differentiating through the Gamma
https://danmackinlay.name/notebook/gamma_diff.html
Thu, 15 Oct 2020 10:50:59 +1100

References Suppose I want to find a distributional gradient for a gamma process. Generically I would find this via Monte Carlo gradient estimation.
Here is a problem-specific method:
I allow the latent random state to have more dimensions than a univariate. Let’s get specific. An example arises if we raid the random-variate-generation literature for transform methods to generate random variates and differentiate them: a Gamma variate can be generated from a transformed normal and a uniform random variable, or two uniforms, depending on the parameter range.

Automatic design of experiments
https://danmackinlay.name/notebook/design_of_experiments.html
Tue, 13 Oct 2020 17:50:06 +1100

Problem statement Acquisition functions Connection to RL Implementation skopt Dragonfly PySOT GPyOpt Sigopt BoTorch/Ax spearmint SMAC References Closely related is AutoML, in that surrogate optimisation is a popular tool for it.
Problem statement According to Gilles Louppe and Manoj Kumar:
We are interested in solving
\[x^* = \arg \min_x f(x)\]
under the constraints that
\(f\) is a black box for which no closed form is known (nor are its gradients); \(f\) is expensive to evaluate; evaluations of \(y=f(x)\) may be noisy.

AutoML
https://danmackinlay.name/notebook/automl.html
Fri, 02 Oct 2020 06:29:47 +1000

Reinforcement learning approaches Differentiable architecture search Implementations auto-sklearn References The sub-field of optimisation that specifically aims to automate model selection in machine learning (and also, occasionally, ensemble construction).
There are two major approaches here that I am aware of, both of which are related in a kind of abstract way, but which are in practice different
Finding the right architecture for your neural net, a.

Data summarization
https://danmackinlay.name/notebook/data_summarization.html
Fri, 18 Sep 2020 06:21:41 +1000

Coresets representative subsets Directly approximate log likelihood References Summary statistics which don’t require you to keep all the data but which allow you to do inference nearly as well. E.g. sufficient statistics in exponential families allow you to do certain kinds of inference perfectly without anything except summaries. Methods such as variational Bayes summarize data by maintaining a posterior density (usually a mixture model) as a summary of all the data, at some cost in accuracy.

Variational autoencoders
https://danmackinlay.name/notebook/variational_autoencoders.html
Thu, 10 Sep 2020 13:17:16 +1000

References A variational autoencoder uses a limited latent distribution to approximate a complex posterior distribution.
A trick in e.g. variational inference/probabilistic neural nets where we presume that the model is generated by a low-dimensional latent space, which is, if you squint at it, kind of the information bottleneck trick but in a probabilistic setting. To my mind it is a sorta-kinda nonparametric approximate Bayes method.

Online learning
https://danmackinlay.name/notebook/online_learning.html
Wed, 26 Aug 2020 16:48:40 +1000

Mirror descent Follow-the-regularized leader Parameter-free Covariance References An online learning perspective gives bounds on the regret: the gap in performance between online estimation and the optimal estimator when we have access to the entire data.
A lot of things are sort-of online learning; stochastic gradient descent, for example, is closely related. However, if you meet someone who claims to study “online learning” they usually mean to emphasise particular things.

(Outlier) robust statistics
https://danmackinlay.name/notebook/robust_statistics.html
Tue, 14 Jul 2020 07:10:31 +1000

TODO Corruption models M-estimation with robust loss MM-estimation Median-based estimators Others References There are also robust estimators in econometrics; there it means something about good behaviour under heteroskedastic and/or correlated error. Robust Bayes means something about inference that is robust to the choice of prior (which could overlap, but has a rather different emphasis).
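M-estimation with a robust loss in miniature (pure-Python sketch, toy data of my own): estimate a location by gradient descent on the Huber loss, which is quadratic near zero and linear in the tails, so a gross outlier barely moves the estimate, unlike the mean.

```python
def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss in the residual: clipped to +-delta."""
    return r if abs(r) <= delta else delta * (1 if r > 0 else -1)

def huber_location(xs, lr=0.1, steps=2000):
    """Gradient descent for argmin_m sum_i huber(x_i - m)."""
    m = 0.0
    for _ in range(steps):
        m += lr * sum(huber_grad(x - m) for x in xs) / len(xs)
    return m

data = [9.8, 10.1, 10.0, 9.9, 10.2, 1000.0]   # one wild outlier
mean = sum(data) / len(data)                   # dragged far from the bulk
robust = huber_location(data)                  # stays near 10
print(mean, robust)
```

The clipping is exactly the "bounded influence" property the robustness literature asks of an estimator.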
Outlier robustness is AFAICT more-or-less a frequentist project. Bayesian approaches seem to achieve robustness largely by choosing heavy-tailed priors or heavy-tailed noise distributions where they might have chosen light-tailed ones, e.

Automatic differentiation in Julia
https://danmackinlay.name/notebook/julia_autodiff.html
Fri, 05 Jun 2020 12:22:53 +1000

Julia has an embarrassment of different methods of automatic differentiation (homoiconicity and introspection make this comparatively easy) and it’s not always clear what the comparative selling points of each are.
The juliadiff project produces ForwardDiff.jl and ReverseDiff.jl which do what I would expect, namely autodiff in forward and reverse mode respectively. ForwardDiff claims to be advanced. ReverseDiff works but is abandoned.
ForwardDiff implements methods to take derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions.

Voice fakes
https://danmackinlay.name/notebook/voice_fakes.html
Wed, 27 May 2020 20:42:58 +1000

Style transfer Text to speech References A placeholder. Generating speech, without a speaker, or possibly style transferring speech.
Style transfer You have a recording of me saying something self-incriminating. You would prefer it to be a recording of Hillary Clinton saying something incriminating. This is achievable.
There has been a tendency for the open source ones to be fairly mediocre, while the pay-to-play options leave provocative demos about but do not let you use them.

Variational inference
https://danmackinlay.name/notebook/variational_inference.html
Sun, 24 May 2020 12:04:18 +1000

Introduction Philosophical interpretations In graphical models Inference via KL divergence Mixture models Reparameterization trick Autoencoders Loss functions References Approximating the intractable measure (right) with a transformation of a tractable one (left)
Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that allows us to trade off difficulty for fidelity in some useful way.

Matrix calculus
https://danmackinlay.name/notebook/matrix_calculus.html
Tue, 19 May 2020 12:00:06 +1000

Matrix differentials Indexed tensor calculus References We can generalise high school calculus, which is about scalar functions of a scalar argument, in various ways, to handle matrix-valued functions or matrix-valued arguments. One could generalise this further, to full tensor calculus. But it happens that specifically matrix/vector operations are at a useful point of complexity for lots of algorithms, kind of an MVP. (I usually want this for higher order gradient descent.

Evolution
https://danmackinlay.name/notebook/evolution.html
Mon, 27 Apr 2020 21:44:18 +1000

Connection with mathematical optimisation Evolution of cooperation Extended Evolutionary Synthesis To read References Ruben Bolling, How to draw Doug
Biological adaptation by mutation, natural selection and random diffusion.
Analogies and disanalogies between tooth-and-claw natural selection, and our statistical learning methods. Stochastic process models of gene and population dynamics.
See also geometry of fitness landscapes, cooperation.
I would like to better understand:
Sundry stochastic models of allele diffusion.

Gradient descent, first-order, stochastic
https://danmackinlay.name/notebook/gd_1st_order_stochastic.html
Fri, 07 Feb 2020 12:27:58 +1100

Variance-reduced Normalized Sundry Hacks References Stochastic optimization uses noisy (possibly approximate) 1st-order gradient information to find the argument which minimises
\[ x^*=\operatorname{argmin}_{x} f(x) \]
for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).
That this works with little fuss in very high dimensions is a major pillar of deep learning.
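A minimal sketch (pure Python, toy objective of my own choosing): minimise \(f(x)=\mathbb{E}[(x-Z)^2]/2\) from noisy draws of \(Z\), using one single-sample gradient per step and Robbins-Monro step sizes \(a_t = 1/t\).

```python
import random

def sgd_mean(samples, x0=0.0):
    """Minimise f(x) = E[(x - Z)^2] / 2 with one noisy gradient (x - z) per
    step and Robbins-Monro step sizes a_t = 1/t (divergent sum, summable
    squares). The minimiser is E[Z], so this reduces to a running average."""
    x = x0
    for t, z in enumerate(samples, start=1):
        x -= (1.0 / t) * (x - z)   # single-sample gradient of f at x
    return x

random.seed(0)
true_mean = 3.0
samples = [random.gauss(true_mean, 1.0) for _ in range(20_000)]
print(sgd_mean(samples))
```

Deep learning practice swaps the \(1/t\) schedule for constant or hand-tuned rates, but the noisy-gradient skeleton is the same.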
The original version, given in terms of root finding, is (Herbert Robbins and Monro 1951), who later generalised the analysis in (H.

Gradient descent, Newton-like, stochastic
https://danmackinlay.name/notebook/gd_2nd_order_stochastic.html
Thu, 23 Jan 2020 10:35:22 +1100

Subsampling General case References Stochastic Newton-type optimization, unlike deterministic Newton optimisation, uses noisy (possibly approximate) 2nd-order gradient information to find the argument which minimises
\[ x^*=\operatorname{argmin}_{x} f(x) \]
for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).
Subsampling Most of the good tricks here are set up for ML-style training losses where the bottleneck is summing a large number of loss functions.
LiSSA attempts to make 2nd-order gradient descent methods scale to large parameter sets (Agarwal, Bullins, and Hazan 2016):

Phase retrieval
https://danmackinlay.name/notebook/phase_retrieval.html
Thu, 07 Nov 2019 18:28:49 +0100

References You know the power of the signal; what is the phase? Griffin-Lim algorithm, Wirtinger flow methods based on Wirtinger calculus, Phase-gradient heap integration (Průša and Søndergaard 2016).
Diagram from TiFGAN (Marafioti et al. 2019) via CJ.
TODO: investigate Yue M. Lu’s work on phase retrieval as an important example in a large class of somewhat-analytically-understood nonconvex problems, starting from his recent slide deck on that theme.

Sparse coding
https://danmackinlay.name/notebook/sparse_coding.html
Tue, 05 Nov 2019 16:28:28 +0100

Resources Wavelet bases Matching Pursuits Learnable codings Codings with desired invariances Misc Implementations References Linear expansion with dictionaries of basis functions, with respect to which you wish your representation to be sparse; i.e. in the statistical case, basis-sparse regression. But even outside statistics, you may wish simply to approximate some data compactly. My focus here is on the noisy-observation case, although the same results are recycled enough throughout the field.

Optimal control
https://danmackinlay.name/notebook/optimal_control.html
Fri, 01 Nov 2019 12:58:54 +1100

Nuts and bolts Online References Nothing to see here; I don’t do optimal control. But here are some notes for when I thought I might.
Feedback Systems: An Introduction for Scientists and Engineers by Karl J. Åström and Richard M. Murray is an interesting control systems theory course from Caltech.
The online control blog post mentioned below has a summary:
Perhaps the most fundamental setting in control theory is an LDS with quadratic costs \(c_t\) and i.

Gradient descent, Higher order
https://danmackinlay.name/notebook/gd_3rd_order.html
Sat, 26 Oct 2019 16:56:24 +0800

References Newton-type optimization uses 2nd-order gradient information (i.e. a Hessian matrix) to solve optimization problems. Higher-order optimisation uses 3rd-order gradients and so on. They are elegant for univariate functions.
This is rarely done in problems that I face because
3rd-order derivatives of multivariate objectives are usually too big in time and space complexity to be tractable. They are not (simply) expressible as matrices, so they can benefit from a little tensor theory.

Statistical learning theory for time series
https://danmackinlay.name/notebook/learning_theory_time_series.html
Tue, 01 Oct 2019 16:20:07 +1000

References Statistical learning theory for dependent data such as time series, and possibly other dependency structures. But I only know about results for time series.
Non-stationary, non-asymptotic bounds please. Keywords: Ergodic, α-, β-mixing.
Mohri and Kuznetsov have done lots of work here; see, e.g., their NIPS2016 tutorial. There seem to be a lot of types of ergodic/mixing results, about which I know as yet nothing.

Wirtinger calculus
https://danmackinlay.name/notebook/wirtinger_calculus.html
Tue, 10 Sep 2019 10:02:46 +1000

References How do you differentiate real-valued functions of complex arguments? Wirtinger calculus. This is a ridiculous hack that happens to work well for signal processing over the complex field, especially in optimisation. It arises naturally in, for example, phase retrieval (Zhang and Liang 2016; Candes, Li, and Soltanolkotabi 2015; Chen and Candès 2015; Seuret and Gouaisbaut 2013). Because of its area of popularity, this will almost surely arise in combination with matrix calculus.

Large sample theory
https://danmackinlay.name/notebook/large_sample_theory.html
Mon, 09 Sep 2019 12:52:51 +1000

Fisher Information Convolution Theorem References Delta methods, influence functions, and so on. Convolution theorems, local asymptotic minimax theorems.
A convenient feature of M-estimation, and especially maximum likelihood estimation, is the simple behaviour of estimators in the asymptotic large-sample-size limit, which can give you, e.g., variance estimates, or motivate information criteria, robust statistics, optimisation, etc.
In the most celebrated and convenient cases, asymptotic bounds are about normally-distributed errors, and these are typically derived through Local Asymptotic Normality theorems.

Gradient descent, Newton-like
https://danmackinlay.name/notebook/gd_2nd_order.html
Tue, 03 Sep 2019 14:06:19 +1000

Vanilla Newton methods Quasi Newton methods Hessian free Natural gradient descent Stochastic References Newton-type optimization, unlike basic gradient descent, uses (possibly approximate) 2nd-order gradient information to find the argument which minimises
\[ x^*=\operatorname{argmin}_{x} f(x) \]
for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).
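In one dimension the idea is compact; a pure-Python sketch (toy objective mine) of the Newton update \(x \leftarrow x - f'(x)/f''(x)\):

```python
def newton_minimise(df, d2f, x0, steps=20):
    """Newton's method for minimisation: repeatedly jump to the minimum of
    the local quadratic model, x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(steps):
        x -= df(x) / d2f(x)
    return x

# Minimise f(x) = x^4/4 - 2x^2 + x, so f'(x) = x^3 - 4x + 1, f''(x) = 3x^2 - 4.
df = lambda x: x ** 3 - 4 * x + 1
d2f = lambda x: 3 * x ** 2 - 4
x_star = newton_minimise(df, d2f, x0=2.0)
print(x_star, df(x_star))  # a local minimiser: f'(x_star) ~ 0, f''(x_star) > 0
```

Started close enough, convergence is quadratic; the practical difficulty in high dimensions is forming and inverting the Hessian, which is what the quasi-Newton and Hessian-free variants above work around.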
Optimization over arbitrary functions typically gets discussed in terms of line-search and trust-region methods, both of which can be construed, AFAICT, as second-order methods.

Semidefinite programming
https://danmackinlay.name/notebook/semidefinite_programming.html
Sat, 29 Jun 2019 16:33:47 +0200

References “The most generalised version of convex programming”.
Semidefinite Programming L. Vandenberghe and S. Boyd
Convex Optimization and Euclidean Distance Geometry 2ε by Jon Dattoro
References Freund, Robert. n.d. “Introduction to Semidefinite Programming,” 51. Vandenberghe, Lieven, and Stephen P. Boyd. 1996. “Semidefinite Programming.” SIAM Review 38 (1): 49–95. https://web.stanford.edu/~boyd/papers/sdp.html.

Optimisation
https://danmackinlay.name/notebook/optim.html
Thu, 27 Jun 2019 09:53:15 +0200

General Brief intro material Textbooks To file Alternating Direction Method of Multipliers Optimisation on manifolds Gradient-free optimization “Meta-heuristic” methods Annealing and Monte Carlo optimisation methods Expectation maximization Parallel Implementations To file Miscellaneous optimisation techniques suggested on Linkedin Primal/dual problems Majorization-minorization Difference-of-Convex-objectives References Crawling through alien landscapes in the fog, looking for mountain peaks.
I’m mostly interested in continuous optimisation, but, you know, combinatorial optimisation is a whole thing.

(Weighted) least squares fits
https://danmackinlay.name/notebook/least_squares.html
Wed, 22 May 2019 11:52:37 +1000

Iteratively reweighted References A classic. Surprisingly deep.
A few non-comprehensive notes on approximating by the arbitrary-but-convenient expedient of minimising the sum of the squares of the deviances.
As used in many, many problems, e.g. lasso regression.
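A minimal sketch of the weighted version for a straight-line fit (pure Python, closed form via the 2x2 weighted normal equations; the data and names are mine):

```python
def weighted_least_squares_line(xs, ys, ws):
    """Fit y ~ a + b*x minimising sum_i w_i * (y_i - a - b*x_i)^2,
    by solving the 2x2 weighted normal equations in closed form."""
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    a = (sy - b * sx) / sw
    return a, b

# Points on y = 1 + 2x, plus one corrupted point that we downweight.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 100.0]
ws = [1.0, 1.0, 1.0, 1.0, 1e-6]   # near-zero weight on the bad observation
a, b = weighted_least_squares_line(xs, ys, ws)
print(a, b)
```

Iteratively reweighted least squares is this step in a loop, with the weights recomputed from the current residuals.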
Nonlinear least squares with ceres-solver:
Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It can be used to solve non-linear least squares problems with bounds constraints and general unconstrained optimization problems.

Wiener theorem
https://danmackinlay.name/notebook/wiener_theorem.html
Wed, 08 May 2019 13:47:56 +1000

References The special deterministic case of the Wiener-Khintchine theorem, written up with a slightly different notation for a slightly different project.
\[ \renewcommand{\lt}{<} \renewcommand{\gt}{>} \renewcommand{\var}{\operatorname{Var}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\pd}{\partial} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mm}[1]{\boldsymbol{#1}} \renewcommand{\mmm}[1]{\mathrm{#1}} \renewcommand{\cc}[1]{\mathcal{#1}} \renewcommand{\ff}[1]{\mathfrak{#1}} \renewcommand{\oo}[1]{\operatorname{#1}} \renewcommand{\gvn}{\mid} \renewcommand{\II}[1]{\mathbb{I}\{#1\}} \renewcommand{\inner}[2]{\langle #1,#2\rangle} \renewcommand{\Inner}[2]{\left\langle #1,#2\right\rangle} \renewcommand{\finner}[3]{\langle #1,#2;#3\rangle} \renewcommand{\FInner}[3]{\left\langle #1,#2;#3\right\rangle} \renewcommand{\dinner}[2]{[ #1,#2]} \renewcommand{\DInner}[2]{\left[ #1,#2\right]} \renewcommand{\norm}[1]{\| #1\|} \renewcommand{\Norm}[1]{\left\| #1\right\|} \renewcommand{\fnorm}[2]{\| #1;#2\|} \renewcommand{\FNorm}[2]{\left\| #1;#2\right\|} \renewcommand{\trn}[1]{\mathcal{#1}} \renewcommand{\ftrn}[2]{\mathcal{#1}_{#2}} \renewcommand{\Ftrn}[3]{\mathcal{#1}_{#2}\left\{\right\}} \renewcommand{\argmax}{\mathop{\mathrm{argmax}}} \renewcommand{\argmin}{\mathop{\mathrm{argmin}}} \renewcommand{\omp}{\mathop{\mathrm{OMP}}} \]
As seen in correlograms.

Wacky regression
https://danmackinlay.name/notebook/wacky_regression.html
Thu, 02 May 2019 16:21:05 +1000

References I used to maintain a list of regression methods that were almost nonparametric, but as fun as that category was, I was not actually using it often, so I broke it up.
See bagging and boosting methods, neural networks, functional data analysis, Gaussian process regression and randomised regression.
References Fomel, Sergey. 2000. “Inverse B-Spline Interpolation.” Citeseer. http://www.reproducibility.org/RSF/book/sep/bspl/paper.pdf. Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.

Nearly sufficient statistics
https://danmackinlay.name/notebook/nearly_sufficient_statistics.html
Mon, 14 Jan 2019 15:10:53 +1100

Sufficient statistics in exponential families References 🏗
I’m working through a small realisation, for my own interest, which has been helpful in my understanding of variational Bayes; specifically, relating it to non-Bayes variational inference. Also sequential Monte Carlo.
By starting from the idea of sufficient statistics, we come to the idea of variational inference in a natural way, via some other interesting stopovers.
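A concrete instance of that starting point (pure-Python sketch, names mine): for i.i.d. Gaussian data the triple \((n, \sum_i x_i, \sum_i x_i^2)\) is sufficient, and summaries of separate batches merge exactly, losing nothing about the mean and variance.

```python
import random

def summarise(xs):
    """Sufficient statistics for i.i.d. Gaussian data: (n, sum x, sum x^2)."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(s, t):
    """Combining summaries of two batches loses nothing: just add componentwise."""
    return tuple(a + b for a, b in zip(s, t))

def mean_var(summary):
    """Recover the sample mean and (biased) variance from the summary alone."""
    n, sx, sxx = summary
    mean = sx / n
    return mean, sxx / n - mean * mean

random.seed(0)
batch1 = [random.gauss(2.0, 1.0) for _ in range(500)]
batch2 = [random.gauss(2.0, 1.0) for _ in range(700)]

# The streaming summary matches the summary of all the data pooled together.
merged = merge(summarise(batch1), summarise(batch2))
pooled = summarise(batch1 + batch2)
print(mean_var(merged), mean_var(pooled))
```

Outside the exponential families this exactness fails, which is where "nearly sufficient" summaries, and eventually variational approximations, come in.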
Consider the Bayes filtering setup.