optimization on The Dan MacKinlay family of variably-well-considered enterprises
https://danmackinlay.name/tags/optimization.html
Recent content in optimization on The Dan MacKinlay family of variably-well-considered enterprises. Hugo -- gohugo.io. en-us. Mon, 30 Nov 2020 14:55:18 +1100

Recommender systems
https://danmackinlay.name/notebook/recommender_systems.html
Mon, 30 Nov 2020 14:55:18 +1100https://danmackinlay.name/notebook/recommender_systems.htmlNot my area, but I need a landing page to refer to for some non-specialist contacts of mine.
I am most familiar with the matrix factorization approaches (e.g. factorization machines, NNMF) but there are many others; e.g. variational autoencoder approaches are in vogue.
An overview by Javier lists many approaches.
Most Popular recommendations (the baseline)
Item-User similarity based recommendations
kNN Collaborative Filtering recommendations
GBM based recommendations
Non-Negative Matrix Factorization recommendations
Factorization Machines (Steffen Rendle 2010)
Field Aware Factorization Machines (Yuchin Juan, et al, 2016)
Deep Learning based recommendations (Wide and Deep, Heng-Tze Cheng, et al, 2016)
Neural Collaborative Filtering (Xiangnan He et al.

Jax
https://danmackinlay.name/notebook/jax.html
Mon, 30 Nov 2020 13:13:44 +1100https://danmackinlay.name/notebook/jax.htmlDeep learning Haiku Flax Probabilistic programming jax (python) is a successor to classic python autograd. It includes various code optimisations: jit-compilation, differentiation and vectorisation.
The pitch:
JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.

Variational inference by message-passing in graphical models
https://danmackinlay.name/notebook/message_passing.html
Wed, 25 Nov 2020 17:42:32 +1100https://danmackinlay.name/notebook/message_passing.htmlVariational inference where the model factorizes over some graphical independence structure, which means we get cheap and distributed inference. I am currently particularly interested in this for latent GP models. Many things can be expressed as message passing algorithms. The grandparent idea in this unification seems to be “Belief propagation”, a.k.a. “sum-product message-passing”, credited to (Pearl, 1982) for DAGs and then generalised to MRFs, PGMs, factor graphs etc.

Tensorflow
https://danmackinlay.name/notebook/tensorflow.html
Tue, 10 Nov 2020 14:20:46 +1100https://danmackinlay.name/notebook/tensorflow.htmlAbstractions Tutorials Debugging Tensorboard Getting data in (Non-recurrent) convolutional networks Recurrent networks Official documentation Community guides Keras: The recommended way of using tensorflow Getting models out Training in the cloud because you don’t have NVIDIA sponsorship Extending Misc HOWTOs Nightly builds Dynamic graphs GPU selection Silencing tensorflow Hessians and higher order optimisation Manage tensorflow environments Optimisation tricks Probabilistic networks A C++/Python/etc neural network toolkit by Google.

ELBO
https://danmackinlay.name/notebook/elbo.html
Wed, 28 Oct 2020 10:59:07 +1100https://danmackinlay.name/notebook/elbo.html\(\renewcommand{\Ex}{\mathbb{E}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\kl}{\operatorname{KL}} \renewcommand{\H}{\mathbb{H}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\pd}{\partial}\)
On using the most convenient probability metric (i.e. KL divergence) to do variational inference.
There is nothing novel here. But everyone who is doing variational inference has to work through this just once, and I’m doing so here.
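For reference, the decomposition in question (the standard identity, in the notation set up above; nothing here is novel either):

\[
\log p(\vv{x}) = \Ex_{q(\vv{z})}\left[\log \frac{p(\vv{x},\vv{z})}{q(\vv{z})}\right] + \kl\left(q(\vv{z}) \| p(\vv{z}\mid\vv{x})\right).
\]

The first term is the ELBO. Since the KL term is non-negative, the ELBO lower-bounds the log evidence, and maximising it over \(q\) simultaneously tightens the bound and pulls \(q\) toward the posterior.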
Yuge Shi’s introduction is the best short intro that gets to state-of-the-art. The canonical intro is de Garis Matthews (2017), who did a thesis on it.

Gradient descent
https://danmackinlay.name/notebook/gd_1st_order.html
Tue, 27 Oct 2020 07:06:36 +1100https://danmackinlay.name/notebook/gd_1st_order.htmlCoordinate descent Accelerated Continuous approximations of iterations Online versus stochastic Conditional Gradient Gradient descent, a classic first-order optimisation method, with many variants, and many things one might wish to understand.
There are only a few things I wish to understand for the moment:
Coordinate descent Descend along each coordinate individually.
Small clever hack for certain domains: log gradient descent.
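A minimal sketch of plain cyclic coordinate descent (the toy quadratic and all names here are mine, for illustration only):

```python
import numpy as np

# Minimise f(x) = 0.5 x^T Q x - c^T x by exact minimisation over
# one coordinate at a time, cycling through the coordinates.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])  # symmetric positive definite
c = np.array([1.0, 2.0])

x = np.zeros(2)
for _ in range(50):  # 50 full sweeps
    for i in range(len(x)):
        # df/dx_i = Q[i] @ x - c[i] = 0, solved for x_i with the others fixed
        x[i] = (c[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]

x_star = np.linalg.solve(Q, c)  # global optimum, for comparison
```

For a strictly convex quadratic the coordinate updates have this closed form and the sweeps converge to the global minimiser.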
Accelerated How and when does it work?

Efficient factoring of GP likelihoods
https://danmackinlay.name/notebook/gp_factoring.html
Mon, 26 Oct 2020 12:46:34 +1100https://danmackinlay.name/notebook/gp_factoring.htmlBasic sparsity via inducing variables SVI for Gaussian processes Latent Gaussian Process models There are many ways to cleverly slice up GP likelihoods so that inference is cheap.
This page is about some of them, especially the union of sparse and variational tricks. Scalable Gaussian process regressions choose cunning factorisations such that the model collapses down to a lower-dimensional thing than it might have seemed to need, at least approximately.

Differentiating through the Gamma
https://danmackinlay.name/notebook/gamma_diff.html
Thu, 15 Oct 2020 10:50:59 +1100https://danmackinlay.name/notebook/gamma_diff.htmlSuppose I want to find a distributional gradient for a gamma process. Generically I would find this via Monte Carlo gradient estimation.
Here is a problem-specific method:
I allow the latent random state to have more dimensions than a univariate. Let’s get specific. An example arises if we raid the random-variate-generation literature for transform methods to generate RNGs and differentiate those. A Gamma variate can be generated from a transformed normal and a uniform random variable, or two uniforms, depending on the parameter range.

Automatic design of experiments
https://danmackinlay.name/notebook/design_of_experiments.html
Tue, 13 Oct 2020 17:50:06 +1100https://danmackinlay.name/notebook/design_of_experiments.htmlProblem statement Acquisition functions Connection to RL Implementation skopt Dragonfly PySOT GPyOpt Sigopt BoTorch/Ax spearmint SMAC Closely related is AutoML, in that surrogate optimisation is a popular tool for it.
Problem statement According to Gilles Louppe and Manoj Kumar:
We are interested in solving
\[x^* = \arg \min_x f(x)\]
under the constraints that
\(f\) is a black box for which no closed form is known (nor its gradients); \(f\) is expensive to evaluate; evaluations of \(y=f(x)\) may be noisy.

Hyperparameter optimization in ML
https://danmackinlay.name/notebook/hyperparam_opt.html
Tue, 06 Oct 2020 10:42:44 +1100https://danmackinlay.name/notebook/hyperparam_opt.htmlBayesian/surrogate optimisation Differentiable hyperparameter optimisation Random search Adaptive random search Implementations Determined Ray Optuna hyperopt auto-sklearn skopt spearmint SMAC AutoML Split off from autoML.
The art of choosing the best hyperparameters for your ML model’s algorithms, of which there may be many.
Should you bother getting fancy about this? Ben Recht argues no, that random search is competitive with highly tuned Bayesian methods in hyperparameter tuning.

AutoML
https://danmackinlay.name/notebook/automl.html
Fri, 02 Oct 2020 06:29:47 +1000https://danmackinlay.name/notebook/automl.htmlReinforcement learning approaches Differentiable architecture search Implementations auto-sklearn The sub-field of optimisation that specifically aims to automate model selection in machine learning (and occasionally also ensemble construction).
There are two major approaches here that I am aware of, both of which are related in a kind of abstract way, but which are in practice different:
Finding the right architecture for your neural net, a.

Automatic differentiation
https://danmackinlay.name/notebook/autodiff.html
Wed, 30 Sep 2020 08:18:21 +1000https://danmackinlay.name/notebook/autodiff.htmlApplication to backpropagation Computational complexity Forward- versus reverse-mode Symbolic differentiation Misc Software jax Tensorflow Pytorch Julia taichi Classic python autograd Micrograd Theano algopy Casadi ADOL ad ceres solver audi Gradient field in python
Getting your computer to tell you the gradient of a function, without resorting to finite difference approximation, or coding an analytic derivative by hand. We usually mean this in the sense of automatic forward or reverse mode differentiation, which is not, as such, a symbolic technique, but symbolic differentiation gets an incidental look-in, and these ideas do of course relate.

Data summarization
https://danmackinlay.name/notebook/data_summarization.html
Fri, 18 Sep 2020 06:21:41 +1000https://danmackinlay.name/notebook/data_summarization.htmlCoresets representative subsets Directly approximate log likelihood Summary statistics which don’t require you to keep all the data but which allow you to do inference nearly as well. e.g. sufficient statistics in exponential families allow you to do certain kinds of inference perfectly without anything except summaries. Methods such as variational Bayes summarize data by maintaining a posterior density (usually a mixture model) as a summary of all the data, at some cost in accuracy.

Reparameterization tricks in inference
https://danmackinlay.name/notebook/reparameterization_trick.html
Thu, 10 Sep 2020 13:24:50 +1000https://danmackinlay.name/notebook/reparameterization_trick.htmlFor variational autoencoders Normalized flows For vanilla density estimation Tutorials Approximating the desired distribution by perturbation of the available distribution
A trick in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “fancy change of variables so that I can differentiate through the parameters of a distribution”. Connections to optimal transport and likelihood-free inference, in that this trick can enable some clever approximate-likelihood approaches.

Variational autoencoders
https://danmackinlay.name/notebook/variational_autoencoders.html
Thu, 10 Sep 2020 13:17:16 +1000https://danmackinlay.name/notebook/variational_autoencoders.htmlA variational autoencoder uses a limited latent distribution to approximate a complex posterior distribution
A trick in e.g. variational inference / probabilistic neural nets where we presume that the model is generated by a low-dimensional latent space, which is, if you squint at it, kind of the information bottleneck trick but in a probabilistic setting. To my mind it is a sorta-kinda nonparametric approximate Bayes method. But that tells you nothing.

Online learning
https://danmackinlay.name/notebook/online_learning.html
Wed, 26 Aug 2020 16:48:40 +1000https://danmackinlay.name/notebook/online_learning.htmlMirror descent Follow-the-regularized leader Parameter-free Covariance An online learning perspective gives bounds on the regret: the gap in performance between online estimation and the optimal estimator with access to the entire data.
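Spelled out (notation my own; exact definitions vary slightly across the literature): the regret after \(T\) rounds is

\[
R_T = \sum_{t=1}^{T} \ell_t(x_t) - \min_{x} \sum_{t=1}^{T} \ell_t(x),
\]

where the learner commits to \(x_t\) before the loss \(\ell_t\) is revealed. Sublinear regret, \(R_T = o(T)\), means the average per-round gap to the best fixed action in hindsight vanishes.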
A lot of things are sort-of online learning; stochastic gradient descent, for example, is closely related. However, if you meet someone who claims to study “online learning” they usually mean to emphasise particular things.

(Outlier) robust statistics
https://danmackinlay.name/notebook/robust_statistics.html
Tue, 14 Jul 2020 07:10:31 +1000https://danmackinlay.name/notebook/robust_statistics.htmlTODO Corruption models M-estimation with robust loss MM-estimation Median-based estimators Others There are also robust estimators in econometrics; then it means something about good behaviour under heteroskedastic and/or correlated error. Robust Bayes means something about inference that is robust to the choice of prior (which could overlap but is a rather different emphasis).
Outlier robustness is AFAICT more-or-less a frequentist project. Bayesian approaches seem to achieve robustness largely by choosing heavy-tailed priors or heavy-tailed noise distributions where they might have chosen light-tailed ones, e.

Boosting, bagging and other weak-learner ensemble methods
https://danmackinlay.name/notebook/boosting_bagging.html
Sun, 07 Jun 2020 06:37:42 +1000https://danmackinlay.name/notebook/boosting_bagging.htmlQuestions Random trees, forests, jungles Self-regularising properties Gradient boosting Bayes Implementations xgboost catboost bartmachine Ensemble methods; mixing predictions from many weak learners to get strong learners.
The rule of thumb seems to be “Fast to train, fast to use. Gets you results. May not get you answers.” So, like neural networks but from the previous hype cycle.
Questions In a different context, I’ve run into the general ensemble method model averaging; How does that relate to boosting/bagging algorithms?

Automatic differentiation in Julia
https://danmackinlay.name/notebook/julia_autodiff.html
Fri, 05 Jun 2020 12:22:53 +1000https://danmackinlay.name/notebook/julia_autodiff.htmlJulia has an embarrassment of different methods of automatic differentiation (homoiconicity and introspection make this comparatively easy), and the comparative selling points of each are not always clear.
The juliadiff project produces ForwardDiff.jl and ReverseDiff.jl which do what I would expect, namely autodiff in forward and reverse mode respectively. ForwardDiff claims to be very advanced. ReverseDiff works but is abandoned.
ForwardDiff implements methods to take derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions.

Voice fakes
https://danmackinlay.name/notebook/voice_fakes.html
Wed, 27 May 2020 20:42:58 +1000https://danmackinlay.name/notebook/voice_fakes.htmlStyle transfer Text to speech A placeholder. Generating speech, without a speaker, or possibly style transferring speech.
Style transfer You have a recording of me saying something self-incriminating. You would prefer it to be a recording of Hillary Clinton saying something incriminating. This is achievable; the open-source options are not impressive, but the pay-to-play options are getting very good.
Kyle Kastner’s suggestions
VoCo seems to be a classic concatenative synthesis method for doing “voice cloning” which generally will work on small datasets but won’t really generalize beyond the subset of sound tokens you already have, I did a blog post on a really simple version of this.

Variational inference
https://danmackinlay.name/notebook/variational_inference.html
Sun, 24 May 2020 12:04:18 +1000https://danmackinlay.name/notebook/variational_inference.htmlIntroduction Philosophical interpretations In graphical models Inference via KL divergence Mixture models Reparameterization trick Autoencoders Loss functions Approximating the intractable measure (right) with a transformation of a tractable one (left)
Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that allows us to trade off difficulty for fidelity in some useful way.

Matrix calculus
https://danmackinlay.name/notebook/matrix_calculus.html
Tue, 19 May 2020 12:00:06 +1000https://danmackinlay.name/notebook/matrix_calculus.htmlMatrix differentials Indexed tensor calculus We can generalise high school calculus, which is about scalar functions of a scalar argument, in various ways to handle matrix-valued functions or matrix-valued arguments. One could generalise this further, to full tensor calculus. But it happens that specifically matrix/vector operations are at a useful point of complexity for lots of algorithms, kind of an MVP. (I usually want this for higher order gradient descent.

Evolution
https://danmackinlay.name/notebook/evolution.html
Mon, 27 Apr 2020 21:44:18 +1000https://danmackinlay.name/notebook/evolution.htmlConnection with mathematical optimisation Evolution of cooperation Extended Evolutionary Synthesis To read Ruben Bolling, How to draw Doug
Biological adaptation by mutation, natural selection and random diffusion.
Analogies and disanalogies between tooth-and-claw natural selection, and our statistical learning methods. Stochastic process models of gene and population dynamics.
See also geometry of fitness landscapes, cooperation.
I would like to better understand:
Sundry stochastic models of allele diffusion.

Optimal transport metrics
https://danmackinlay.name/notebook/optimal_transport_metrics.html
Fri, 06 Mar 2020 16:21:07 +1100https://danmackinlay.name/notebook/optimal_transport_metrics.htmlAnalytic expressions Gaussian Kontorovich-Rubinstein duality “Neural Net distance” Fisher distance Sinkhorn divergence Awaiting filing Recommended introductions. I presume there are other uses for optimal transport distances apart from as probability metrics, but so far I only care about them in that context, so this will be skewed that way.
I am about to do a reading group based on Peyré’s course, so I may be harmonising the notation here with that soon.

Gradient descent, first-order, stochastic
https://danmackinlay.name/notebook/gd_1st_order_stochastic.html
Fri, 07 Feb 2020 12:27:58 +1100https://danmackinlay.name/notebook/gd_1st_order_stochastic.htmlVariance-reduced Normalized Sundry Hacks Stochastic optimization uses noisy (possibly approximate) 1st-order gradient information to find the argument which minimises
\[ x^* = \operatorname{argmin}_x f(x) \]
for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).
That this works with little fuss in very high dimensions is a major pillar of deep learning.
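A minimal sketch of the idea on a toy linear-least-squares problem (the data, step size, and all names here are my own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(x) = mean_i (a_i . x - b_i)^2, minimised at x_true.
# The stochastic oracle returns the gradient on a random minibatch.
A = rng.normal(size=(1000, 5))
x_true = np.arange(5.0)
b = A @ x_true

def minibatch_grad(x, batch=32):
    idx = rng.integers(0, A.shape[0], size=batch)
    Ab, bb = A[idx], b[idx]
    return 2.0 * Ab.T @ (Ab @ x - bb) / batch

x = np.zeros(5)
for _ in range(500):
    # small constant step suffices on this noiseless toy problem;
    # classic Robbins-Monro theory uses decaying step sizes
    x -= 0.05 * minibatch_grad(x)
```

Each step uses only 32 of the 1000 observations, which is the entire point: the per-step cost is independent of the dataset size.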
The original version, given in terms of root finding, is (Herbert Robbins and Monro 1951), who later generalised the analysis in (H.

Gradient descent, Newton-like, stochastic
https://danmackinlay.name/notebook/gd_2nd_order_stochastic.html
Thu, 23 Jan 2020 10:35:22 +1100https://danmackinlay.name/notebook/gd_2nd_order_stochastic.htmlSubsampling General case Stochastic Newton-type optimization, unlike deterministic Newton optimisation, uses noisy (possibly approximate) 2nd-order gradient information to find the argument which minimises
\[ x^* = \operatorname{argmin}_x f(x) \]
for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).
Subsampling Most of the good tricks here are set up for ML-style training losses where the bottleneck is summing a large number of loss functions.
LiSSA attempts to make 2nd order gradient descent methods scale to large parameter sets (Agarwal, Bullins, and Hazan 2016):

Phase retrieval
https://danmackinlay.name/notebook/phase_retrieval.html
Thu, 07 Nov 2019 18:28:49 +0100https://danmackinlay.name/notebook/phase_retrieval.htmlYou know the power of the signal; what is the phase? Griffin-Lim algorithm, Wirtinger flow methods based on Wirtinger calculus, Phase-gradient heap integration (Průša and Søndergaard 2016).
Diagram from TiFGAN (Marafioti et al. 2019) via CJ.
TODO: investigate Yue M Lu’s work on phase retrieval as an important example in a large class of somewhat-analytically-understood nonconvex problems, starting from his recent slide deck on that theme.

Sparse coding
https://danmackinlay.name/notebook/sparse_coding.html
Tue, 05 Nov 2019 16:28:28 +0100https://danmackinlay.name/notebook/sparse_coding.htmlResources Wavelet bases Matching Pursuits Learnable codings Codings with desired invariances Misc Implementations Linear expansion with dictionaries of basis functions, with respect to which you wish your representation to be sparse; i.e. in the statistical case, basis-sparse regression. But even outside statistics, you wish simply to approximate some data compactly. My focus here is on the noisy-observation case, although the same results are recycled enough throughout the field.

Optimal control
https://danmackinlay.name/notebook/optimal_control.html
Fri, 01 Nov 2019 12:58:54 +1100https://danmackinlay.name/notebook/optimal_control.htmlNuts and bolts Online Nothing to see here; I don’t do optimal control. But here are some notes for when I thought I might.
Feedback Systems: An Introduction for Scientists and Engineers by Karl J. Åström and Richard M. Murray is an interesting control systems theory course from Caltech.
The online control blog post mentioned below has a summary
Perhaps the most fundamental setting in control theory is an LDS with quadratic costs \(c_t\) and i.

Gradient descent, Higher order
https://danmackinlay.name/notebook/gd_3rd_order.html
Sat, 26 Oct 2019 16:56:24 +0800https://danmackinlay.name/notebook/gd_3rd_order.htmlNewton-type optimization uses 2nd-order gradient information (i.e. a Hessian matrix) to solve optimization problems. Higher order optimisation uses 3rd order gradients and so on. They are elegant for univariate functions.
This is rarely done in problems that I face because:
3rd order derivatives of multivariate optimisations are usually too big in time and space complexity to be tractable. They are not (simply) expressible as matrices, so they can benefit from a little tensor theory.

Statistical learning theory for time series
https://danmackinlay.name/notebook/learning_theory_time_series.html
Tue, 01 Oct 2019 16:20:07 +1000https://danmackinlay.name/notebook/learning_theory_time_series.htmlStatistical learning theory for dependent data such as time series, and possibly other dependency structures. But I only know about results for time series.
Non-stationary, non-asymptotic bounds please. Keywords: Ergodic, α-, β-mixing.
Mohri and Kuznetsov have done lots of work here; see, e.g. their NIPS2016 tutorial. There seem to be a lot of types of ergodic/mixing results, about which I know as yet nothing. Notably (Kuznetsov and Mohri 2016, 2015) try to go beyond this setup.

Wiener-Khintchine representation
https://danmackinlay.name/notebook/wiener_khintchine.html
Mon, 23 Sep 2019 13:47:08 +1000https://danmackinlay.name/notebook/wiener_khintchine.htmlDeterministic case Bochner’s Theorem \[ \renewcommand{\lt}{<} \renewcommand{\gt}{>} \renewcommand{\var}{\operatorname{Var}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\pd}{\partial} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mm}[1]{\boldsymbol{#1}} \renewcommand{\mmm}[1]{\mathrm{#1}} \renewcommand{\cc}[1]{\mathcal{#1}} \renewcommand{\ff}[1]{\mathfrak{#1}} \renewcommand{\oo}[1]{\operatorname{#1}} \renewcommand{\gvn}{\mid} \renewcommand{\II}[1]{\mathbb{I}\{#1\}} \renewcommand{\inner}[2]{\langle #1,#2\rangle} \renewcommand{\Inner}[2]{\left\langle #1,#2\right\rangle} \renewcommand{\finner}[3]{\langle #1,#2;#3\rangle} \renewcommand{\FInner}[3]{\left\langle #1,#2;#3\right\rangle} \renewcommand{\dinner}[2]{[ #1,#2]} \renewcommand{\DInner}[2]{\left[ #1,#2\right]} \renewcommand{\norm}[1]{\| #1\|} \renewcommand{\Norm}[1]{\left\| #1\right\|} \renewcommand{\fnorm}[2]{\| #1;#2\|} \renewcommand{\FNorm}[2]{\left\| #1;#2\right\|} \renewcommand{\trn}[1]{\mathcal{#1}} \renewcommand{\ftrn}[2]{\mathcal{#1}_{#2}} \renewcommand{\Ftrn}[3]{\mathcal{#1}_{#2}\left\{\right\}} \renewcommand{\argmax}{\mathop{\mathrm{argmax}}} \renewcommand{\argmin}{\mathop{\mathrm{argmin}}} \renewcommand{\omp}{\mathop{\mathrm{OMP}}} \]
Consider a real-valued stochastic process \(\{X_t\}_{t\in\mathcal{T}}\) over an index metric space \(\mathcal{T}\) such as \(\mathcal{T}=\mathbb{R}^n\), i.e. any given realisation of such a process is a function \(\mathcal{T}\to\mathbb{R}\).

Wirtinger calculus
https://danmackinlay.name/notebook/wirtinger_calculus.html
Tue, 10 Sep 2019 10:02:46 +1000https://danmackinlay.name/notebook/wirtinger_calculus.htmlHow do you differentiate real-valued functions of complex arguments? Wirtinger calculus. This is a ridiculous hack that happens to work very well for signal processing over the complex field, especially in optimisation. It arises naturally in, for example, phase retrieval (Zhang and Liang 2016; Candes, Li, and Soltanolkotabi 2015; Chen and Candès 2015; Seuret and Gouaisbaut 2013). Because of its area of popularity, this will almost surely arise in combination with matrix calculus.

Large sample theory
https://danmackinlay.name/notebook/large_sample_theory.html
Mon, 09 Sep 2019 12:52:51 +1000https://danmackinlay.name/notebook/large_sample_theory.htmlFisher Information Convolution Theorem Delta methods, influence functions, and so on. Convolution theorems, local asymptotic minimax theorems.
A convenient feature of M-estimation, and especially maximum likelihood estimation, is the simple behaviour of estimators in the asymptotic large-sample-size limit, which can give you, e.g. variance estimates, or motivate information criteria, robust statistics, optimisation etc.
In the most celebrated and convenient cases, asymptotic bounds are about normally-distributed errors, and these are typically derived through Local Asymptotic Normality theorems.

Gradient descent, Newton-like
https://danmackinlay.name/notebook/gd_2nd_order.html
Tue, 03 Sep 2019 14:06:19 +1000https://danmackinlay.name/notebook/gd_2nd_order.htmlVanilla Newton methods Quasi Newton methods Hessian free Natural gradient descent. Stochastic Newton-type optimization, unlike basic gradient descent, uses (possibly approximate) 2nd-order gradient information to find the argument which minimises
\[ x^*=\operatorname{argmin}_{\mathbf{x}} f(x) \]
for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).
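A minimal sketch of the Newton step \(x \leftarrow x - [\nabla^2 f(x)]^{-1}\nabla f(x)\); on a quadratic the second-order model is exact, so a single step reaches the minimiser (toy problem of my own construction):

```python
import numpy as np

# Quadratic objective f(x) = 0.5 x^T Q x - c^T x,
# with gradient Q x - c and constant Hessian Q.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
c = np.array([1.0, -1.0])

x = np.array([10.0, -7.0])        # arbitrary starting point
g = Q @ x - c                     # gradient at x
x = x - np.linalg.solve(Q, g)     # one Newton step: solve H d = g, step by -d
```

In practice one solves the linear system rather than forming the inverse Hessian, and damps or trust-regions the step for non-quadratic objectives.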
Optimization over arbitrary functions typically gets discussed in terms of line-search and trust-region methods, both of which can be construed, AFAICT, as second-order methods.

Semidefinite programming
https://danmackinlay.name/notebook/semidefinite_programming.html
Sat, 29 Jun 2019 16:33:47 +0200https://danmackinlay.name/notebook/semidefinite_programming.html “The most generalised version of convex programming”.
Semidefinite Programming L. Vandenberghe and S. Boyd
Convex Optimization and Euclidean Distance Geometry 2ε by Jon Dattoro
Freund, Robert. n.d. “Introduction to Semidefinite Programming,” 51.
Vandenberghe, Lieven, and Stephen P. Boyd. 1996. “Semidefinite Programming.” SIAM Review 38 (1): 49–95. https://web.stanford.edu/~boyd/papers/sdp.html.

Optimisation
https://danmackinlay.name/notebook/optim.html
Thu, 27 Jun 2019 09:53:15 +0200https://danmackinlay.name/notebook/optim.htmlGeneral Brief intro material Textbooks To file Alternating Direction Method of Multipliers Optimisation on manifolds Gradient-free optimization “Meta-heuristic” methods Annealing and Monte Carlo optimisation methods Expectation maximization Parallel Implementations To file Miscellaneous optimisation techniques suggested on Linkedin Primal/dual problems Majorization-minorization Difference-of-Convex-objectives Crawling through alien landscapes in the fog, looking for mountain peaks.
I’m mostly interested in continuous optimisation, but, you know, combinatorial optimisation is a whole thing.

(Weighted) least squares fits
https://danmackinlay.name/notebook/least_squares.html
Wed, 22 May 2019 11:52:37 +1000https://danmackinlay.name/notebook/least_squares.htmlIteratively reweighted A classic. Surprisingly deep.
A few non-comprehensive notes on approximating by the arbitrary-but-convenient expedient of minimising the sum of the squares of the deviances.
As used in many, many problems, e.g. lasso regression.
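A minimal sketch of a weighted fit via the standard closed form \(\hat\beta = (X^\top W X)^{-1} X^\top W y\) (synthetic data and all names are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression with per-observation weights w_i.
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)
w = rng.uniform(0.5, 2.0, size=100)

# Closed form: beta = (X^T W X)^{-1} X^T W y, with W = diag(w).
XtW = X.T * w                                  # X^T W without forming diag(w)
beta = np.linalg.solve(XtW @ X, XtW @ y)

# Equivalent: ordinary least squares on sqrt(w)-rescaled data.
sw = np.sqrt(w)
beta_ols, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
```

The sqrt-rescaling equivalence is what iteratively reweighted schemes exploit: each reweighting step is just an ordinary least squares solve.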
Nonlinear least squares with ceres-solver:
Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It can be used to solve Non-linear Least Squares problems with bounds constraints and general unconstrained optimization problems.

Wacky regression
https://danmackinlay.name/notebook/wacky_regression.html
Thu, 02 May 2019 16:21:05 +1000https://danmackinlay.name/notebook/wacky_regression.htmlI used to maintain a list of regression methods that were almost nonparametric, but as fun as that category was, I was not actually using it very often, so I broke it up.
See bagging and boosting methods, neural networks, functional data analysis, Gaussian process regression and randomised regression.
Fomel, Sergey. 2000. “Inverse B-Spline Interpolation.” Citeseer. http://www.reproducibility.org/RSF/book/sep/bspl/paper.pdf.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.

Nearly sufficient statistics
https://danmackinlay.name/notebook/nearly_sufficient_statistics.html
Mon, 14 Jan 2019 15:10:53 +1100https://danmackinlay.name/notebook/nearly_sufficient_statistics.htmlSufficient statistics in exponential families 🏗
I’m working through a small realisation, for my own interest, which has been helpful in my understanding of variational Bayes; specifically, relating it to non-Bayes variational inference. Also sequential Monte Carlo.
By starting from the idea of sufficient statistics, we come to the idea of variational inference in a natural way, via some other interesting stopovers.
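The exponential-family punchline in miniature: for i.i.d. Gaussian data the triple \((n, \sum_i x_i, \sum_i x_i^2)\) is sufficient, so we can stream the data, keep only that summary, and still recover the MLE exactly (a toy sketch; the code and names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=1.5, size=10_000)

# Stream the data once, retaining only the sufficient statistics.
n, s, ss = 0, 0.0, 0.0
for xi in data:
    n += 1
    s += xi
    ss += xi * xi

# MLEs recovered from the three-number summary alone.
mu_hat = s / n
var_hat = ss / n - mu_hat ** 2
```

Nothing about the raw data beyond \((n, s, ss)\) is needed; "nearly sufficient" statistics relax this exactness for models without such finite summaries.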
Consider the Bayes filtering setup.

Variational state filtering
https://danmackinlay.name/notebook/state_filters_variational.html
Fri, 07 Dec 2018 12:39:45 +1100https://danmackinlay.name/notebook/state_filters_variational.htmlA placeholder; State filtering and estimation where the unobserved state and/or process noise are variationally-learned distributions. For now the only version that is even peripherally related to my work is the Gaussian process state filter.
Archer, Evan, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. 2015. “Black Box Variational Inference for State Space Models.” November 23, 2015. http://arxiv.org/abs/1511.07367.
Bayer, Justin, and Christian Osendorfer. 2014. “Learning Stochastic Recurrent Networks.

Optimisation, combinatorial
https://danmackinlay.name/notebook/optim_combinatorial.html
Sat, 11 Aug 2018 12:25:13 +1000https://danmackinlay.name/notebook/optim_combinatorial.htmlThis is not my manor, but occasionally a combinatorial optimisation problem arises that I would like to magically cause to vanish.
Google’s OR-Tools do this for some problems.

Submodular functions, maximizing
https://danmackinlay.name/notebook/submodular.html
Mon, 09 Jul 2018 09:11:00 +1000https://danmackinlay.name/notebook/submodular.htmlSubmodular functions arise in economics of multi-agent games and in various optimization problems that look like problems facing me.
Balkanski, Eric, and Yaron Singer. 2018a. “Approximation Guarantees for Adaptive Sampling,” 17. https://scholar.harvard.edu/files/ericbalkanski/files/approximation-guarantees-for-adaptive-sampling.pdf.
———. 2018b. “The Adaptive Complexity of Maximizing a Submodular Function,” 37. https://scholar.harvard.edu/files/ericbalkanski/files/the-adaptive-complexity-of-maximizing-a-submodular-function.pdf.
Krause, Andreas, and Daniel Golovin. 2013. “Submodular Function Maximization.” In Tractability, edited by Lucas Bordeaux, Youssef Hamadi, Pushmeet Kohli, and Robert Mateescu, 71–104.

Gradient descent, continuous, primal/dual formulations.
https://danmackinlay.name/notebook/gd_constrained.html
Mon, 07 Aug 2017 11:59:41 +1000https://danmackinlay.name/notebook/gd_constrained.htmlLagrange multipliers Duals Placeholder; I need to update this with real info, given how often I need to know it.
Lagrange multipliers Constrained optimisation using Lagrange’s one weird trick, and the Karush-Kuhn-Tucker conditions. The search for saddle points and roots.
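To fix notation (a textbook statement, not specific to these notes): for minimising \(f(x)\) subject to \(g(x) = 0\), stationarity of the Lagrangian \(L(x, \lambda) = f(x) + \lambda g(x)\) requires

\[
\nabla_x f(x) + \lambda \nabla_x g(x) = 0, \qquad g(x) = 0.
\]

For example, minimising \(f(x, y) = x^2 + y^2\) subject to \(x + y = 1\) gives \(2x + \lambda = 0\) and \(2y + \lambda = 0\), hence \(x = y = \tfrac{1}{2}\) with \(\lambda = -1\).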
Duals The types of optimisation problems you can create from a given set of constraints and objectives, based on primal and dual formulations.

Lagrangian mechanics
https://danmackinlay.name/notebook/lagrangian_mechanics.html
Sun, 18 Jun 2017 08:01:47 +0800https://danmackinlay.name/notebook/lagrangian_mechanics.htmlApplied variational calculus with some physics.
I don’t really do physics these days, but physicists write the best introductions to variational calculus, which I do do.
“What is the quickest path through these mountains?” “Where do I put mountains such that my map produces the right quickest path?”
Optimal control, Pontryagin’s maximum principle. Variational calculus, minimising functionals. Deriving high dimensional functions as solutions of scalar optimisation using conservation principles as seen in functional data analysis and variational inference.

Maximum likelihood inference
https://danmackinlay.name/notebook/maximum_likelihood.html
Thu, 13 Oct 2016 08:27:57 +1100https://danmackinlay.name/notebook/maximum_likelihood.htmlEstimator asymptotic optimality Fisher Information Fun features with exponential families Conditional transformation models the method of sieves Variants Conditional likelihood Marginal likelihood Profile likelihood Partial likelihood Pseudo-likelihood Quasi-likelihood H-likelihood M-estimation based on maximising the empirical likelihood with respect to the model by choosing the appropriate parameters appropriatedly.
See also expectation maximisation, information criteria, robust statistics, decision theory, all of machine learning, optimisation etc.

Distributed optimization for regression
https://danmackinlay.name/notebook/distributed_statistics.html
Tue, 11 Oct 2016 12:09:57 +1100https://danmackinlay.name/notebook/distributed_statistics.htmlTools Placeholder; I have nothing to say about this right now, although I should mention that message-passing algorithms based on variational inference and graphical models are one possible avenue.
Tools Spark.
CoCOA
Acemoglu, Daron, Victor Chernozhukov, and Muhamet Yildiz. 2006. “Learning and Disagreement in an Uncertain World.” Working Paper 12648. Working Paper Series. National Bureau of Economic Research. https://doi.org/10.3386/w12648.
Battey, Heather, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu.

M-estimation
https://danmackinlay.name/notebook/m_estimation.html
Mon, 10 Oct 2016 12:53:36 +1100https://danmackinlay.name/notebook/m_estimation.htmlImplied density functions Robust Loss functions Huber loss Hampel loss Fitting GM-estimators Loosely, estimating a quantity by choosing it to be the extremum of a function, or, if it’s well-behaved enough, a zero of its derivative.
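A minimal sketch of a Huber-loss location M-estimate, fit by iteratively reweighted least squares with the standard weight rule \(w_i = \min(1, \delta/|r_i|)\) (the data and names here are mine, for illustration):

```python
import numpy as np

def huber_location(x, delta=1.345, iters=50):
    """Location M-estimate under the Huber loss, via IRLS."""
    mu = np.median(x)  # robust starting point
    for _ in range(iters):
        r = x - mu
        # Huber weights: 1 for small residuals, delta/|r| for large ones
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        mu = np.sum(w * x) / np.sum(w)
    return mu

rng = np.random.default_rng(3)
clean = rng.normal(0.0, 1.0, size=200)
x = np.concatenate([clean, np.full(20, 50.0)])  # ~10% gross outliers

mu_robust = huber_location(x)   # stays near 0
mu_mean = x.mean()              # dragged toward the outliers
```

The quadratic-near-zero, linear-in-the-tails shape of the Huber loss is exactly what bounds the influence of the outlying points here.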
Popular with machine learning, where loss-function based methods are ubiquitous. In statistics we see this famously in maximum likelihood estimation, robust estimation, and least squares loss, for which M-estimation provides a unifying formalism with a convenient large-sample asymptotic theory.

Generalised linear models
https://danmackinlay.name/notebook/glm.html
Wed, 31 Aug 2016 10:27:42 +1000https://danmackinlay.name/notebook/glm.htmlTODO Classic linear models Generalised linear models Response distribution Linear Predictor Link function Quasilikelihood Hierarchical generalised linear models Generalised additive models Generalised additive models for location, scale and shape Generalised hierarchical additive models for location, scale and shape Generalised estimating equations Using the machinery of linear regression to predict in somewhat more general regressions, using least-squares or quasi-likelihood approaches. This means you are still doing something like familiar linear regression, but outside the setting of e.

Statistical learning theory
https://danmackinlay.name/notebook/statistical_learning_theory.html
Tue, 16 Aug 2016 18:20:24 +1000https://danmackinlay.name/notebook/statistical_learning_theory.htmlMisc VC dimension Rademacher complexity Stability-based PAC-learning Non-I.I.D data Stein’s method Misc, to file Image by Kareem Carr
Another placeholder far from my own background.
Given some amount of noisy data, how complex a model can I learn before I’m going to be failing to generalise to new data? If I can answer this question a priori, I can fit a complex model with some messy hyperparameter and choose that hyperparameter without doing boring cross-validation.