Garbled highlights from ICML 2017
2016-12-05 — 2017-08-11
Shambolic notes from ICML 2017, Sydney, Australia.
http://proceedings.mlr.press/v70/
1 Questions arising
- Objective smoothing, changing: can I do more of it?
2 Allen-Zhu, Tutorial on optimisation
2.1 Convex
Fenchel-Legendre Duality (convexity preservation)
Primal-dual formulation
- Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization by Shai Shalev-Shwartz, Tong Zhang
- An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization by Qihang Lin, Zhaosong Lu, Lin Xiao
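My reconstruction of the regularised-ERM primal-dual setup these papers use (notation mine):

$$ P(w) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(x_i^\top w) + \lambda g(w), \qquad D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \lambda g^*\!\Big(\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i\Big), $$

where $f^*(y) = \sup_x\{\langle x, y\rangle - f(x)\}$ is the Fenchel-Legendre conjugate. Under strong duality $\max_\alpha D(\alpha) = \min_w P(w)$, and coordinate ascent on $D$ gives the SDCA family.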
In the SGD setting, convex solvers…
2.1.1 primal
Variance reduction; Katyusha momentum shrinks the iterate towards an infrequently updated snapshot point.
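A minimal SVRG-style variance-reduction sketch (the simpler relative of Katyusha, without the extra momentum term); the toy least-squares problem and all names are mine:

```python
import jax
import jax.numpy as jnp

def svrg(per_example_loss, w0, data, lr=0.1, outer=5, key=jax.random.PRNGKey(0)):
    # Variance reduction: each stochastic gradient at w is corrected by the
    # gradient of the same example at an infrequently updated snapshot w_tilde,
    # plus the full gradient mu at that snapshot.
    g = jax.grad(per_example_loss)
    full_g = jax.grad(
        lambda w: jnp.mean(jax.vmap(lambda row: per_example_loss(w, row))(data)))
    n = data.shape[0]
    w = w0
    for _ in range(outer):
        w_tilde, mu = w, full_g(w)          # snapshot point and its full gradient
        for _ in range(n):
            key, sub = jax.random.split(key)
            i = jax.random.randint(sub, (), 0, n)
            w = w - lr * (g(w, data[i]) - g(w_tilde, data[i]) + mu)
    return w

# Toy usage: least squares, each row is (features, target).
X = jax.random.normal(jax.random.PRNGKey(1), (100, 3))
y = X @ jnp.array([1.0, -2.0, 0.5])
rows = jnp.concatenate([X, y[:, None]], axis=1)
loss = lambda w, row: 0.5 * (row[:-1] @ w - row[-1]) ** 2
w_hat = svrg(loss, jnp.zeros(3), rows)
```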
2.1.2 dual
Random coordinate descent
- Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization by Shai Shalev-Shwartz and Tong Zhang
Rate: $\epsilon \propto 1/\sqrt{T}$, i.e. $T \propto 1/\epsilon^2$.
Note variance reduction comparison with Zdravko’s method.
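A minimal randomised coordinate descent sketch on a generic smooth objective (not the SDCA dual itself); names and the toy quadratic are mine:

```python
import jax
import jax.numpy as jnp

def random_coordinate_descent(f, w0, lr=0.1, steps=500, key=jax.random.PRNGKey(0)):
    # Each step updates a single, uniformly chosen coordinate.
    # For clarity this computes the full gradient and keeps one entry;
    # a real solver would evaluate only the j-th partial derivative.
    grad_f = jax.grad(f)
    w = w0
    for _ in range(steps):
        key, sub = jax.random.split(key)
        j = jax.random.randint(sub, (), 0, w.shape[0])
        w = w.at[j].add(-lr * grad_f(w)[j])
    return w

# Toy usage on a convex quadratic.
f = lambda w: jnp.sum((w - jnp.arange(3.0)) ** 2)
w_hat = random_coordinate_descent(f, jnp.zeros(3))
```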
2.2 Non-Convex
SVRG for finding local saddle points? Once again, shrinkage-like behaviour helps.
Neat quantification of non-convexity by the magnitude of the Hessian's negative eigenvalues (complements the strong-convexity parameter, which constrains the eigenvalues to be positive).
(Be careful: everything thus far was about stationary points, not necessarily minima.)
Local minima: escaping a saddle point effectively means finding the Hessian's most negative eigenvalue direction and moving along it. “Second-order smoothness.”
This can be done randomly, by SGD, or even offline GD + perturbation.
Hessian-vector product via autodiff isn’t so complex.
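A Hessian-vector product via autodiff really is short; a sketch (the toy objective is mine):

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # Differentiate the gradient of f along direction v: returns H(x) @ v
    # without ever forming the Hessian (forward-over-reverse autodiff).
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

f = lambda x: jnp.sum(x ** 4)            # toy smooth objective
x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(f, x, v))                      # matches jax.hessian(f)(x) @ v
```

Power iteration on a shifted Hessian, using only this product, recovers the most negative eigenvalue/direction needed to leave a saddle.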
2.3 Other matters arising
General question: variance reduction (better steps) versus momentum (longer good steps).
Why does coordinate descent work in dual formulation?
Objective smoothing is normal practice:
- But leads to an extra tuning parameter
- Can we simply reduce smoothing with time? Parallel with annealing
“One-point convexity”: incoherence-style properties as conditions under which a potentially nonconvex problem behaves like a convex one.
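My gloss of one common formalisation (possibly not exactly the speaker's): $f$ is one-point (strongly) convex towards a target $x^*$ on a region if

$$ \langle \nabla f(x),\, x - x^* \rangle \;\ge\; \mu\, \|x - x^*\|^2 \quad \text{for all $x$ in the region, some } \mu > 0, $$

so gradient descent still makes progress towards $x^*$ even though $f$ need not be convex anywhere else.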
3 Moitra — Robust statistics
Total variation distance between true distribution and adversary-perturbed one. Minimax framing. Higher-order outlier detection. (What destroys our covariance?)
Mahalanobis distance bounding total variation.
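For reference (standard definitions, my notation):

$$ d_{\mathrm{TV}}(P, Q) = \sup_A |P(A) - Q(A)| = \tfrac{1}{2}\int |p - q|\,d\mu. $$

For Gaussians sharing a covariance $\Sigma$, $d_{\mathrm{TV}}$ is a monotone function of the Mahalanobis distance $\|\Sigma^{-1/2}(\mu_1 - \mu_2)\|$, which is one way Mahalanobis distance controls total variation.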
Statistical Query Learning.
Part 2 more exciting — Belief Propagation, info-theoretical bounds. (Identifiability via information theory)
Kesten-Stigum bound.
Semidefinite (SDP) versus nonconvex.
Matrix factorization with noise.
Alternating minimisation is not robust to non-random additional information. Convex program still OK.
Nonconvex methods lack the robustness of convex relaxations.
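A minimal alternating-minimisation sketch for noisy low-rank matrix factorisation; the toy problem and all names are mine, not the speaker's setup:

```python
import jax
import jax.numpy as jnp

def alternating_minimisation(M, rank=2, iters=50, key=jax.random.PRNGKey(0)):
    # Alternate exact least-squares solves for U and V in ||M - U V^T||_F^2.
    n_rows, n_cols = M.shape
    V = jax.random.normal(key, (n_cols, rank))
    for _ in range(iters):
        U = M @ V @ jnp.linalg.inv(V.T @ V)     # best U given V
        V = M.T @ U @ jnp.linalg.inv(U.T @ U)   # best V given U
    return U, V

# Toy usage: a rank-2 matrix plus noise.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(1), 3)
M = jax.random.normal(k1, (20, 2)) @ jax.random.normal(k2, (2, 15))
M = M + 0.01 * jax.random.normal(k3, (20, 15))
U, V = alternating_minimisation(M)
```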
4 Seq2Seq
Scheduled sampling.
XeXT — crossfade between cross-entropy and expected prediction loss.
Annealing methods again — blur the target distribution. This is a recurring theme.
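A sketch of scheduled sampling's decoding loop; the toy table-lookup decoder and all names are mine. Annealing `eps` towards zero over training is the blur-the-target theme again:

```python
import jax
import jax.numpy as jnp

def decode_scheduled_sampling(step, params, y_true, eps, key):
    # Scheduled sampling: feed the ground-truth previous token with prob. eps,
    # otherwise feed the model's own previous prediction; anneal eps -> 0
    # over training so the model learns to consume its own outputs.
    prev, logits_seq = y_true[0], []
    for t in range(1, y_true.shape[0]):
        logits = step(params, prev)
        logits_seq.append(logits)
        key, sub = jax.random.split(key)
        use_truth = jax.random.bernoulli(sub, eps)
        prev = jnp.where(use_truth, y_true[t], jnp.argmax(logits))
    return jnp.stack(logits_seq)

# Toy "decoder": a table of logits per previous token (purely illustrative).
vocab = 5
params = jax.random.normal(jax.random.PRNGKey(0), (vocab, vocab))
step = lambda params, prev: params[prev]
y_true = jnp.array([0, 1, 2, 3, 4])
logits = decode_scheduled_sampling(step, params, y_true, eps=0.75,
                                   key=jax.random.PRNGKey(1))
```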
5 Deep structured prediction workshop
https://deepstruct.github.io/ICML17/schedule/
Andrew McCallum (UMass), structured prediction energy networks.
Q: When can I replace a Markov network with a DAG by adding hidden factors?
Consider, e.g., the Ising model: the supporting structure must be large; the Markov sampler is one possible supporting network.
Analogy with “policy networks” or student forcing.
Can you train a model to sample from the desired posterior?
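For concreteness, the kind of Markov-chain (single-site Gibbs) sampler I have in mind for the Ising model; my sketch, not anything from the talk:

```python
import jax
import jax.numpy as jnp

def gibbs_ising(key, beta=0.4, size=8, sweeps=20):
    # Single-site Gibbs sampling for p(s) proportional to exp(beta * sum of
    # s_i s_j over neighbouring pairs) on a periodic 2D grid.
    key, sub = jax.random.split(key)
    s = 2.0 * jax.random.bernoulli(sub, 0.5, (size, size)) - 1.0
    for _ in range(sweeps):
        for i in range(size):
            for j in range(size):
                nbrs = (s[(i - 1) % size, j] + s[(i + 1) % size, j]
                        + s[i, (j - 1) % size] + s[i, (j + 1) % size])
                p_up = jax.nn.sigmoid(2.0 * beta * nbrs)   # P(s_ij = +1 | rest)
                key, sub = jax.random.split(key)
                s = s.at[i, j].set(jnp.where(jax.random.uniform(sub) < p_up, 1.0, -1.0))
    return s

sample = gibbs_ising(jax.random.PRNGKey(0))
```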
5.1 Dieterich Lawson, Filtering Variational objective
IWAE (importance-weighted autoencoder) is unsuited to sequential models. Their FIVO method is supposed to beat particle filters. High-dimensional hidden state.
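For reference, the IWAE bound (standard form); FIVO as I understood it keeps the log-of-average shape but the inner quantity is a particle filter's resampled estimate of the sequence marginal likelihood:

$$ \mathcal{L}_K \;=\; \mathbb{E}_{z_1,\dots,z_K \sim q(\cdot\mid x)} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k \mid x)} \right] \;\le\; \log p(x). $$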
5.2 Ryan Adams
Magic plug to get likelihoods/graphical models into deep nets: Stochastic Variational Inference + Variational Autoencoder.
What is the natural gradient? What is the reparameterisation trick? Conditionally conjugate autodiff via proximal operators.
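A reparameterisation-trick sketch for the Gaussian case (my toy example, not from the talk):

```python
import jax
import jax.numpy as jnp

def sample_gaussian(key, mu, log_sigma):
    # Reparameterisation: z ~ N(mu, sigma^2) written as a deterministic
    # function of (mu, sigma) and parameter-free noise, so gradients of an
    # expectation can flow through the sample.
    eps = jax.random.normal(key, mu.shape)
    return mu + jnp.exp(log_sigma) * eps

# Gradient of E[f(z)] w.r.t. mu, estimated by differentiating through the sample.
f = lambda z: jnp.sum(z ** 2)
key = jax.random.PRNGKey(0)
grad_mu = jax.grad(lambda mu: f(sample_gaussian(key, mu, jnp.zeros(3))))(jnp.ones(3))
```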
Low-dimensional latent structure; high-dimensional function approximation. I guess this is autoregressive-flow-like?
Building Probabilistic Structure into Massively Parameterized Models. Simultaneously “Finding a low dimensional manifold and dynamics constrained to that manifold”; comparison with sparse coding.
Koopman operators.
5.3 Sujith Ravi, Neural Graph Learning
https://research.googleblog.com/2016/10/graph-powered-machine-learning-at-google.html
Expander: Google’s graph thingy.
“Transductive and inductive learning” too restrictive.
Streaming approximations to infer graphs. Loss functions include terms encouraging new unlabelled data to be classified similarly to existing labelled data.
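My gloss of that kind of graph-regularised objective (notation and exact form mine):

$$ \mathcal{L}(\theta) \;=\; \sum_{i \,\in\, \text{labelled}} \ell\big(y_i, f_\theta(x_i)\big) \;+\; \alpha \sum_{(u,v) \in E} w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big), $$

pushing graph neighbours, labelled or not, towards similar representations/predictions.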
Learn links rather than embeddings that attempt to summarize links (boo word2vec).
But I don’t think this uncovers graphs per se; it just uses interaction graphs as inputs.
Google’s Expander framework isn’t open source though.
Doesn’t need dense GPU algebra because it exploits structure.
“We are always in the low data situation because otherwise if we have lots of data we just increase the model complexity.”
Ryan Adams: CNNs are effectively priors encoding translation invariance. Structured prediction is a method for doing the same for other invariances.
6 Time series workshop
http://roseyu.com/time-series-workshop/#papers
6.1 Robert Bamler
Structured Black Box Variational Inference for Latent Time Series Models.
Connection to Archer et al.
Interesting explanation of the forward/backward Markov process.
Gauss-Markov model (GP?) works
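For reference, the linear-Gaussian (Gauss-Markov) state-space model I take this to mean:

$$ z_t = A z_{t-1} + \epsilon_t,\ \ \epsilon_t \sim \mathcal{N}(0, Q); \qquad x_t = C z_t + \nu_t,\ \ \nu_t \sim \mathcal{N}(0, R). $$

Its posterior over $z_{1:T}$ is Gaussian and Markov in both the forward and backward directions, which is presumably what the forward/backward factorisation exploits.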