Shambolic notes from ICML 2017, Sydney, Australia.

http://proceedings.mlr.press/v70/

1 Questions arising

  • Objective smoothing, changing: can I do more of it?

2 Allen-Zhu, Tutorial on optimisation

2.1 Convex

Fenchel-Legendre Duality (convexity preservation)

Primal-dual formulation

Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization by Shai Shalev-Shwartz, Tong Zhang

An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization by Qihang Lin, Zhaosong Lu, Lin Xiao

In the SGD setting, convex solvers…

2.1.1 Primal

Variance reduction;

Katyusha momentum shrinks the iterate towards an infrequently updated snapshot point.
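To pin down the mechanics, the sketch below combines the SVRG variance-reduced gradient estimator with a pull of the iterate towards the snapshot, as a caricature of that Katyusha momentum. It is not Allen-Zhu’s actual Katyusha update or parameter schedule; `loss_i`, `tau` and the loop lengths are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def svrg_sketch(loss_i, x0, data, epochs=10, inner_steps=50, lr=0.1, tau=0.1, seed=0):
    """Sketch of SVRG with an optional pull towards the snapshot.

    loss_i(x, datum) is an assumed per-example loss; tau > 0 caricatures the
    'Katyusha momentum' pull towards the infrequently updated snapshot point.
    This is NOT Allen-Zhu's Katyusha parameterisation, just the shape of the idea.
    """
    grad_i = jax.grad(loss_i)
    key = jax.random.PRNGKey(seed)
    n = len(data)
    x = x0
    for _ in range(epochs):
        snapshot = x
        # full gradient at the (infrequently updated) snapshot
        full_grad = sum(grad_i(snapshot, d) for d in data) / n
        for _ in range(inner_steps):
            key, sub = jax.random.split(key)
            i = int(jax.random.randint(sub, (), 0, n))
            # variance-reduced gradient estimate: unbiased, and its variance
            # shrinks as x approaches the snapshot
            g = grad_i(x, data[i]) - grad_i(snapshot, data[i]) + full_grad
            x = x - lr * g
            x = (1.0 - tau) * x + tau * snapshot  # shrink towards the snapshot
    return x

# e.g. least squares on toy data, assuming data = [(a_i, y_i), ...]
# loss_i = lambda x, d: 0.5 * (jnp.dot(d[0], x) - d[1]) ** 2
```

With `tau = 0` this reduces to plain SVRG.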

2.1.2 Dual

Random coordinate descent

  1. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization by Shai Shalev-Shwartz and Tong Zhang

Rate: ε ≃ 1/√T, i.e. T ≃ 1/ε².

Note variance reduction comparison with Zdravko’s method.
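For concreteness, here is a minimal sketch of dual coordinate ascent in the SDCA style, specialised to ridge regression, where the single-coordinate dual maximisation has a closed form. The regulariser and loop counts are illustrative.

```python
import jax
import jax.numpy as jnp

def sdca_ridge(A, y, lam=0.1, epochs=20, seed=0):
    """Minimal SDCA sketch for ridge regression:
    min_w (1/n) sum_i 0.5*(a_i.w - y_i)^2 + (lam/2)*||w||^2.
    Each step maximises the dual over one randomly chosen coordinate alpha_i,
    which has a closed form for the squared loss; the primal iterate is
    maintained as w = (1/(lam*n)) * sum_i alpha_i * a_i."""
    n, d = A.shape
    alpha = jnp.zeros(n)
    w = jnp.zeros(d)
    key = jax.random.PRNGKey(seed)
    for _ in range(epochs * n):
        key, sub = jax.random.split(key)
        i = int(jax.random.randint(sub, (), 0, n))
        a_i = A[i]
        # closed-form dual coordinate update for the squared loss
        delta = (y[i] - jnp.dot(a_i, w) - alpha[i]) / (1.0 + jnp.dot(a_i, a_i) / (lam * n))
        alpha = alpha.at[i].add(delta)
        w = w + (delta / (lam * n)) * a_i
    return w

# Sanity check: w should approach the ridge solution
# jnp.linalg.solve(A.T @ A / n + lam * jnp.eye(d), A.T @ y / n).
```

Each update touches one dual coordinate and the corresponding rank-one change to the primal iterate, which is why random coordinate steps are cheap here.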

2.2 Non-Convex

SVRG for finding stationary points in the non-convex setting? Once again, shrinkage-like behaviour helps.

Neat quantification of non-convexity by how negative the Hessian eigenvalues can be (complementing the strong-convexity parameter, which lower-bounds the positive eigenvalues).

(Be careful: everything thus far was about stationary points, not necessarily minima.)

Finding local minima effectively involves finding the most negative eigenvalue of the Hessian in order to leave the saddle point; this requires “second-order smoothness” (a Lipschitz Hessian).

This can be done randomly, by SGD, or even offline GD + perturbation.
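A loose sketch of the last option (plain GD plus a perturbation near stationary points, in the spirit of Jin et al. 2017, cited below): when the gradient is tiny, kick the iterate with isotropic noise so descent can resume along a negative-curvature direction if one exists. The step sizes, tolerances and toy objective are illustrative, not the paper’s schedule.

```python
import jax
import jax.numpy as jnp

def perturbed_gd(f, x0, steps=2000, lr=0.05, g_tol=1e-4, radius=1e-2, seed=0):
    """Gradient descent that perturbs the iterate whenever the gradient is
    tiny (i.e. near a stationary point), so it can roll off strict saddles."""
    grad_f = jax.grad(f)
    key = jax.random.PRNGKey(seed)
    x = x0
    for _ in range(steps):
        g = grad_f(x)
        if jnp.linalg.norm(g) < g_tol:
            key, sub = jax.random.split(key)
            x = x + radius * jax.random.normal(sub, x.shape)  # random kick off the saddle
        else:
            x = x - lr * g
    return x

# Toy objective with a strict saddle at the origin and minima at (0, ±1):
f = lambda x: 0.5 * x[0] ** 2 - 0.5 * x[1] ** 2 + 0.25 * x[1] ** 4
print(perturbed_gd(f, jnp.zeros(2)))  # escapes the saddle, ends near (0, 1) or (0, -1)
```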

Hessian-vector product via autodiff isn’t so complex.
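For instance, in JAX a Hessian-vector product is just two composed differentiation passes and never materialises the Hessian; the toy objective here is a stand-in.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 4) - jnp.sum(x ** 2)  # toy nonconvex objective

def hvp(f, x, v):
    # H(x) @ v by differentiating grad(f) in the direction v
    # (forward-over-reverse); the Hessian itself is never formed.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, -2.0, 0.5])
v = jnp.array([0.0, 1.0, 0.0])
# building block for estimating extreme Hessian eigenvalues,
# e.g. by power iteration on a shifted operator
print(hvp(f, x, v))
```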

2.3 Other matters arising

  • Generalising the question of variance reduction (better steps) versus momentum (longer good steps)

  • Why does coordinate descent work in the dual formulation?

  • Objective smoothing is standard practice

    • But leads to an extra tuning parameter
    • Can we simply reduce smoothing with time? Parallel with annealing
  • “One-point convexity”: incoherence properties as conditions under which a nominally nonconvex problem behaves like a convex one.

3 Moitra, Robust statistics

Total variation distance between true distribution and adversary-perturbed one. Minimax framing. Higher-order outlier detection. (What destroys our covariance?)

Mahalanobis distance bounding total variation.
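As a caricature of that idea (outliers are what destroy the covariance, so look at Mahalanobis distances): iteratively drop points far from the bulk and re-estimate. This is a naive filtering sketch, not the estimators from the tutorial (e.g. Diakonikolas et al., cited below); the threshold is an arbitrary illustrative choice.

```python
import jax.numpy as jnp

def naive_filtered_mean(X, rounds=5, thresh=4.0):
    """Naive sketch of filter-style robust mean estimation: points with a
    large Mahalanobis distance from the current empirical mean/covariance
    are dropped, then mean and covariance are re-estimated."""
    for _ in range(rounds):
        mu = X.mean(axis=0)
        cov = jnp.cov(X, rowvar=False) + 1e-6 * jnp.eye(X.shape[1])
        centred = X - mu
        d2 = jnp.einsum("ij,jk,ik->i", centred, jnp.linalg.inv(cov), centred)
        X = X[d2 < thresh ** 2]   # keep only points near the bulk
    return X.mean(axis=0)
```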

Statistical Query Learning.

Part 2 was more exciting: Belief Propagation, information-theoretic bounds. (Identifiability via information theory.)

Kesten–Stigum bound.

Semidefinite (SDP) versus nonconvex.

Matrix factorization with noise.

Alternating minimisation is not robust to non-random additional information. Convex program still OK.

Nonconvex methods lack the robustness of convex relaxations.

4 Seq2Seq

Scheduled sampling.

XeXT — crossfade between cross-entropy and expected prediction loss.

Annealing methods again — blur the target distribution. This is a recurring theme.
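A hedged sketch of that crossfade, with an assumed linear schedule (the schedule and names are mine, not from the talks): start from the smoothed, easy objective and anneal towards the one we actually care about.

```python
import jax.numpy as jnp

def annealed_loss(xent, task_loss, step, warmup=10_000):
    """Crossfade from cross-entropy (easy, dense gradients) to the expected
    task loss (what we actually care about) as training progresses.
    The linear schedule is illustrative."""
    alpha = jnp.clip(step / warmup, 0.0, 1.0)  # 0 = pure cross-entropy, 1 = pure task loss
    return (1.0 - alpha) * xent + alpha * task_loss
```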

5 Deep structured prediction workshop

https://deepstruct.github.io/ICML17/schedule/

Andrew McCallum (UMass), structured prediction energy networks.

Q: When can I replace a Markov network with a DAG by introducing hidden factors?

Consider, e.g., the Ising model: the supporting structure must be large; a Markov-chain sampler is one possible supporting network.

Analogy with “policy networks” or student forcing.

Can you train a model to sample from the desired posterior?

5.1 Dieterich Lawson, Filtering Variational Objectives

IWAE (importance-weighted autoencoder) bounds are ill-suited to sequential models. Their FIVO objective is instead built on a particle filter’s likelihood estimate and is supposed to beat IWAE-style bounds. High-dimensional hidden state.
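For reference, the plain IWAE bound is just the log of the average importance weight over K independent proposal samples; a toy Gaussian version with an assumed model and proposal is below. FIVO instead builds the estimate sequentially with a particle filter (resampling between steps), which is what makes it fit sequential models.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def iwae_bound(key, x, K=8):
    """Toy IWAE bound on log p(x) for z ~ N(0,1), x|z ~ N(z,1),
    with a deliberately rough proposal q(z|x) = N(x/2, 1)."""
    z = x / 2.0 + jax.random.normal(key, (K,))                       # z_k ~ q(z|x)
    log_joint = -0.5 * z ** 2 - 0.5 * (x - z) ** 2 - jnp.log(2 * jnp.pi)
    log_q = -0.5 * (z - x / 2.0) ** 2 - 0.5 * jnp.log(2 * jnp.pi)
    return logsumexp(log_joint - log_q) - jnp.log(K)                 # log (1/K) sum_k w_k

print(iwae_bound(jax.random.PRNGKey(0), x=1.3))  # lower-bounds log p(x) = log N(1.3; 0, 2)
```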

5.2 Ryan Adams

Magic plug to get likelihoods/graphical models into deep nets: Stochastic Variational Inference + Variational Autoencoder.

What is the natural gradient? What is the reparameterisation trick? Conditionally conjugate autodiff via proximal operators.
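Quick answer to the second of those, as a sketch: reparameterise z ~ N(μ, σ²) as z = μ + σε with ε ~ N(0, 1), so that a Monte Carlo estimate of E_q[f(z)] is differentiable in (μ, log σ). The test function and parameter values below are illustrative.

```python
import jax
import jax.numpy as jnp

def reparam_grad(key, mu, log_sigma, f, n_samples=10_000):
    """Gradient of E_{z ~ N(mu, sigma^2)}[f(z)] w.r.t. (mu, log_sigma)
    via the reparameterisation trick: sample eps ~ N(0,1) and push
    gradients through z = mu + sigma * eps."""
    def objective(params):
        mu, log_sigma = params
        eps = jax.random.normal(key, (n_samples,))
        z = mu + jnp.exp(log_sigma) * eps       # differentiable in the parameters
        return jnp.mean(f(z))
    return jax.grad(objective)((mu, log_sigma))

# Sanity check with f(z) = z^2: E[z^2] = mu^2 + sigma^2, so the exact gradient
# at mu=0, log_sigma=0 is (2*mu, 2*sigma^2) = (0, 2).
print(reparam_grad(jax.random.PRNGKey(0), 0.0, 0.0, lambda z: z ** 2))
```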

Low-dimensional latent structure; high-dimensional function approximation. I guess this is autoregressive-flow-like?

Building Probabilistic Structure into Massively Parameterized Models. Simultaneously “Finding a low dimensional manifold and dynamics constrained to that manifold”; comparison with sparse coding.

Koopman operators.

5.3 Sujith Ravi, Neural Graph Learning

https://research.googleblog.com/2016/10/graph-powered-machine-learning-at-google.html

Expander: Google’s graph thingy.

“Transductive and inductive learning” too restrictive.

Streaming approximations to infer graphs. Loss functions include terms encouraging new unlabelled data to be classified similarly to nearby labelled data.
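A hedged sketch of that sort of objective: a supervised term on labelled nodes plus a penalty for neighbouring nodes receiving dissimilar predictions. The names, shapes and weight α are illustrative assumptions, not Expander’s API.

```python
import jax
import jax.numpy as jnp

def graph_regularised_loss(logits, labels, is_labelled, edges, edge_w, alpha=0.1):
    """Cross-entropy on labelled nodes + a graph term asking neighbouring
    nodes (labelled or not) to receive similar predicted distributions.
    logits: (N, C); labels: (N,) ints; is_labelled: (N,) 0/1;
    edges: (E, 2) node index pairs; edge_w: (E,) edge weights."""
    log_p = jax.nn.log_softmax(logits)
    picked = jnp.take_along_axis(log_p, labels[:, None], axis=1)[:, 0]
    xent = -jnp.sum(is_labelled * picked) / jnp.maximum(is_labelled.sum(), 1)
    p = jnp.exp(log_p)
    # squared distance between predicted distributions of adjacent nodes
    disagreement = jnp.sum(edge_w * jnp.sum((p[edges[:, 0]] - p[edges[:, 1]]) ** 2, axis=-1))
    return xent + alpha * disagreement
```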

Learn links rather than embeddings that attempt to summarize links (boo word2vec).

But I don’t think this uncovers graphs per se; it just uses interaction graphs as inputs.

Google’s Expander framework isn’t open source though.

Doesn’t need dense GPU algebra because it exploits structure.

“We are always in the low data situation because otherwise if we have lots of data we just increase the model complexity.”

Ryan Adams: CNNs are effectively priors encoding translation invariance; structured prediction is one way to capture other invariances.

6 Time series workshop

http://roseyu.com/time-series-workshop/#papers

6.1 Robert Bamler

Structured Black Box Variational Inference for Latent Time Series Models.

Connection to Archer et al.

Interesting explanation of the forward/backward Markov process.

A Gauss–Markov model (a GP?) works as the structured variational family.
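For reference, a first-order Gauss–Markov chain (an AR(1) process, which is indeed a Gaussian process) is the kind of tractable structured family in question; a tiny sampler with illustrative parameters:

```python
import jax
import jax.numpy as jnp

def sample_gauss_markov(key, T, a=0.9, q=0.1):
    """Sample z_1..z_T from the Gauss-Markov chain z_t = a * z_{t-1} + eps_t,
    eps_t ~ N(0, q), starting from z_0 = 0. Parameters are illustrative."""
    eps = jnp.sqrt(q) * jax.random.normal(key, (T,))
    def step(z_prev, e):
        z = a * z_prev + e
        return z, z
    _, zs = jax.lax.scan(step, 0.0, eps)
    return zs

print(sample_gauss_markov(jax.random.PRNGKey(0), T=5))
```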

7 References

Achab, Bacry, Gaïffas, et al. 2017. “Uncovering Causality from Multivariate Hawkes Integrated Cumulants.” In PMLR.
Allen-Zhu. 2017. “Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter.” In PMLR.
Allen-Zhu, and Li. 2017a. “Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition.” In PMLR.
———. 2017b. “Faster Principal Component Regression and Stable Matrix Chebyshev Approximation.” In PMLR.
———. 2017c. “Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU.” In PMLR.
Allen-Zhu, Li, Singh, et al. 2017. “Near-Optimal Design of Experiments via Regret Minimization.” In PMLR.
Anderson, and Gu. 2017. “An Efficient, Sparsity-Preserving, Online Algorithm for Low-Rank Approximation.” In PMLR.
Archer, Park, Buesing, et al. 2015. “Black Box Variational Inference for State Space Models.” arXiv:1511.07367 [Stat].
Avron, Kapralov, Musco, et al. 2017. “Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees.” In PMLR.
Balduzzi, Frean, Leary, et al. 2017. “The Shattered Gradients Problem: If Resnets Are the Answer, Then What Is the Question?” In PMLR.
Bamler, and Mandt. 2017. “Structured Black Box Variational Inference for Latent Time Series Models.” arXiv:1707.01069 [Cs, Stat].
Belilovsky, Kastner, Varoquaux, et al. 2017. “Learning to Discover Sparse Graphical Models.” In PMLR.
Charikar, Steinhardt, and Valiant. 2016. “Learning from Untrusted Data.” arXiv:1611.02315 [Cs, Math, Stat].
Chierichetti, Gollapudi, Kumar, et al. 2017. “Algorithms for ℓp Low-Rank Approximation.” In PMLR.
Cutajar, Bonilla, Michiardi, et al. 2017. “Random Feature Expansions for Deep Gaussian Processes.” In PMLR.
Diakonikolas, Kamath, Kane, et al. 2016. “Robust Estimators in High Dimensions Without the Computational Intractability.” arXiv:1604.06443 [Cs, Math, Stat].
Diakonikolas, Kamath, Kane, et al. 2017. “Being Robust (in High Dimensions) Can Be Practical.” arXiv:1703.00893 [Cs, Math, Stat].
Domke. 2017. “A Divergence Bound for Hybrids of MCMC and Variational Inference and an Application to Langevin Dynamics and SGVI.” In PMLR.
Engel, Resnick, Roberts, et al. 2017. “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” In PMLR.
Ingraham, and Marks. 2017. “Variational Inference for Sparse and Undirected Models.” In PMLR.
Jin, Ge, Netrapalli, et al. 2017. “How to Escape Saddle Points Efficiently.” In PMLR.
Jing, Shen, Dubcek, et al. 2017. “Tunable Efficient Unitary Neural Networks (EUNN) and Their Application to RNNs.” In PMLR.
Kocaoglu, Dimakis, and Vishwanath. 2017. “Cost-Optimal Learning of Causal Graphs.” In PMLR.
Krishnan, Shalit, and Sontag. 2015. “Deep Kalman Filters.” arXiv Preprint arXiv:1511.05121.
———. 2017. “Structured Inference Networks for Nonlinear State Space Models.” In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
Li. 2017. “Robust Sparse Estimation Tasks in High Dimensions.” arXiv:1702.05860 [Cs].
Liu, Lugosi, Neu, et al. 2017. “Algorithmic Stability and Hypothesis Complexity.” In PMLR.
Louizos, and Welling. 2017. “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR.
Maddison, Lawson, Tucker, et al. 2017. “Filtering Variational Objectives.” arXiv Preprint arXiv:1705.09279.
Mei, Castro, Goude, et al. 2017. “Nonnegative Matrix Factorization for Time Series Recovery From a Few Temporal Aggregates.” In PMLR.
Mhammedi, Hellicar, Rahman, et al. 2017. “Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In PMLR.
Moitra, Perry, and Wein. 2015. “How Robust Are Reconstruction Thresholds for Community Detection?” arXiv:1511.01473 [Cs, Math, Stat].
Naesseth, Linderman, Ranganath, et al. 2017. “Variational Sequential Monte Carlo.” arXiv Preprint arXiv:1705.11140.
Pan, Yan, Theodorou, et al. 2017. “Prediction Under Uncertainty in Sparse Spectrum Gaussian Processes with Applications to Filtering and Control.” In PMLR.
Taieb, Taylor, and Hyndman. 2017. “Coherent Probabilistic Forecasts for Hierarchical Time Series.” In PMLR.
Telgarsky. 2017. “Neural Networks and Rational Functions.” In PMLR.
Vaswani, Kveton, Wen, et al. 2017. “Model-Independent Online Learning for Influence Maximization.” In PMLR.
Vorontsov, Trabelsi, Kadoury, et al. 2017. “On Orthogonality and Learning Recurrent Networks with Long Term Dependencies.” In PMLR.