Garbled highlights from NIPS 2016

2016-12-05 — 2017-02-03

Wherein the conference is traversed and sessions are catalogued, and Structured Orthogonal Random Features is presented, reducing kernel approximation time from O(d^2) to O(d log d) and speeding computation.

conference

neural nets

statistics

Full paper listing.

Snippets noted for future reference.

1 Time series workshop

Time series workshop home page, and the nonstationary time series tutorial with video.

Luminaries:

Mehryar Mohri
Yan Liu
Andrew Nobel
Inderjit Dhillon
Stephen Roberts

Vitaly Kuznetsov and Mehryar Mohri introduced me to learning theory for time series.

Mehryar Mohri presented his online-learning time series analysis using mixtures of experts through empirical discrepancy. He had me up until the model selection phase when I got lost in a recursive argument. Will come back to this.

Yan Liu — FDA approaches, Hawkes models, clustering of time series. Large section on subspace clustering, which I guess I need to comprehend at some point. Time is special because it reflects the arrow of entropy. Also, it can give us a notion of real causality.

Andrew B. Nobel: the importance of misspecification in time series models, with respect to the compounding of the problem over time and the increased difficulty of validating assumptions. Time is special because it compounds error. P.S. Why not more focus on algorithm failure cases? The NIPS conference dynamic doesn’t encourage falsification.

Mohri: time is special because i.i.d. data is a special case thereof. “Prediction” here really is about future states. (How do you do inference of “true models” in his formalism?)

I missed the name of one Bayesian presenter, who asked:

Why not use DNNs to construct features? How can the feature construction of DNNs be plugged into Bayesian models? BTW, Bayesian nonparametrics are still state of the art for general time series.

2 Generative Adversarial models

Now well covered elsewhere.

3 MetaGrad: Multiple Learning rates in Online Learning

Tim van Erven, Wouter M Koolen

Learn the correct learning rate by simultaneously trying many.

Question: Why is this online-specific?
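The "try many learning rates at once" idea can be sketched in a few lines. This is not MetaGrad itself, which combines its learners through purpose-built surrogate losses; it is a simplified stand-in that runs one online gradient descent learner per rate and mixes them with plain exponential weights. The function name `multi_rate_ogd`, the squared loss, the toy data and the grid of rates are all assumptions made for illustration.

```python
import numpy as np

def multi_rate_ogd(xs, ys, etas, master_rate=1.0):
    """Online least squares with one gradient-descent learner per learning rate.

    A master mixes the learners' predictions by exponential weights on their
    cumulative squared loss. Returns (cumulative master loss, master weights).
    """
    etas = np.asarray(etas, dtype=float)
    W = np.zeros((len(etas), xs.shape[1]))   # one weight vector per rate
    log_p = np.zeros(len(etas))              # unnormalised log master weights
    total_loss = 0.0
    for x, y in zip(xs, ys):
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        preds = W @ x                        # every learner predicts
        total_loss += 0.5 * (p @ preds - y) ** 2
        log_p -= master_rate * 0.5 * (preds - y) ** 2   # reweight learners
        # each learner takes a gradient step with its own rate
        W -= etas[:, None] * (preds - y)[:, None] * x[None, :]
    p = np.exp(log_p - log_p.max())
    return total_loss, p / p.sum()

rng = np.random.default_rng(0)
xs = rng.standard_normal((200, 5))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)   # unit-norm inputs keep every rate stable
ys = xs @ rng.standard_normal(5)                  # noise-free linear target
etas = [0.001, 0.01, 0.1, 1.0]
total_loss, weights = multi_rate_ogd(xs, ys, etas)
print(total_loss, weights)
```

The master never needs to know the good rate in advance; its weights drift towards whichever learners have been predicting well so far.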

4 Structured Orthogonal Random Features

I forget who presented F. X. Yu et al. (2016):

We present an intriguing discovery related to Random Fourier Features: replacing multiplication by a random Gaussian matrix with multiplication by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for its effectiveness. Motivated by the discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from \(\mathcal{O}(d^2)\) to \(\mathcal{O}(d \log d)\), where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF.

Leads naturally to the question of how to manage other types of correlation. How about time series?
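A minimal NumPy sketch of the three projections the abstract compares: plain Gaussian random Fourier features, ORF and SORF. The scalings (chi-distributed row norms for ORF, and \(\sqrt{d}\,HD_1HD_2HD_3\) for SORF) are written as I recall them from the paper and should be treated as assumptions; all function names are hypothetical, and the data dimension must be a power of two for the Hadamard transform.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Exact Gaussian kernel matrix exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-d2 / (2 * sigma ** 2))

def fourier_features(Z):
    """Map projections Z = X W^T to [cos, sin] features, so K ~ Phi Phi^T."""
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(Z.shape[1])

def gaussian_project(X, rng, sigma=1.0):
    """Classic random Fourier features: W has i.i.d. N(0, 1/sigma^2) entries."""
    d = X.shape[1]
    return X @ (rng.standard_normal((d, d)) / sigma).T

def orf_project(X, rng, sigma=1.0):
    """ORF: W = S Q / sigma with Q random orthogonal, S chi-distributed row norms."""
    d = X.shape[1]
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))                  # sign fix for a uniformly distributed Q
    s = np.sqrt(rng.chisquare(df=d, size=d))     # norms that Gaussian rows would have
    return X @ (s[:, None] * Q / sigma).T

def fwht(A):
    """Normalised fast Walsh-Hadamard transform of each row (length a power of 2)."""
    n, d = A.shape
    h = 1
    while h < d:
        A = A.reshape(n, d // (2 * h), 2, h)
        a, b = A[:, :, 0, :], A[:, :, 1, :]
        A = np.stack([a + b, a - b], axis=2)     # butterfly step
        h *= 2
    return A.reshape(n, d) / np.sqrt(d)

def sorf_project(X, rng, sigma=1.0):
    """SORF: W = sqrt(d)/sigma * H D1 H D2 H D3, applied in O(d log d) per row."""
    d = X.shape[1]
    Z = X
    for _ in range(3):
        Z = fwht(Z * rng.choice([-1.0, 1.0], size=d))   # one H D block
    return np.sqrt(d) / sigma * Z

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = gaussian_kernel(X)
errors = {}
for name, project in [("gaussian", gaussian_project),
                      ("orf", orf_project),
                      ("sorf", sorf_project)]:
    Phi = fourier_features(project(X, rng))
    errors[name] = np.mean(np.abs(Phi @ Phi.T - K))
print(errors)
```

The SORF path never forms a \(d \times d\) matrix: each of the three blocks is a sign flip followed by a fast transform, which is where the \(\mathcal{O}(d \log d)\) cost comes from.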

5 Universal Correspondence Network

I forgot who presented Choy et al. (2016), which integrates geometric transforms into CNNs in a reasonably natural way:

We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity.

Cries out for a musical implementation.

6 Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Tim Salimans presents the simplest paper at NIPS, Salimans and Kingma (2016):

We present weight normalization: a reparameterisation of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterising the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterisation is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

An elaborate motivation for a conceptually and practically simple (a couple of lines of code) replacement for batch normalisation.
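The "couple of lines" can be sketched as follows: write each weight vector as \(w = g\, v / \lVert v \rVert\) and do gradient descent on \((v, g)\) instead of \(w\). The toy least-squares problem, the step size and the function names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def weight_norm(v, g):
    """Weight vector w = g * v / ||v||: g sets the length, v the direction."""
    return g * v / np.linalg.norm(v)

def weight_norm_grads(v, g, grad_w):
    """Chain rule from dL/dw to (dL/dv, dL/dg)."""
    norm_v = np.linalg.norm(v)
    grad_g = grad_w @ v / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v ** 2) * v
    return grad_v, grad_g

# Toy least-squares problem, optimised over (v, g) rather than over w.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
v, g = rng.standard_normal(3), 1.0
lr = 0.1
for _ in range(500):
    w = weight_norm(v, g)
    grad_w = X.T @ (X @ w - y) / len(y)     # gradient of 0.5 * mean squared error
    grad_v, grad_g = weight_norm_grads(v, g, grad_w)
    v, g = v - lr * grad_v, g - lr * grad_g
w = weight_norm(v, g)
print(w)
```

Note that `grad_v` is always orthogonal to `v`, so updates to `v` only rotate the direction (and slowly grow its norm) while `g` alone controls the length.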

7 Relevant sparse codes with variational information bottleneck

Matthew Chalk presents Chalk, Marre, and Tkacik (2016).

In many applications, it is desirable to extract only the relevant aspects of data. A principled way to do this is the information bottleneck (IB) method, where one seeks a code that maximises information about a relevance variable, Y, while constraining the information encoded about the original data, X. Unfortunately however, the IB method is computationally demanding when data are high-dimensional and/or non-Gaussian. Here we propose an approximate variational scheme for maximising a lower bound on the IB objective, analogous to variational EM. Using this method, we derive an IB algorithm to recover features that are both relevant and sparse. Finally, we demonstrate how kernelised versions of the algorithm can be used to address a broad range of problems with non-linear relation between X and Y.

This one is a cool demo machine.
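For orientation, the trade-off described in the abstract is usually written as a Lagrangian. The symbols below are assumptions chosen for illustration and not necessarily the paper's notation: \(Z\) is the code, and \(\gamma \ge 0\) sets how strongly information about \(X\) is penalised.

\[
\max_{p(z \mid x)} \; I(Z; Y) - \gamma\, I(Z; X)
\]

The variational scheme in the paper then maximises a tractable lower bound on this objective rather than the objective itself.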

8 Dense Associative Memory for Pattern Recognition

Dmitry Krotov presents Krotov and Hopfield (2016), a.k.a. Hopfield 2.0:

We propose a model of associative memory having an unusual mathematical structure. Contrary to the standard case, which works well only in the limit when the number of stored memories is much smaller than the number of neurons, our model stores and reliably retrieves many more patterns than the number of neurons in the network. We propose a simple duality between this dense associative memory and neural networks commonly used in models of deep learning. On the associative memory side of this duality, a family of models that smoothly interpolates between two limiting cases can be constructed. One limit is referred to as the feature-matching mode of pattern recognition, and the other one as the prototype regime. On the deep learning side of the duality, this family corresponds to neural networks with one hidden layer and various activation functions, which transmit the activities of the visible neurons to the hidden layer. This family of activation functions includes logistics, rectified linear units, and rectified polynomials of higher degrees. The proposed duality makes it possible to apply energy-based intuition from associative memory to analyze computational properties of neural networks with unusual activation functions — the higher rectified polynomials which until now have not been used for training neural networks. The utility of the dense memories is illustrated for two test cases: the logical gate XOR and the recognition of handwritten digits from the MNIST data set.
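A small sketch of the retrieval dynamics, assuming the energy \(E = -\sum_\mu F(\xi^\mu \cdot \sigma)\) with a rectified polynomial \(F\) and asynchronous sign updates, which is how I read the paper. The sizes, the degree `n=4` and the function names are illustrative choices; the point is only that the network holds more patterns than it has neurons.

```python
import numpy as np

def F(x, n):
    """Rectified polynomial interaction function."""
    return np.maximum(x, 0.0) ** n

def energy(sigma, patterns, n=4):
    """E = -sum_mu F(xi_mu . sigma) over the stored patterns xi_mu."""
    return -np.sum(F(patterns @ sigma, n))

def recall(sigma, patterns, n=4, sweeps=3):
    """Asynchronous updates: set each spin to whichever sign gives lower energy."""
    sigma = sigma.copy()
    for _ in range(sweeps):
        for i in range(len(sigma)):
            rest = patterns @ sigma - patterns[:, i] * sigma[i]   # overlaps without spin i
            gap = F(rest + patterns[:, i], n) - F(rest - patterns[:, i], n)
            sigma[i] = 1.0 if gap.sum() >= 0 else -1.0
    return sigma

rng = np.random.default_rng(0)
n_neurons, n_patterns = 64, 128          # twice as many patterns as neurons
patterns = rng.choice([-1.0, 1.0], size=(n_patterns, n_neurons))
probe = patterns[0].copy()
probe[:5] *= -1                          # corrupt five spins
recovered = recall(probe, patterns)
print(np.array_equal(recovered, patterns[0]))
print(energy(probe, patterns), energy(recovered, patterns))
```

With `n=2` and no rectification this reduces to the classical Hopfield update, whose capacity is only a fraction of the number of neurons.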

9 Density estimation using Real NVP

Laurent Dinh explains Dinh, Sohl-Dickstein, and Bengio (2016):

Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.

This ultimately feeds into the reparameterisation trick literature.
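The workhorse of real NVP is the affine coupling layer: half of the input passes through unchanged and parameterises a scale and shift of the other half, which makes the inverse and the log-determinant of the Jacobian cheap and exact. A minimal sketch follows; the tiny fixed random networks stand in for the learned `s` and `t` functions, and all names are hypothetical.

```python
import numpy as np

def make_net(rng, d_in, d_out, hidden=16):
    """Tiny fixed random MLP standing in for the learned s(.) and t(.) networks."""
    W1 = 0.5 * rng.standard_normal((d_in, hidden))
    W2 = 0.5 * rng.standard_normal((hidden, d_out))
    return lambda h: np.tanh(h @ W1) @ W2

def coupling_forward(x, s_net, t_net):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); returns (y, log|det J|)."""
    d = x.shape[1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    s, t = s_net(x1), t_net(x1)
    return np.hstack([x1, x2 * np.exp(s) + t]), s.sum(axis=1)

def coupling_inverse(y, s_net, t_net):
    """Exact inverse: x2 = (y2 - t(y1)) * exp(-s(y1))."""
    d = y.shape[1] // 2
    y1, y2 = y[:, :d], y[:, d:]
    s, t = s_net(y1), t_net(y1)
    return np.hstack([y1, (y2 - t) * np.exp(-s)])

rng = np.random.default_rng(0)
s_net, t_net = make_net(rng, 2, 2), make_net(rng, 2, 2)
x = rng.standard_normal((5, 4))
y, log_det = coupling_forward(x, s_net, t_net)
x_back = coupling_inverse(y, s_net, t_net)
# Exact log-density of x when y is given a standard normal base distribution.
log_px = -0.5 * np.sum(y ** 2, axis=1) - 0.5 * x.shape[1] * np.log(2 * np.pi) + log_det
print(np.max(np.abs(x - x_back)), log_px)
```

Because the Jacobian is triangular, its log-determinant is just the sum of the scale outputs, so neither `s_net` nor `t_net` ever needs to be inverted or differentiated to evaluate the likelihood.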

10 InfoGAN: Interpretable Representation Learning by Information Maximising Generative Adversarial Nets

Xi Chen presents Chen et al. (2016):

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

Usable parameterizations of GAN by structuring the latent space.

11 Parameter Learning for Log-supermodular Distributions

Tatiana Shpakova presents Shpakova and Bach (2016).

Hack of note:

In order to minimize the expectation […], we propose to use the projected stochastic gradient method, not on the data as usually done, but on our own internal randomization.

12 Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates

Li, Liang, and Risteski (2016):

Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints. It enjoys practical success but is poorly understood theoretically. This paper proposes an algorithm that alternates between decoding the weights and updating the features, and shows that assuming a generative model of the data, it provably recovers the ground-truth under fairly mild conditions. In particular, its only essential requirement on features is linear independence. Furthermore, the algorithm uses ReLU to exploit the non-negativity for decoding the weights, and thus can tolerate adversarial noise that can potentially be as large as the signal, and can tolerate unbiased noise much larger than the signal. The analysis relies on a carefully designed coupling between two potential functions, which we believe is of independent interest.

13 High dimensional learning with structure

High dimensional learning with structure page.

Luminaries:

Richard Samworth
Po-Ling Loh
Sahand Negahban
Mark Schmidt
Kai-Wei Chang
Allen Yang
Chinmay Hegde
Rene Vidal
Guillaume Obozinski
Lorenzo Rosasco

Several applications necessitate learning a very large number of parameters from small amounts of data, which can lead to overfitting, statistically unreliable answers, and large training/prediction costs. A common and effective method to avoid the above mentioned issues is to restrict the parameter-space using specific structural constraints such as sparsity or low rank. However, such simple constraints do not fully exploit the richer structure which is available in several applications and is present in the form of correlations, side information or higher order structure. Designing new structural constraints requires close collaboration between domain experts and machine learning practitioners. Similarly, developing efficient and principled algorithms to learn with such constraints requires further collaborations between experts in diverse areas such as statistics, optimization, approximation algorithms etc. This interplay has given rise to a vibrant research area.

The main objective of this workshop is to consolidate current ideas from diverse areas such as machine learning, signal processing, theoretical computer science, optimization and statistics, clarify the frontiers in this area, discuss important applications and open problems, and foster new collaborations.

Chinmay Hegde:

We consider the demixing problem of two (or more) high-dimensional vectors from nonlinear observations when the number of such observations is far less than the ambient dimension of the underlying vectors. Specifically, we demonstrate an algorithm that stably estimates the underlying components under general structured sparsity assumptions on these components. Specifically, we show that for certain types of structured superposition models, our method provably recovers the components given merely n = O(s) samples where s denotes the number of nonzero entries in the underlying components. Moreover, our method achieves a fast (linear) convergence rate, and also exhibits fast (near-linear) per-iteration complexity for certain types of structured models. We also provide a range of simulations to illustrate the performance of the proposed algorithm.

This ends up being sparse recovery for given bases (e.g. Dirac deltas plus a Fourier basis). The interesting problem is recovering the correct decomposition when the bases are insufficiently incoherent (they have a formalism for this).

Rene Vidal: “Deep learning is nonlinear tensor factorization”. Various results on tensor factorization, regularized with various norms. They have proofs, for a generalized class of matrix factorisations, that “sufficiently wide” factorization matrices do not have local minima. Conclusion: increase the size of the factorization in the optimisation procedure.

Guillaume Obozinski: hierarchical sparsity penalties for DAG inference.

14 Doug Eck

Presents Magenta.

15 Computing with spikes workshop

Computing with spikes home page.

16 Bayesian Deep Learning workshop

Bayesian Deep Learning workshop homepage.

17 NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop

NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop

18 Adaptive and Scalable Nonparametric Methods in Machine Learning

Looked solidly amazing, but I was caught up elsewhere:

Adaptive and Scalable Nonparametric Methods in Machine Learning

19 Brains and Bits: Neuroscience Meets Machine Learning

Especially curious about

Max Welling: Making Deep Learning Efficient Through Sparsification.

20 Spatiotemporal forecasting

Homepage of the NIPS workshop on ML for Spatiotemporal Forecasting.

21 Constructive machine learning

Rus Salakhutdinov

“On Multiplicative Integration with Recurrent Neural Networks” by Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R. Salakhutdinov.

Constructive machine learning

22 References

Allen-Zhu, and Hazan. 2016. “Optimal Black-Box Reductions Between Optimization Objectives.” In Advances in Neural Information Processing Systems 29.

Ba, Hinton, Mnih, et al. 2016. “Using Fast Weights to Attend to the Recent Past.” In Advances in Neural Information Processing Systems 29.

Bhojanapalli, Neyshabur, and Srebro. 2016. “Global Optimality of Local Search for Low Rank Matrix Recovery.” In Advances in Neural Information Processing Systems 29.

Chalk, Marre, and Tkacik. 2016. “Relevant Sparse Codes with Variational Information Bottleneck.” In Advances in Neural Information Processing Systems 29.

Chen, Duan, Houthooft, et al. 2016. “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 29.

Choy, Gwak, Savarese, et al. 2016. “Universal Correspondence Network.” In Advances in Neural Information Processing Systems 29.

David, Moran, and Yehudayoff. 2016. “Supervised Learning Through the Lens of Compression.” In Advances in Neural Information Processing Systems 29.

Dinh, Sohl-Dickstein, and Bengio. 2016. “Density Estimation Using Real NVP.” In Advances in Neural Information Processing Systems.

Dumoulin, Shlens, and Kudlur. 2016. “A Learned Representation For Artistic Style.” arXiv:1610.07629 [Cs].

Ellis, Solar-Lezama, and Tenenbaum. 2016. “Sampling for Bayesian Program Learning.” In Advances in Neural Information Processing Systems 29.

Finn, Goodfellow, and Levine. 2016. “Unsupervised Learning for Physical Interaction Through Video Prediction.” In Advances in Neural Information Processing Systems 29.

Flamary, Févotte, Courty, et al. 2016. “Optimal Spectral Transportation with Application to Music Transcription.” arXiv:1609.09799 [Cs, Stat].

Fraccaro, Sønderby, Paquet, et al. 2016. “Sequential Neural Models with Stochastic Layers.” In Advances in Neural Information Processing Systems 29.

Ge, Lee, and Ma. 2016. “Matrix Completion Has No Spurious Local Minimum.” In Advances in Neural Information Processing Systems 29.

Genevay, Cuturi, Peyré, et al. 2016. “Stochastic Optimization for Large-Scale Optimal Transport.” In Advances in Neural Information Processing Systems 29.

Gruslys, Munos, Danihelka, et al. 2016. “Memory-Efficient Backpropagation Through Time.” In Advances in Neural Information Processing Systems 29.

Haarnoja, Ajay, Levine, et al. 2016. “Backprop KF: Learning Discriminative Deterministic State Estimators.” In Advances in Neural Information Processing Systems 29.

Haeffele, and Vidal. 2015. “Global Optimality in Tensor Factorization, Deep Learning, and Beyond.” arXiv:1506.07540 [Cs, Stat].

Hazan, and Ma. 2016. “A Non-Generative Framework and Convex Relaxations for Unsupervised Learning.” In Advances in Neural Information Processing Systems 29.

He, Xu, Kempe, et al. 2016. “Learning Influence Functions from Incomplete Observations.” In Advances in Neural Information Processing Systems 29.

Horel, and Singer. 2016. “Maximization of Approximately Submodular Functions.” In Advances in Neural Information Processing Systems 29.

Jia, De Brabandere, Tuytelaars, et al. 2016. “Dynamic Filter Networks.” In Advances in Neural Information Processing Systems 29.

Kingma, Salimans, Jozefowicz, et al. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.

Krotov, and Hopfield. 2016. “Dense Associative Memory for Pattern Recognition.” In Advances in Neural Information Processing Systems 29.

Krummenacher, McWilliams, Kilcher, et al. 2016. “Scalable Adaptive Stochastic Optimization Using Random Projections.” In Advances in Neural Information Processing Systems 29.

Kuznetsov, and Mohri. 2014a. “Forecasting Non-Stationary Time Series: From Theory to Algorithms.”

———. 2014b. “Generalization Bounds for Time Series Prediction with Non-Stationary Processes.” In Algorithmic Learning Theory. Lecture Notes in Computer Science.

———. 2015. “Learning Theory and Algorithms for Forecasting Non-Stationary Time Series.” In Advances in Neural Information Processing Systems.

———. 2016. “Generalization Bounds for Non-Stationary Mixing Processes.” In Machine Learning Journal.

Li, Liang, and Risteski. 2016. “Recovery Guarantee of Non-Negative Matrix Factorization via Alternating Updates.” In Advances in Neural Information Processing Systems 29.

Lindgren, Wu, and Dimakis. 2016. “Leveraging Sparsity for Efficient Submodular Data Summarization.” In Advances in Neural Information Processing Systems 29.

Luo, Agarwal, Cesa-Bianchi, et al. 2016. “Efficient Second Order Online Learning by Sketching.” In Advances in Neural Information Processing Systems 29.

Makoto Yamada, Koh Takeuchi, Tomoharu Iwata, et al. 2016. “Localized Lasso for High-Dimensional Regression.”

Mohammadreza Soltani, and Chinmay Hegde. 2016. “Iterative Thresholding for Demixing Structured Superpositions in High Dimensions.”

Ostrovsky, Harchaoui, Juditsky, et al. 2016. “Structure-Blind Signal Recovery.” In Advances in Neural Information Processing Systems 29.

Poole, Lahiri, Raghu, et al. 2016. “Exponential Expressivity in Deep Neural Networks Through Transient Chaos.” In Advances in Neural Information Processing Systems 29.

Ritchie, Thomas, Hanrahan, et al. 2016. “Neurally-Guided Procedural Models: Amortized Inference for Procedural Graphics Programs Using Neural Networks.” In Advances in Neural Information Processing Systems 29.

Salimans, and Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In Advances in Neural Information Processing Systems 29.

Schein, Wallach, and Zhou. 2016. “Poisson-Gamma Dynamical Systems.” In Advances in Neural Information Processing Systems.

Shpakova, and Bach. 2016. “Parameter Learning for Log-Supermodular Distributions.” In Advances in Neural Information Processing Systems 29.

Sinha, and Duchi. 2016. “Learning Kernels with Random Features.” In Advances in Neural Information Processing Systems 29.

Soltani, and Hegde. 2016a. “Demixing Sparse Signals from Nonlinear Observations.” Statistics.

———. 2016b. “Fast Algorithms for Demixing Sparse Signals from Nonlinear Observations.” arXiv:1608.01234 [Stat].

Surace, and Pfister. 2016. “Online Maximum Likelihood Estimation of the Parameters of Partially Observed Diffusion Processes.”

van Erven, and Koolen. 2016. “MetaGrad: Multiple Learning Rates in Online Learning.” In Advances in Neural Information Processing Systems 29.

Wang, Xu, You, et al. 2016. “CNNpack: Packing Convolutional Neural Networks in the Frequency Domain.” In Advances in Neural Information Processing Systems 29.

Wu, Shanshan, Bhojanapalli, Sanghavi, et al. 2016. “Single Pass PCA of Matrix Products.” In Advances in Neural Information Processing Systems 29.

Wu, Yuhuai, Zhang, Zhang, et al. 2016. “On Multiplicative Integration with Recurrent Neural Networks.” In Advances in Neural Information Processing Systems 29.

Yuan, Li, Zhang, et al. 2016. “Learning Additive Exponential Family Graphical Models via \(\ell_{2,1}\)-Norm Regularized M-Estimation.” In Advances in Neural Information Processing Systems 29.

Yu, Hsiang-Fu, Rao, and Dhillon. 2016. “Temporal Regularized Matrix Factorization for High-Dimensional Time Series Prediction.” In Advances in Neural Information Processing Systems 29.

Yu, Felix X, Suresh, Choromanski, et al. 2016. “Orthogonal Random Features.” In Advances in Neural Information Processing Systems 29.

Zhang, and Liang. 2016. “Reshaped Wirtinger Flow for Solving Quadratic System of Equations.” In Advances in Neural Information Processing Systems 29.

Zhang, Lin, Lin, et al. 2016. “Infinite Hidden Semi-Markov Modulated Interaction Point Process.” In Advances in Neural Information Processing Systems 29.