Modern computational neural network methods reascend the hype phase
transition. A.k.a. *deep learning* or *double plus
fancy brainbots* or *please give the department a bigger GPU budget,
it’s not to play video games I swear*.

I don’t intend to write an introduction to deep learning here; that ground has been tilled already.

But here are some handy links to resources I frequently use and a bit of under-discussed background.

## What?

To be specific, deep learning is

- a library of incremental improvements in areas such as stochastic gradient descent, approximation theory, graphical models, and signal processing research, plus some handy advancements in SIMD architectures that, taken together, surprisingly elicit the kind of results from machine learning that everyone was hoping we’d get at least 20 years ago, yet *without* requiring us to develop substantially more clever grad students to do so, or,
- the state-of-the-art in artificial kitten recognition, or,
- a metastasizing buzzword.

It’s a frothy (some might say foamy-mouthed) research bubble right now, with such cuteness at the extrema as, e.g., Inceptionising inceptionism (Andrychowicz et al. 2016), which learns to learn neural networks using neural networks. (Well, it sort of does that, but it is a long way from a bootstrapping general AI.) Stay tuned for more of this.

There is not much to do with “neurons” left in the paradigm at this stage. What there is, is a bundle of clever tricks for training deep constrained hierarchical predictors and classifiers on modern computer hardware. Something closer to a convenient technology stack than a single “theory”.

Some network methods hew closer to the behaviour of real neurons,
although not *that* close; simulating actual brains is
a different discipline
with only intermittent and indirect connection to this one.

Subtopics of interest to me:

- recurrent networks for audio data
- compressing deep networks
- neural stack machines
- probabilistic learning
- generative models, esp for art

## Why bother?

There are many answers.

### The ultimate regression algorithm

…until the next ultimate regression algorithm.

It turns out that this particular learning model (class of learning models) and its training technologies are surprisingly good at getting ever better models out of ever more data. Why burn three grad students on a perfectly tractable and specific regression algorithm when you can use one algorithm to solve a whole bunch of regression problems, one which improves with the number of computers and the amount of data you have? How much of a relief is it to capital to decouple its effectiveness from the uncertainty and obstreperousness of human labour?

### Cool maths

Function approximations, interesting manifold inference. Weird product measure things, e.g. (Montufar 2014).

Even the stuff I’d assumed was trivial, like backpropagation, has a few wrinkles in practice. See Michael Nielsen’s chapter and Christopher Olah’s visual summary.

Yes, this is a regular paper mill. Not only are there probably new insights to
be had, but also you can recycle any old machine learning insight, replace
a layer in a network with that and *poof*, new paper.

### Insight into the mind

Maybe.

There are claims of communication between real neurology and neural networks in computer vision, but elsewhere neural networks are driven by their similarities to other things, such as being differentiable relaxations of traditional models (differentiable stack machines!), or being a license to fit hierarchical models without regard for statistical niceties.

There might be some kind of occasional “stylised fact”-type relationship.

### Trippy art projects

## Hip keywords for NN models

Not necessarily mutually exclusive; some design patterns you can use.

There are many summaries floating around. Some that I looked at are Tomasz Malisiewicz’s summary of Deep Learning Trends @ ICLR 2016, or the Neural network zoo or Simon Brugman’s deep learning papers.

Some of these are descriptions of topologies, others of training tricks or whatever. Recurrent and convolutional are two types of topologies you might have in your ANN. But there are so many other possible ones: “Grid”, “highway”, “Turing”, and others…

Many are mentioned in passing in David McAllester’s Cognitive Architectures post.

### Probabilistic/variational

### Convolutional

See the convnets entry.

### Generative Adversarial Networks

### Recurrent neural networks

Feedback neural network structures that have memory and a notion of time and “current” versus “past” state. See recurrent neural networks.
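To make “current versus past state” concrete, here is a minimal sketch of a scalar Elman-style recurrence, `h_t = tanh(w_h * h_{t-1} + w_x * x_t + b)`. The function names and the weight values are my own illustrative assumptions, not taken from any particular library or paper.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One step of a scalar recurrent cell: mix the past state with the current input."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(xs, h0=0.0, **weights):
    """Feed a sequence through the cell, returning the hidden state after each input."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(h, x, **weights)
        states.append(h)
    return states

# An impulse followed by silence: the state decays gradually rather than
# dropping straight to zero, which is the cell's "memory" of past input.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
```

Setting `w_h=0` removes the feedback, and the same impulse is forgotten after a single step.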

#### Grid and other axial tricks

A mini-genre. Kalchbrenner, Danihelka, and Graves (2016) connect recurrent cells across multiple axes, leading to a higher-rank MIMO system. This is natural in spatial random fields, and I am amazed it was uncommon enough to need formalizing in a paper; but apparently it was and it did.

### Transfer learning

I have seen two versions of this term.

One starts from the idea that if you have, say, a network that solves some particular computer vision problem well, possibly you can use it to solve another computer vision problem without starting from scratch.
This is the
*Recycling someone else’s features* framing.
I don’t know why this has a special term;
I think it’s so that you can claim to do “end-to-end” learning, but then
actually do what everyone else has done forever and works totally OK, which is to re-use other people’s work like real scientists.

The other version is you would like to do *domain adaptation*, which is to say, to *learn on one dataset but still make good predictions on a different dataset*.

These two things can clearly be related if you squint hard.
Using “transfer learning” in this second sense irritates me slightly because it already has so many names:
I would describe that problem as *external validity* instead of *domain adaptation*, but other names spotted in the wild include
*dataset shift*, *covariate shift*, *data fusion* and there are probably more.
This is a fundamental problem in statistics, and the philosophy of science generally, and has been for a long time.

### Attention mechanism

See Attention mechanism.

### Spike-based

Most simulated neural networks are based on a continuous activation potential and discrete time, unlike spiking biological ones, which are driven by discrete events in continuous time. There are a great many other differences (to real biology). What difference does this in particular make? I suspect it means that time is handled differently.
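For contrast with the usual continuous activations, here is a rough sketch of a leaky integrate-and-fire neuron, the textbook toy model of spiking. Note the irony that this is itself an Euler discretisation, so it only approximates continuous-time dynamics; the function name and parameter values (`tau`, `v_thresh` and so on) are arbitrary choices of mine for illustration.

```python
def lif_spike_times(input_current, dt=1e-3, tau=0.02, v_thresh=1.0, v_reset=0.0):
    """Euler-simulate a leaky integrate-and-fire neuron; return spike times in seconds.

    Membrane dynamics: tau * dv/dt = -v + I(t). Whenever v reaches v_thresh
    the neuron emits a spike (a discrete event) and v is reset.
    """
    v, spikes = v_reset, []
    for step, current in enumerate(input_current):
        v += dt / tau * (current - v)      # leaky integration of the input
        if v >= v_thresh:
            spikes.append((step + 1) * dt)  # the output is *when*, not *how much*
            v = v_reset
    return spikes

# One simulated second of constant drive, above and below threshold.
strong = lif_spike_times([2.0] * 1000)   # regular spike train
weak = lif_spike_times([0.5] * 1000)     # never reaches threshold
```

The information here lives in the spike times rather than in a real-valued activation, which is one way of seeing why time might be handled differently.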

### Kernel networks

Kernel trick + ANN = kernel ANNs.

(Stay tuned for reframing more things as deep learning.)

Is this what *convex networks* (Bengio et al. 2005) are?

AFAICT these all boil down to rebadged extensions of Gaussian processes but maybe I’m missing something?

### Autoencoding

Making a sparse encoding of something by demanding your network reproduces the input after passing the network activations through a narrow bottleneck. Many flavours.
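The bottleneck idea can be sketched with about the smallest autoencoder imaginable: a linear one that squeezes 2-D points through a single number and is trained to reproduce its input. Everything here (the toy data on the line `x2 = 2 * x1`, the initial weights, the learning rate) is an assumption chosen to keep the example tiny; real autoencoders are nonlinear and much bigger.

```python
def reconstruct(x, w, u):
    """Encode a 2-D point to a scalar code z, then decode back to 2-D."""
    z = w[0] * x[0] + w[1] * x[1]           # the bottleneck: one number per point
    return z, [u[0] * z, u[1] * z]

def mean_loss(data, w, u):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, w, u)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train_autoencoder(data, lr=0.02, epochs=200):
    """Per-sample gradient descent on the squared reconstruction error."""
    w, u = [0.1, 0.1], [0.1, 0.1]           # arbitrary small initial weights
    losses = [mean_loss(data, w, u)]
    for _ in range(epochs):
        for x in data:
            z, x_hat = reconstruct(x, w, u)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            dz = 2 * (err[0] * u[0] + err[1] * u[1])   # dLoss/dz, before u changes
            for i in range(2):
                u[i] -= lr * 2 * err[i] * z            # decoder gradient step
                w[i] -= lr * dz * x[i]                 # encoder gradient step
        losses.append(mean_loss(data, w, u))
    return w, u, losses

# Toy data lying on a line, so a 1-D code is enough to describe it.
data = [[i / 10, 2 * i / 10] for i in range(-10, 11)]
w, u, losses = train_autoencoder(data)
```

Because the data are really one-dimensional, the single code `z` loses nothing; data that do not fit through the bottleneck would leave a residual reconstruction error.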

## Optimisation methods

Backpropagation plus stochastic gradient descent rules at the moment.
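For concreteness, here is a minimal sketch of that combination on a network with one tanh hidden layer, scalar input and scalar output. The network size, the toy target `y = x**2`, the initial weights and the learning rate are all arbitrary assumptions for illustration, and this is plain per-sample SGD with none of the momentum or adaptive-step refinements used in practice.

```python
import math
import random

def forward(params, x):
    """Hidden activations and output of a one-hidden-layer tanh network."""
    w1, b1, w2, b2 = params
    h = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    y = sum(v * hj for v, hj in zip(w2, h)) + b2
    return h, y

def backprop(params, x, target):
    """Gradient of the squared error (y - target)**2 via the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = 2 * (y - target)                                    # dL/dy
    g_w2 = [dy * hj for hj in h]                             # output weights
    dpre = [dy * v * (1 - hj * hj) for v, hj in zip(w2, h)]  # back through tanh
    g_w1 = [d * x for d in dpre]                             # hidden weights
    return g_w1, dpre, g_w2, dy    # same layout as params: (w1, b1, w2, b2)

def sgd(params, data, lr=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: one update per shuffled training example."""
    rng = random.Random(seed)
    w1, b1, w2, b2 = params
    for _ in range(epochs):
        for x, t in rng.sample(data, len(data)):
            g_w1, g_b1, g_w2, g_b2 = backprop((w1, b1, w2, b2), x, t)
            w1 = [w - lr * g for w, g in zip(w1, g_w1)]
            b1 = [b - lr * g for b, g in zip(b1, g_b1)]
            w2 = [w - lr * g for w, g in zip(w2, g_w2)]
            b2 -= lr * g_b2
    return w1, b1, w2, b2

def mse(params, data):
    """Mean squared error of the network over the dataset."""
    return sum((forward(params, x)[1] - t) ** 2 for x, t in data) / len(data)

data = [(i / 10, (i / 10) ** 2) for i in range(-10, 11)]
init = ([0.5, -0.3, 0.8], [0.1, -0.2, 0.0], [0.3, -0.4, 0.2], 0.0)
trained = sgd(init, data)
```

A cheap sanity check when hand-rolling something like `backprop` is to compare its output against a finite-difference estimate of the same derivative.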

Does anything else get performance at this scale?
What other techniques can be extracted from variational inference
or MC sampling, or particle filters,
since there is no clear reason that shoving any of these in
as intermediate layers in the network
is any *less* well-posed than a classical backprop layer?
Although it does require more nous from the enthusiastic grad student.

## Preventing overfitting

## Activations for neural networks

## Implementing

## References

*Neural Computation* 10 (2): 251–76.

*IEEE Transactions on Electronic Computers* EC-16 (3): 299–307.

*arXiv:1606.04474 [Cs]*, June.

*IEEE Computational Intelligence Magazine* 5 (4): 13–18.

*Proceedings of The 28th Conference on Learning Theory*, 40:113–49. Paris, France: PMLR.

*arXiv:1412.8690 [Cs, Math, Stat]*, December.

*Proceedings of the National Academy of Sciences* 113 (48): E7655–62.

*IEEE Transactions on Information Theory* 39 (3): 930–45.

*arXiv:1611.03777 [Cs, Stat]*, November.

*Learning Deep Architectures for AI*. Vol. 2.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 35: 1798–828.

*Large-Scale Kernel Machines* 34: 1–41.

*Advances in Neural Information Processing Systems*, 18:123–30. MIT Press.

*J. Solid State Circuits* 26: 2017–25.

*arXiv:1706.04983 [Cs, Stat]*, June.

*PLoS Comp. Biol.* 10: e1003963.

*arXiv:1511.05641 [Cs]*, November.

*arXiv Preprint arXiv:1409.1259*.

*Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics*, 192–204.

*J. Phys. Conf. Series* 368: 012030.

*Neural Networks* 32: 333–38.

*Mathematics of Control, Signals and Systems* 2: 303–14.

*IEEE Transactions on Audio, Speech and Language Processing* 20: 33–42.

*Advances in Neural Information Processing Systems 27*, 2933–41. Curran Associates, Inc.

*2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 6964–68. IEEE.

*Journal of Machine Learning Research* 11 (Feb): 625–60.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 35: 1915–29.

*Neural Networks* 13 (3): 317–27.

*Pattern Recognition* 15 (6): 455–69.

*arXiv:1512.05287 [Stat]*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 26: 1408–23.

*arXiv:1508.06576 [Cs, q-Bio]*, August.

*arXiv:1412.5896 [Cs, Math, Stat]*, December.

*IEEE Transactions on Signal Processing* 64 (13): 3444–57.

*arXiv:1606.05316 [Cs]*, June.

*arXiv:1412.6572 [Cs, Stat]*, December.

*arXiv:1412.6544 [Cs, Stat]*, December.

*Advances in Neural Information Processing Systems 27*, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2672–80. NIPS’14. Cambridge, MA, USA: Curran Associates, Inc.

*2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, 2:1735–42.

*Neuron* 105 (3): 416–34.

*Advances in Neural Information Processing Systems*.

*Nature* 500: 168–74.

*Science* 268 (5214): 1158–61.

*IEEE Signal Processing Magazine* 29 (6): 82–97.

*Neural Networks: Tricks of the Trade*, 9:926. Lecture Notes in Computer Science 7700. Springer Berlin Heidelberg.

*Progress in Brain Research*, edited by Paul Cisek, Trevor Drew, and John F. Kalaska, 165:535–47. Computational Neuroscience: Theoretical Insights into Brain Function. Elsevier.

*Science* 313 (5786): 504–7.

*Neural Computation* 18 (7): 1527–54.

*Neural Networks* 2 (5): 359–66.

*2014 48th Asilomar Conference on Signals, Systems and Computers*.

*International Journal of Information Technology* 11 (1): 16–24.

*International Journal of Machine Learning and Cybernetics* 2 (2): 107–22.

*2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings*, 2:985–990 vol.2.

*Neurocomputing*, Neural Networks: Selected Papers from the 7th Brazilian Symposium on Neural Networks (SBRN ’04), 70 (1–3): 489–501.

*J. Physiol.* 160: 106–54.

*arXiv:1608.05343 [Cs]*, August.

*arXiv:1511.08228 [Cs]*, November.

*arXiv:1507.01526 [Cs]*, January.

*arXiv:1010.3467 [Cs]*, October.

*Advances in Neural Information Processing Systems 29*. Curran Associates, Inc.

*Advances in Neural Information Processing Systems*, 1097–1105.

*arXiv:1503.03167 [Cs]*, March.

*arXiv:1512.09300 [Cs, Stat]*, December.

*IEEE Transactions on Neural Networks* 8: 98–113.

*Proceedings of the IEEE* 86 (11): 2278–2324.

*Nature* 521 (7553): 436–44.

*Predicting Structured Data*.

*Proceedings of the 26th Annual International Conference on Machine Learning*, 609–16. ICML ’09. New York, NY, USA: ACM.

*IEEE Transactions on Information Theory* 42 (6): 2118–32.

*Bioinformatics* 30: i121–29.

*arXiv:1606.06737 [Cond-Mat]*, June.

*arXiv:1608.08225 [Cond-Mat, Stat]*, August.

*arXiv:1602.07320 [Cs]*, February.

*arXiv:1606.03490 [Cs, Stat]*.

*arXiv:1506.00019 [Cs]*, May.

*J. Chem. Inf. Model.* 55: 263–74.

*Proceedings of the 32nd International Conference on Machine Learning*, 2113–22. PMLR.

*Communications on Pure and Applied Mathematics* 65 (10): 1331–98.

*arXiv:1601.04920 [Cs, Stat]*, January.

*arXiv:1410.3831 [Cond-Mat, Stat]*, October.

*arXiv:1301.3781 [Cs]*, January.

*arXiv:1309.4168 [Cs]*, September.

*Nature* 518: 529–33.

*IEEE Transactions on Audio, Speech, and Language Processing* 20 (1): 14–22.

*Neural Networks* 25 (January): 70–83.

*J. Discrete Math.* 29: 321–47.

*ICASSP*.

*IEEE Transactions on Image Processing* 14: 1360–71.

*Advances In Neural Information Processing Systems*.

*Network (Bristol, England)* 7 (2): 333–39.

*Nature* 381 (6583): 607–9.

*Current Opinion in Neurobiology* 14 (4): 481–87.

*9th ISCA Speech Synthesis Workshop*.

*arXiv:1601.06759 [Cs]*, January.

*arXiv:1606.05328 [Cs]*, June.

*arXiv:1606.07326 [Cs, Stat]*, June.

*arXiv:1702.08360 [Cs]*, February.

*arXiv:1405.4604 [Cs]*, May.

*arXiv:1412.6621 [Cs, Stat]*, December.

*Acta Numerica* 8 (January): 143–95.

*arXiv:1511.06434 [Cs]*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 35 (9): 2206–22.

*Advances in Neural Information Processing Systems 20*, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1185–92. Curran Associates, Inc.

*Nature* 323 (6088): 533–36.

*arXiv:1412.6615 [Cs, Stat]*, December.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–1. Curran Associates, Inc.

*arXiv:1607.00485 [Cs, Stat]*, July.

*arXiv:1701.06538 [Cs, Stat]*, January.

*arXiv:1703.00810 [Cs]*, March.

*arXiv:1702.04283 [Cs]*, February.

*Proceedings of International Conference on Learning Representations (ICLR) 2015*.

*Organic and functional nervous diseases; a text-book of neurology*. New York, Philadelphia, Lea & Febiger.

*arXiv:1507.02284 [Cs, Math, Stat]*, July.

*arXiv:1509.08101 [Cs]*, September.

*Neural Comput.* 22: 511–38.

*arXiv:1603.05691 [Cs, Stat]*, March.

*Proceedings of IEEE International Symposium on Information Theory*.

*IEEE Transactions on Information Theory* 64 (7): 1–1.

*arXiv:1611.03131 [Cs, Stat]*, November.

*IEEE Signal Processing Magazine* 28 (1): 145–54.

*Proceedings of ICLR*.

*Advances In Neural Information Processing Systems*.
