It seems we should be able to do better than a gigantic network with millions of parameters. @ChengSurvey2017 is a popular summary of the state of the art of both theory and practice as of 2017. Question: how do you do this with recurrent neural networks?
Plainly, once we have trained the graph, how can we simplify it, compress it, or prune it? One model here is the “Student-Teacher” network, where you use one big network to train a little network, e.g. @UrbanDeep2016… Summarised by Tomasz Malisiewicz:
we now have teacher-student training algorithms which you can use to have a shallower network “mimic” the teacher’s responses on a large dataset. These shallower networks are able to learn much better using a teacher and in fact, such shallow networks produce inferior results when they are trained directly on the teacher’s training set. So it seems you can go [Data to MegaDeep], and [MegaDeep to MiniDeep], but you cannot directly go from [Data to MiniDeep].
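The core of the student-teacher trick is the training target: instead of hard labels, the student matches the teacher’s *softened* output distribution. A minimal NumPy sketch of that loss (temperature value and toy logits are illustrative, not from any particular paper):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer targets."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's softened distribution
    and the student's softened distribution."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()

# Toy check: the loss shrinks as the student's logits approach the teacher's.
teacher = np.array([[5.0, 1.0, -2.0]])
far_student = np.array([[0.0, 3.0, 1.0]])
near_student = np.array([[4.5, 1.2, -1.8]])
assert distillation_loss(near_student, teacher) < distillation_loss(far_student, teacher)
```

The softened targets carry the teacher’s relative confidences across *wrong* classes too, which is presumably part of why the student learns better from the teacher than from raw labels.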
NB these networks are still bloody big, much bigger than we might hope.
This all seems intuitive, for the following hand-wavy reason: overparameterization is demonstrably important during training, providing some “slack variables” for assimilating all the data the network receives. However, once the network has reached a “good” optimum, some of those parameters are no longer needed; a much smaller representation of the manifold that each layer learned is probably available. But how much smaller?
This is suggestive of using some of the dimension reduction ideas such as mixture models, or whatever function approximation / matrix factorisation takes your fancy, to learn a “good” approximation of each layer, once the overall network is trained.
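The simplest instance of the per-layer-factorisation idea is a truncated SVD of a trained weight matrix: keep the top $k$ singular directions and store two thin factors instead of one dense matrix. A sketch, assuming (hypothetically) that the trained layer has low effective rank plus small noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained 256x256 dense layer: rank-10 structure plus noise.
low_rank = rng.normal(size=(256, 10)) @ rng.normal(size=(10, 256))
W = low_rank + 0.01 * rng.normal(size=(256, 256))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 10                       # retained rank
A = U[:, :k] * s[:k]         # 256 x k factor
B = Vt[:k, :]                # k x 256 factor
W_approx = A @ B

rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
# Two thin matrices replace one big one: 2*256*10 vs 256*256 parameters.
assert rel_err < 0.05
assert A.size + B.size < W.size / 10
```

At inference time `x @ W` becomes `(x @ A) @ B`, so the parameter saving is also a FLOP saving. How small $k$ can go without hurting accuracy is exactly the open question above.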
Quantizing weights to fewer bits is another popular approach (8 bits, 1 bit…).
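For concreteness, here is the basic affine 8-bit scheme in NumPy: store `uint8` codes plus a float scale and offset per tensor, so the reconstruction error is bounded by half a quantization step. (A sketch of the generic idea, not any specific library’s implementation.)

```python
import numpy as np

def quantize_uint8(w):
    """Affine 8-bit quantization: map the weight range onto 0..255."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 255.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float weights from the uint8 codes."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=(64, 64)).astype(np.float32)
q, scale, lo = quantize_uint8(w)
w_hat = dequantize(q, scale, lo)

# 4x smaller storage than float32; max error is half a quantization step.
assert q.dtype == np.uint8
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

The 1-bit extreme replaces each weight by its sign times a learned per-layer scalar, which is a much more violent approximation and usually needs quantization-aware training.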
Here is an interesting attempt to examine a related problem in reverse, and connect it to recurrent neural networks:
Most modern neural network architectures are either a deep ConvNet, or a long RNN, or some combination of the two. These two architectures seem to be at opposite ends of a spectrum. Recurrent Networks can be viewed as a really deep feed forward network with the identical weights at each layer (this is called weight-tying). A deep ConvNet allows each layer to be different. But perhaps the two are related somehow. Every year, the winning ImageNet models get deeper and deeper. Think about a deep 110-layer, or even 1001-layer Residual Network architectures we keep hearing about. Do all 110 layers have to be unique? Are most layers even useful?
People have already thought of forcing a deep ConvNet to be like an RNN, i.e. with identical weights at every layer. However, if we force a deep ResNet to have its weights tied, the performance would be embarrassing. In our paper, we use HyperNetworks to explore a middle ground – to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each layer of the ResNet. We can use a hypernetwork to explore the tradeoff between the model’s expressivity versus how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.
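The relaxed weight-tying can be sketched in a few lines: one small generator is shared across layers, and each layer owns only a tiny embedding vector from which its full weight matrix is produced. All shapes and sizes below are hypothetical, chosen just to show the parameter accounting:

```python
import numpy as np

rng = np.random.default_rng(2)

d = 16          # width of the main network's layers (hypothetical)
z_dim = 4       # size of each layer's embedding (hypothetical)
n_layers = 8

# The hypernetwork: a single shared linear map from embedding to weights.
# Its z_dim * d * d parameters are paid once, not once per layer.
G = rng.normal(scale=0.1, size=(z_dim, d * d))

# Each layer stores only its own small embedding vector.
layer_embeddings = rng.normal(size=(n_layers, z_dim))

def layer_weights(z):
    """Generate one layer's d x d weight matrix from its embedding."""
    return (z @ G).reshape(d, d)

def forward(x):
    for z in layer_embeddings:
        x = np.tanh(x @ layer_weights(z))
    return x

y = forward(rng.normal(size=(1, d)))

# Stored parameters: shared generator + per-layer embeddings,
# versus n_layers fully independent d x d matrices.
stored = G.size + layer_embeddings.size
untied = n_layers * d * d
assert stored < untied
assert y.shape == (1, d)
```

Shrinking `z_dim` slides the dial towards full weight-tying (an RNN-like network); growing it towards fully independent layers. That is the compression dial the quoted passage describes.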
They mention also a version by Schmidhuber’s omnipresent lab: Compressed Network Search
Normally regularisation penalties are not used to reduce the overall size of a neural network. In matrix terms, they seem to do matrix sparsification but not matrix sketching.
See [@PanDropNeuron2016] for one attempt to drop neurons:
DropNeuron is aimed at training a small model from a large, randomly initialized model, rather than compressing or reducing a large trained model. DropNeuron can be used in combination with other regularization techniques, e.g. Dropout, L1, L2.
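The generic mechanism behind neuron-dropping regularisers is a group penalty: penalise the norm of each hidden neuron’s incoming and outgoing weights *as a group*, so the optimiser can drive entire neurons to zero rather than scattering zeros through the matrix (sparsification vs. actually shrinking the network). A sketch of that idea in NumPy, not the paper’s exact formulation:

```python
import numpy as np

def neuron_group_penalty(W_in, W_out, lam=1e-2):
    """Group-lasso style penalty: sum over hidden neurons of the l2 norm
    of that neuron's fan-in and fan-out weights. Driving a group norm to
    zero makes the whole neuron removable from the architecture."""
    fan_in = np.linalg.norm(W_in, axis=0)    # one norm per hidden neuron
    fan_out = np.linalg.norm(W_out, axis=1)  # ditto, outgoing side
    return lam * (fan_in.sum() + fan_out.sum())

rng = np.random.default_rng(3)
W_in = rng.normal(size=(10, 32))    # input -> hidden
W_out = rng.normal(size=(32, 5))    # hidden -> output

# Zeroing one neuron's weight groups strictly lowers the penalty,
# which is the gradient pressure that prunes it during training.
p_full = neuron_group_penalty(W_in, W_out)
W_in[:, 0] = 0.0
W_out[0, :] = 0.0
assert neuron_group_penalty(W_in, W_out) < p_full
```

This is the regulariser-that-shrinks-the-network story missing from plain L1: an entrywise penalty zeros individual weights, but only a group penalty removes rows and columns you can actually delete.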