Brief links on the theme of scaling in the extremely large model/large data limit and what that does to the behaviour of the models. Possibly a fruitful new front in the complexity of statistics.
As to how to scale up these models, see distributed gradient descent.
One fun result comes from Transformer language models. An interesting observation way back in 2020 was that there seemed to be an unexpected trade-off where you can go faster by training a bigger network. Indeed, there is a whole family of observations in this vein.
OpenAI derives a scaling law called L(D). It gives the best loss you could possibly achieve, even with arbitrarily large compute and models, if you are only allowed to train on D data points.
No matter how good your model is, there is only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).
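To make the shape of such a law concrete, here is a minimal sketch that assumes L(D) is a pure power law in the dataset size, L(D) = (D_c / D)^α. The function name `data_limited_loss` is hypothetical, and the default constants are illustrative assumptions of roughly the order reported for autoregressive transformers, not values taken from this note.

```python
def data_limited_loss(d_tokens, d_c=5.4e13, alpha_d=0.095):
    """Assumed power-law loss floor L(D) = (d_c / D) ** alpha_d.

    d_tokens: number of training tokens D (must be positive).
    d_c, alpha_d: illustrative constants (assumptions, not fitted here).
    """
    if d_tokens <= 0:
        raise ValueError("d_tokens must be positive")
    return (d_c / d_tokens) ** alpha_d


if __name__ == "__main__":
    # Each tenfold increase in data multiplies the loss floor by 10 ** -alpha_d.
    for d in (1e9, 1e10, 1e11):
        print(f"D = {d:.0e} tokens -> L(D) = {data_limited_loss(d):.3f}")
```

Under this assumed form the returns diminish smoothly: with a small exponent, even a tenfold increase in data shaves only a modest fraction off the achievable loss.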
OpenAI also derives another scaling law, called L(C). This is the best you can do with compute C, if you spend it optimally.
What does optimal spending look like? Remember, you can spend a unit of compute on
- a bigger model (N), or
- training the same model for longer (S)
…In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.
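The allocation argument above can be sketched numerically. The snippet assumes that compute scales as the product of model size, batch size and number of steps, and that the optimal split follows power laws in compute. The function `split_compute_growth` is hypothetical, batch size is an extra knob not discussed above, and the exponents (0.73, 0.24, 0.03) are assumptions of roughly the magnitude reported for that regime, included only to illustrate the imbalance.

```python
def split_compute_growth(factor, model_exp=0.73, batch_exp=0.24, steps_exp=0.03):
    """Split a compute multiplier into multipliers for N, batch size and S.

    Assumes compute is proportional to model_size * batch_size * steps and
    that each optimal quantity grows as a power of compute. The exponents
    are illustrative assumptions and must sum to 1.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if abs(model_exp + batch_exp + steps_exp - 1.0) > 1e-9:
        raise ValueError("exponents must sum to 1")
    return {
        "model_size": factor ** model_exp,
        "batch_size": factor ** batch_exp,
        "steps": factor ** steps_exp,
    }


if __name__ == "__main__":
    # With 1000x more compute, how much should each knob grow?
    for name, growth in split_compute_growth(1000.0).items():
        print(f"{name}: x{growth:.2f}")
```

With these assumed exponents, almost all of the extra compute goes into a bigger model, and the number of steps barely moves, which is the sense in which making the model bigger beats training for longer.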
The bitter lesson of history in AI is that “general methods that leverage computation are ultimately the most effective, and by a large margin.”
- Zhang et al. (2020) (how do NNs learn from language as n increases?)