Brief links on scaling in the extremely-large-model, extremely-large-data limit, and what that does to the behaviour of the models. Possibly a fruitful new front in the complexity of statistics.

As to *how* to scale up these models, see distributed gradient descent.

## Big transformers

One fun result comes from Transformer language models. An interesting observation way back in 2020 was that there seemed to be an unexpected trade-off: you could train faster by using a bigger network. Indeed, there is a whole family of observations in this vein.

nostalgebraist summarises Henighan et al. (2020) and Kaplan et al. (2020):

## L(D): information

OpenAI derives a scaling law called L(D). This law is the best you could possibly do – even with arbitrarily large compute/models – if you are only allowed to train on D data points.

No matter how good your model is, there is only so much it can learn from a finite sample. L(D) quantifies this intuitive fact (if the model is an autoregressive transformer).
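Kaplan et al. (2020) model limits of this kind as power laws. As a rough sketch, assuming the form `L(D) = (D_c / D) ** alpha_D`, the following shows how slowly such a floor falls as the data grows. The function name and the constants `d_c` and `alpha_d` are illustrative placeholders, not the fitted values from the paper.

```python
def loss_floor_from_data(d_tokens, d_c=1e13, alpha_d=0.1):
    """Toy data-limited loss floor, L(D) = (d_c / d_tokens) ** alpha_d.

    d_c and alpha_d are made-up placeholder constants chosen for
    illustration; they are NOT the values fitted by Kaplan et al.
    """
    if d_tokens <= 0:
        raise ValueError("d_tokens must be positive")
    return (d_c / d_tokens) ** alpha_d


# Each tenfold increase in data multiplies the floor by the same factor,
# 10 ** -alpha_d, which is what a straight line on a log-log plot means.
for d in (1e9, 1e10, 1e11, 1e12):
    print(f"D = {d:.0e} tokens -> loss floor {loss_floor_from_data(d):.3f}")
```

With a small exponent like this one, a tenfold increase in data buys only a modest reduction in the floor, which is the sense in which a finite sample limits what any model can learn.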

## L(C): budgeting

OpenAI also derives another scaling law called L(C). This is the best you can do with compute C, if you spend it optimally.

What does optimal spending look like? Remember, you can spend a unit of compute on

- a bigger model (N), or
- training the same model for longer (S)

…In the compute regime we are currently in, making the model bigger is way more effective than taking more steps.
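The budgeting question can be sketched as a small optimisation. The toy below assumes a loss that is the sum of a model-size term and a training-steps term, each a power law, and assumes that compute is simply `N * S`. Both the functional form and every constant here are illustrative assumptions, not the fitted law from the papers. It grid-searches the best split for a given budget.

```python
import math


def toy_loss(n_params, n_steps, n_c=1e14, alpha_n=0.08, s_c=2e3, alpha_s=0.8):
    """Toy loss: (n_c / N) ** alpha_n + (s_c / S) ** alpha_s.

    All four constants are made-up placeholders for illustration.
    """
    return (n_c / n_params) ** alpha_n + (s_c / n_steps) ** alpha_s


def best_split(compute, grid=2001):
    """Grid-search model size N on a log grid, with S = compute / N.

    Assumes, as a simplification, that compute is just N * S.
    Returns (loss, n_params, n_steps) for the best grid point.
    """
    best = None
    max_log_n = math.log10(compute)  # N ranges from 1 up to the whole budget
    for i in range(grid):
        n = 10 ** (max_log_n * i / (grid - 1))
        s = compute / n
        loss = toy_loss(n, s)
        if best is None or loss < best[0]:
            best = (loss, n, s)
    return best


for c in (1e12, 1e14):
    loss, n, s = best_split(c)
    print(f"C = {c:.0e}: N = {n:.2e}, S = {s:.2e}, loss = {loss:.3f}")
```

Under these assumed exponents, where the steps term decays much faster than the model-size term, most of a hundredfold increase in budget goes into a larger `N` and only a little into more steps. This matches the qualitative claim above, although the real analysis also accounts for batch size.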

## Bitter lesson

The bitter lesson of history in AI is that “general methods that leverage computation are ultimately the most effective, and by a large margin.”

## Misc

- Zhang et al. (2020) (how do NNs learn from language as *n* increases?)
