Which are your 100 bytes?

Collective knowledge as training data

March 3, 2024

faster pussycat
machine learning
neural nets
Figure 1

Leaked reports of unknown provenance (1, 2) suggest that GPT4, which is more or less a structured encoding of all public human knowledge, has about 1.8 trillion parameters. Assuming that each parameter is a 16-bit float, that is around 4 terabytes of data. With on the order of 10 billion humans out there, we are each responsible for, on average, a few hundred bytes of that knowledge (3.6 terabytes over 10 billion people is about 360 bytes). In fact there is a lot of historical data in the training set, so let us say that half of it is historical. That leaves those of us alive today with something under 200 bytes each; call it 100 bytes, to the nearest order of magnitude.
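For anyone who wants to rerun the arithmetic with their own guesses, here is a minimal back-of-envelope sketch. Every input is one of the rough figures quoted above (the unverified parameter count, 16-bit floats, a round 10 billion people, half the data being historical); the function name and its arguments are made up for illustration.

```python
# Back-of-envelope: bytes of model weights per person.
# All inputs are the rough figures quoted in the text, not measured values.

PARAMS = 1.8e12          # leaked, unverified GPT4 parameter count
BYTES_PER_PARAM = 2      # assumes 16-bit floats
HUMANS = 1e10            # "on the order of 10 billion"
LIVING_SHARE = 0.5       # assumes half of the training data is historical


def bytes_per_person(params=PARAMS, bytes_per_param=BYTES_PER_PARAM,
                     humans=HUMANS, living_share=LIVING_SHARE):
    """Return (total_bytes, per_capita_bytes, per_living_person_bytes)."""
    total = params * bytes_per_param
    per_capita = total / humans
    return total, per_capita, per_capita * living_share


total, per_capita, per_living = bytes_per_person()
print(f"model size:        {total / 1e12:.1f} TB")
print(f"per person:        {per_capita:.0f} bytes")
print(f"per living person: {per_living:.0f} bytes")
```

With these inputs the result is 3.6 TB, 360 bytes per person and 180 bytes per living person, which is the same order of magnitude as the round numbers used here. Swapping in 8-bit weights or a different historical share moves the answer by a factor of two or so in either direction.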

My question is: Which hundred bytes did you add to the sum of human knowledge?

You can answer it about me instead, if you’d like.

1 Taking this question actually seriously

This post is more of a thought experiment than a serious question, but we do have some tools that could help us get a quantitative answer.

Grosse et al. (2023) explore how to work out influence functions (in the sense of: which parts of the training data are most influential on a given prediction) for a model like GPT4. The question of which weights are most influenced by which training examples is a bit different, and probably not intrinsically interesting except insofar as it tilts predictions and thus encodes more stuff into the model.
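To make the distinction between influence on weights and influence on predictions concrete, here is a minimal sketch of the classical influence-function calculation on a toy ridge regression. This is not the method of Grosse et al., which relies on approximations to scale to large language models; it assumes a tiny synthetic problem where the Hessian can be inverted exactly, and all names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: ridge regression on synthetic data.
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
lam = 1e-2  # ridge penalty; keeps the Hessian invertible


def fit(X, y, w):
    """Minimise sum_i w[i] * 0.5 * (x_i @ theta - y_i)**2 + 0.5 * lam * |theta|**2."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X + lam * np.eye(X.shape[1]), Xw.T @ y)


w0 = np.full(n, 1.0 / n)                 # equal weight on every example
theta = fit(X, y, w0)

H = X.T @ X / n + lam * np.eye(d)        # Hessian of the training objective
grads = X * (X @ theta - y)[:, None]     # per-example loss gradients, shape (n, d)

# Influence on the weights: row i is d(theta)/d(eps) when w[i] -> w[i] + eps.
param_influence = -np.linalg.solve(H, grads.T).T

# Influence on one prediction: chain rule through f(theta) = x_test @ theta.
x_test = np.array([0.5, 0.5, 0.5])
pred_influence = param_influence @ x_test

top = np.argsort(-np.abs(pred_influence))[:3]
print("training examples that most influence this prediction:", top)

# Sanity check: compare with actually refitting after a tiny up-weighting.
eps = 1e-7
w = w0.copy()
w[top[0]] += eps
finite_diff = (fit(X, y, w) - theta) / eps
print("matches refit:",
      np.allclose(finite_diff, param_influence[top[0]], rtol=1e-3, atol=1e-5))
```

Each row of `param_influence` answers the "which weights" question for one training example, while `pred_influence` collapses that row onto a single prediction. The second quantity is the one that says something about what the model has actually encoded.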

2 References

Grosse, Roger, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, et al. 2023. “Studying Large Language Model Generalization with Influence Functions.” arXiv.