
Leaked reports of unknown provenance (1, 2) suggest that GPT-4, which is more or less a structured encoding of all public human knowledge (let’s call it the noösphere if not yet the Omega point), has about 1.8 trillion parameters. Assuming each parameter is a 16-bit float, that’s around 3.6 terabytes of data, call it 4. With about 8 billion humans out there, we’re responsible for, on average, about 450 bytes of that knowledge each. In fact, there’s a lot of historical data in the training set, so let’s say half of it is historical. So, ballpark, we alive today may be responsible for a couple of hundred bytes each, on the order of a hundred bytes apiece.
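Here is that back-of-envelope as a runnable sketch. The parameter count is the leaked, unverified figure, the population is rounded, and the 50% “historical” discount is the hand-wave from the paragraph above:

```python
# Back-of-envelope: bytes of GPT-4 "knowledge" per living human.
# Assumptions: 1.8e12 parameters (leaked, unverified), 2 bytes per
# parameter (16-bit floats), ~8e9 humans alive, and a crude 50%
# discount for historical (dead-people) content.

params = 1.8e12          # leaked GPT-4 parameter count (unverified)
bytes_per_param = 2      # 16-bit float
population = 8e9         # humans alive today, roughly

total_bytes = params * bytes_per_param    # ~3.6e12 bytes, ~3.6 TB
per_capita = total_bytes / population     # ~450 bytes each
per_capita_living = per_capita * 0.5      # ~225 bytes, order 1e2

print(f"model size:     {total_bytes / 1e12:.1f} TB")
print(f"per capita:     {per_capita:.0f} bytes")
print(f"net of history: {per_capita_living:.0f} bytes")
```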

My question is: Which hundred bytes did you add to the sum of human knowledge?

You can answer it about me instead, if you’d like.

1 Taking this question actually seriously

This post is more of a thought experiment than a serious question, but we do have some tools that could help us get a quantitative answer.

Working out influence functions (in the sense of: which parts of the training data most influence a given prediction) for a model at the scale of GPT-4 is explored in Grosse et al. (2023). The converse question, of which weights are most influenced by which training examples, is a bit different, and probably not intrinsically interesting except insofar as those weights tilt predictions and thus encode more stuff into the model.
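To make that concrete, here is a minimal sketch of the classical influence-function computation on a toy logistic regression, in the Koh and Liang (2017) formulation: the influence of a training point on a test loss is \(-\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z_{\text{train}})\). Grosse et al. (2023) scale this idea to LLM size using EK-FAC approximations of the Hessian; nothing below is their implementation, and the toy data and names are mine:

```python
# Minimal classical influence functions (Koh & Liang, 2017) on a toy
# logistic regression, where the exact damped Hessian is cheap to form.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit w by a few damped Newton steps (adequate for a toy example).
w = np.zeros(d)
for _ in range(20):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n
    H = (X.T * (p * (1 - p))) @ X / n + 1e-4 * np.eye(d)
    w -= np.linalg.solve(H, grad)

# Per-example gradients and the damped Hessian at the optimum.
p = sigmoid(X @ w)
grads = (p - y)[:, None] * X                             # shape (n, d)
H = (X.T * (p * (1 - p))) @ X / n + 1e-4 * np.eye(d)

# Influence of each training point on the loss at one test point:
#   I(z_i, z_test) = -grad_i^T H^{-1} grad_test
x_test = rng.normal(size=d)
y_test = 1.0
g_test = (sigmoid(x_test @ w) - y_test) * x_test
influences = -grads @ np.linalg.solve(H, g_test)

print("most influential training points:",
      np.argsort(-np.abs(influences))[:5])
```

The “which hundred bytes” question is then, loosely, the transpose: scan over training documents and ask which ones move the model most.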

2 Incoming

  • Allen-Zhu and Li (2024):

    Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model’s capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model’s knowledge storage capacity.
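Taking the abstract’s 2 bits of knowledge per parameter at face value and plugging in the leaked GPT-4 parameter count from above tightens the earlier estimate (the population figure is again my rounding):

```python
# Applying Allen-Zhu and Li's ~2 bits of knowledge per parameter
# to the (leaked, unverified) GPT-4 parameter count used above.

params = 1.8e12
knowledge_bits = 2 * params               # 3.6e12 bits
knowledge_bytes = knowledge_bits / 8      # ~450 GB of "knowledge"
per_capita = knowledge_bytes / 8e9        # ~56 bytes per living human

print(f"knowledge capacity: {knowledge_bytes / 1e9:.0f} GB")
print(f"per capita:         {per_capita:.0f} bytes")
```

By that reckoning the headline figure drops towards tens of bytes each, before any discount for historical material.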

3 References

Allen-Zhu, Zeyuan, and Yuanzhi Li. 2024. “Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws.” arXiv:2404.05405.

Grosse, Roger, Juhan Bae, Cem Anil, et al. 2023. “Studying Large Language Model Generalization with Influence Functions.” arXiv:2308.03296.