Memorization and retrieval in neural nets

Epistemomics

2024-03-03 — 2025-06-04

economics
faster pussycat
innovation
language
machine learning
mind
neural nets
NLP
technology
UI

Motivating question:

1 Which are your 100 bytes?

Leaked reports of unknown provenance (1, 2) suggest that GPT-4, which is more or less a structured encoding of all public human knowledge (let's call it the noösphere if not yet the Omega point), has about 1.8 trillion parameters. Assuming each parameter is a 16-bit float, that's around 3.6 terabytes of data. Split among the roughly 10 billion humans out there, that's a few hundred bytes each. A lot of the training set is historical, though, so let's credit half of it to the dead. So, ballpark, we alive today may be responsible for somewhere between 100 and 200 bytes each; call it 100.
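A back-of-envelope sketch of that arithmetic, using only the rumoured round numbers above:

```python
# Back-of-envelope: bytes of GPT-4 "knowledge" attributable to each living human.
# All inputs are the rumoured/round figures from the paragraph above.
params = 1.8e12           # rumoured GPT-4 parameter count
bytes_per_param = 2       # 16-bit floats
humans = 10e9             # generous round figure for the living population
living_share = 0.5        # assume half the credit goes to historical authors

total_bytes = params * bytes_per_param                      # 3.6e12 bytes ~= 3.6 TB
bytes_per_person = total_bytes / humans                     # ~360 bytes
living_bytes_per_person = bytes_per_person * living_share   # ~180 bytes

print(f"total:            {total_bytes / 1e12:.1f} TB")
print(f"per human:        {bytes_per_person:.0f} bytes")
print(f"per living human: {living_bytes_per_person:.0f} bytes")
```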

My question is: Which hundred bytes did you add to the sum of human knowledge?

You can answer it about me instead, if you’d like.

2 Taking this question actually seriously

The above is more of a thought experiment than a serious question, but we do have some tools that could help us get a quantitative answer.

NNs are famously great at interpolating between their training data points, and sometimes at “extrapolating” beyond them. This notebook is dedicated to questions about the former process. Working out how and what they have memorized is a load-bearing part of understanding how NNs work, and presumably of how to make them better.

Related: overparameterization, one of the non-obvious means by which networks memorize, and scaling laws, which we expect to be implicated…

3 Attribution

Working out influence functions (in the sense of: which parts of the training data are most influential on this prediction) for a model like GPT-4 is explored in Grosse et al. (2023). The question of which weights are most influenced by which training examples is a bit different, and probably not intrinsically interesting except insofar as it tilts predictions and thus encodes more stuff into the model.
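For reference, the classic influence-function approximation that Grosse et al. (2023) scale up (with Hessian approximations such as EK-FAC) estimates the effect on the loss at a query $z_c$ of upweighting a training example $z_m$ as, roughly,

$$
\mathcal{I}(z_m, z_c) \;\approx\; -\,\nabla_\theta \mathcal{L}(z_c, \hat\theta)^{\top} \big(H_{\hat\theta} + \lambda I\big)^{-1} \nabla_\theta \mathcal{L}(z_m, \hat\theta),
$$

where $\hat\theta$ are the trained parameters, $H_{\hat\theta}$ is the Hessian of the training objective at $\hat\theta$ (damped by $\lambda$ and approximated in practice), and $\mathcal{L}$ is the loss. Ranking training examples by this quantity is what "most influential on this prediction" means concretely.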

4 Storage density

Allen-Zhu and Li (2024) estimate about 2 bits of knowledge per parameter in a language model, in the limit:

Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model’s capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model’s knowledge storage capacity.

cf. Morris et al. (2025), who estimate it at about 3.6 bits/parameter:

We propose a new method for estimating how much a model “knows” about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter.

They argue that this can be used to understand grokking.
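Taking the two estimates at face value, the capacity arithmetic is a one-liner (the 1.8T parameter count is again just the rumour from above):

```python
# Knowledge capacity implied by the two bits-per-parameter estimates above.
BITS_PER_PARAM = {
    "Allen-Zhu & Li (2024)": 2.0,
    "Morris et al. (2025)": 3.6,
}

def capacity_gigabytes(n_params: float, bits_per_param: float) -> float:
    """Knowledge capacity in gigabytes for a model with n_params parameters."""
    return n_params * bits_per_param / 8 / 1e9

for source, bits in BITS_PER_PARAM.items():
    for name, n_params in [("7B model", 7e9), ("rumoured GPT-4, 1.8T", 1.8e12)]:
        print(f"{source}: {name} ≈ {capacity_gigabytes(n_params, bits):,.2f} GB")
```

By either estimate, only a fraction of the 16 bits each parameter occupies ends up as extractable knowledge (an eighth at 2 bits/parameter, a bit under a quarter at 3.6), which would trim the per-person byte budget from the opening section by a similar factor.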

5 Incoming

6 References

Allen-Zhu, and Li. 2024. “Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws.”
Grosse, Bae, Anil, et al. 2023. “Studying Large Language Model Generalization with Influence Functions.”
Morris, Sitawarin, Guo, et al. 2025. “How Much Do Language Models Memorize?”