Which are your 100 bytes?
Collective knowledge as training data
March 3, 2024
Leaked reports of unknown provenance (1, 2) suggest that GPT-4, which is more or less a structured encoding of all public human knowledge (let us call it the noösphere if not yet the Omega Point), has about 1.8 trillion parameters. Assuming that each parameter is a 16-bit float, that is around 4 terabytes of data. With on the order of 10 billion humans out there, we are responsible for, on average, about 400 bytes of that knowledge each. In fact, a lot of the training set is historical, so let’s say half of the knowledge comes from people no longer with us. So, ballpark, we alive today are each responsible for something on the order of 100 bytes.
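For concreteness, here is that back-of-the-envelope arithmetic spelled out. A minimal sketch: the parameter count, precision, population, and "half historical" split are all just the rough assumptions above, not measured facts.

```python
# Back-of-envelope: bytes of GPT-4 "knowledge" attributable to each person.
# Every input here is a rough assumption from the text.
params = 1.8e12              # reported parameter count
bytes_per_param = 2          # 16-bit floats
total_bytes = params * bytes_per_param      # ~3.6e12 bytes, i.e. roughly 4 TB

population = 1e10            # order of magnitude of humans alive today
per_person = total_bytes / population       # ~400 bytes

living_share = 0.5           # assume half the knowledge is historical
per_living_person = per_person * living_share   # ~200 bytes: order 100

print(f"~{per_person:.0f} bytes per person; "
      f"~{per_living_person:.0f} attributable to each living person")
```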
My question is: Which hundred bytes did you add to the sum of human knowledge?
You can answer it instead about me, if you’d like.
1 Taking this question actually seriously
This is more of a thought experiment than a serious research question, but we do have some tools that could, in principle, give a quantitative answer.
Influence functions in this sense (which parts of the training data are most influential on a given prediction) are worked out for models at GPT-4-like scale in Grosse et al. (2023). The question of which weights are most influenced by which training examples is a bit different, and probably not intrinsically interesting except insofar as it tilts predictions and thereby encodes more stuff into the model.
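To make the influence-function idea concrete, here is a toy version of the classic formulation (roughly Koh and Liang’s), not the EK-FAC machinery Grosse et al. (2023) actually use at scale. For a small L2-regularized logistic regression we can form the Hessian exactly and ask which training example most tilts the loss at a chosen test point; the synthetic data and all names below are mine.

```python
# Toy influence functions: I(z_i, z_test) = -grad L(z_test)^T H^{-1} grad L(z_i).
# Small logistic regression on synthetic data, Hessian formed exactly.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.3 * rng.normal(size=n) > 0).astype(float)

lam = 1e-2  # ridge term keeps the Hessian invertible

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta, x, t):
    # gradient of the per-example logistic loss plus the regularizer's share
    return (sigmoid(x @ theta) - t) * x + lam * theta

# Fit by plain gradient descent (adequate for a toy problem this size).
theta = np.zeros(d)
for _ in range(2000):
    g = (sigmoid(X @ theta) - y) @ X / n + lam * theta
    theta -= 0.5 * g

# Hessian of the mean loss at the fitted parameters.
p = sigmoid(X @ theta)
H = (X * (p * (1 - p))[:, None]).T @ X / n + lam * np.eye(d)

# Influence of each training example on the loss at one test point.
x_test, y_test = rng.normal(size=d), 1.0
v = np.linalg.solve(H, grad(theta, x_test, y_test))
influences = -np.array([grad(theta, X[i], y[i]) @ v for i in range(n)])

print("most influential training index:", int(np.argmax(np.abs(influences))))
```

At GPT-4 scale nothing like an exact Hessian solve is tractable, which is exactly why Grosse et al. resort to Kronecker-factored approximations of the curvature.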