Distributed NN training
The People’s Supercomputer
2025-05-29 — 2025-11-22
Wherein is recounted the training of a 15‑billion‑parameter neural network across thousands of disparate machines, coordination and remuneration being effected via the Solana blockchain.
The open-source AI scene has been kicking goals. People have wrestled models, datasets, and all the fixings away from the bigwigs. The final boss is access to expensive compute. Training a foundation model from scratch requires a warehouse full of GPUs that costs more than a small nation’s GDP. It’s been the one thing keeping AI development firmly in the hands of a few tech giants with cash to burn.
Until now, maybe. What's the citizen-science equivalent for the NN age? Kludging together a supercomputer out of thousands of machines scattered across the globe. What if we could train a massive AI model on a network of gaming PCs, spare uni servers, and dusty rigs in garages?
1 DisTrO
I first got a whiff of this from some bloke at an ML conference. He was a new hire at Nous Research, and their plan sounded completely off its rocker: training a proper large language model over the public internet.
Now, any network engineer will tell us that’s a stupid idea. An AI supercluster works because all the GPUs are in one room, hooked together with ridiculously fast, low-latency pipes. The internet, on the other hand, is a chaotic mess of slow, unreliable connections held together with sticky tape and hope. Coordinating thousands of GPUs over it looks, at first glance, impossible.
Turns out, “impossible” is a dare to some people. The magic trick from the Nous folks is a bit of tech they call DisTrO (Distributed Training Over-the-Internet). Think of it as a compression algorithm for the training process: it cuts the network chatter between machines down by a factor of roughly 10,000, so each machine can get on with its work without constantly checking in. That makes the whole thing feasible over an average internet connection.
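To get a feel for how a ~10,000× reduction is even conceivable, here's a toy sketch of top-k gradient sparsification — a generic compression trick from the literature, not DisTrO's actual (more sophisticated) algorithm. Each worker ships only the largest-magnitude entries of its update:

```python
import numpy as np

def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries of a gradient tensor."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # positions of the top-k entries
    return idx, flat[idx]

def topk_decompress(idx, vals, shape):
    """Rebuild a dense (mostly zero) gradient from the sparse payload."""
    flat = np.zeros(shape).ravel()
    flat[idx] = vals
    return flat.reshape(shape)

rng = np.random.default_rng(0)
g = rng.standard_normal((1000, 1000))    # a million-entry stand-in "gradient"
idx, vals = topk_compress(g, k=100)      # ship 100 values (+ indices), not a million
g_hat = topk_decompress(idx, vals, g.shape)
# The payload is ~10,000x smaller than the dense gradient it summarizes.
```

Real systems layer error feedback and smarter encodings on top, but the shape of the bet is the same: most of a gradient is noise you can afford not to send.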
And how do we get thousands of strangers to lend us their expensive GPUs, and keep track of it all without a central authority? Crypto, of course: the system uses the Solana blockchain to coordinate the run and to reward participants for their GPU time.
They’ve arguably already pulled it off, training a 15-billion-parameter model and live-streaming the process to show it wasn’t smoke and mirrors.
2 Other players
Meeting that Nous fellow at ICLR gave me a look at their work, but they’re hardly the only players. This whole idea of a decentralized “compute commons” has been brewing for a while.
In fact, the crypto world has been trying to build GPU marketplaces for years, even before the current AI boom. Projects like the Golem Network (one of the original Ethereum ICOs) and the Render Network (RNDR) pioneered the idea of letting people rent their idle GPU power for tasks like CGI rendering. They laid the groundwork and proved a decentralized market could work, even if they weren’t tackling the specific beast of LLM training.
Today, we’ve got a whole ecosystem of projects attacking this from different angles:
- FLock.io is all about “Federated Learning,” a privacy-first approach where models train on our local data so we don’t have to upload it.
- Gensyn and Prime Intellect are also building blockchain-based protocols to create a global, trustless marketplace for AI compute.
- Meanwhile, outfits like Together AI are hammering away at the software side, releasing open-source tools that make distributed training more efficient for everyone.
3 Nuts and bolts
If we want to go down the rabbit hole on how this stuff actually works, here are a few starting points.
Federated Learning (“Don’t Move the Data”). The foundational paper from Google researchers (McMahan et al. 2023) is a good place to start for the core, privacy-preserving concepts many of these projects build on.
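The FedAvg loop at the heart of that paper is surprisingly small: clients take gradient steps on their own data, and a server averages the resulting weights. Here's a toy sketch with linear-regression clients and synthetic data (everything here is illustrative) — note the raw data never leaves a client, only weights travel:

```python
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of least-squares on a client's private (X, y).
    The data stays put; only the updated weights are sent back."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def fed_avg(w, clients, rounds=50):
    """Each round: every client trains locally, the server averages weights."""
    for _ in range(rounds):
        w = np.mean([local_step(w, X, y) for X, y in clients], axis=0)
    return w

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):                        # five clients with private datasets
    X = rng.standard_normal((40, 2))
    clients.append((X, X @ true_w))       # noiseless targets, for simplicity
w = fed_avg(np.zeros(2), clients)         # recovers true_w without pooling data
```

The real algorithm weights the average by client dataset size and runs several local epochs per round, but this is the skeleton.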
Communication Compression. The core problem in distributed training is communication overhead. Vogels, Karimireddy, and Jaggi (2020) is a classic example of the algorithmic tricks used to reduce it dramatically; DisTrO from Nous is a modern evolution of these ideas.
For a look at the software frameworks that make large-scale training manageable (even in a centralized data center), the blog post from Hugging Face and Microsoft on DeepSpeed is excellent. It explains concepts like model sharding in a clear way.
