The open-source AI scene has been kicking goals. People have wrestled models, datasets, and all the fixings away from the big-wigs. The final boss of that game is access to expensive compute. Training a foundation model from scratch takes a warehouse full of GPUs that costs more than a small nation’s GDP, and that’s the one thing keeping AI development firmly in the hands of a few tech giants with cash to burn.

Until now, maybe. The citizen-science play for the neural-net age is to kludge together a supercomputer out of thousands of machines scattered across the globe. What if we could train a massive AI model using a network of gaming PCs, spare uni servers, and dusty rigs in garages?

1 DisTrO

I first got a whiff of this from some bloke at an ML conference. He was a new hire at Nous Research, and their plan sounded completely off its rocker: training a proper large language model over the public internet.

Now, any network engineer will tell you that’s a stupid idea. An AI supercluster works because all the GPUs are in one room, hooked together with ridiculously fast, low-latency pipes. The internet, on the other hand, is a chaotic mess of slow, unreliable connections held together with sticky tape and hope. Trying to coordinate thousands of GPUs over it looks impossible at first glance.

Turns out, “impossible” is a dare to some people. The magic trick from the Nous lads is a bit of tech they call DisTrO (Distributed Training Over-the-Internet). Think of it as a super-smart compression scheme for the training process: it cuts down the chatter needed between all the computers by a factor of something like 10,000. That means each machine can get on with its work without constantly checking in, which makes the whole thing far more feasible over your average internet connection.
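Nous hasn’t published every last detail of DisTrO, so treat the following as a minimal sketch of the family of tricks it belongs to, not their actual algorithm: top-k gradient sparsification with error feedback, where each machine ships only the largest entries of its gradient each step and carries the dropped remainder forward to the next step.

    import numpy as np

    def topk_compress(grad, k):
        # Keep only the k largest-magnitude entries; send (indices, values).
        flat = grad.ravel()
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        return idx, flat[idx]

    def decompress(idx, vals, shape):
        # Rebuild a dense gradient that is zero except at the kept entries.
        flat = np.zeros(int(np.prod(shape)))
        flat[idx] = vals
        return flat.reshape(shape)

    class Worker:
        # Sends roughly 0.1% of its gradient each step. The error buffer holds
        # whatever was dropped, so that information is delayed, not lost.
        def __init__(self, shape, ratio=0.001):
            self.error = np.zeros(shape)
            self.k = max(1, int(np.prod(shape) * ratio))

        def compress(self, grad):
            corrected = grad + self.error          # add back what was dropped earlier
            idx, vals = topk_compress(corrected, self.k)
            self.error = corrected - decompress(idx, vals, grad.shape)
            return idx, vals

    # Toy run: two workers, one parameter matrix, averaged sparse updates.
    rng = np.random.default_rng(0)
    shape = (256, 256)
    params = rng.normal(size=shape)
    workers = [Worker(shape) for _ in range(2)]
    for step in range(3):
        updates = [decompress(*w.compress(rng.normal(size=shape)), shape) for w in workers]
        params -= 0.01 * np.mean(updates, axis=0)

With a 0.1% ratio, each 65,536-element matrix turns into a couple of hundred numbers on the wire instead of the full tensor, which is the kind of reduction that makes home broadband survivable.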

And how do you get thousands of strangers to lend you their expensive GPUs and actually keep track of it all in a decentralized system? Crypto, of course. They use the Solana blockchain to create a transparent ledger that coordinates the whole shebang and, more importantly, pays people for their GPU time. It’s not a charity; it’s a decentralized, transactional economy for compute power.
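To be clear about what’s theirs and what’s mine: the snippet below is not Nous’s on-chain program and doesn’t touch any real Solana API. It’s just a hypothetical illustration of the kind of record such a ledger has to keep so contributions can be audited and paid out.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ComputeContribution:
        # Hypothetical fields -- illustrating the bookkeeping, not Nous's actual schema.
        contributor: str       # wallet / public key of the GPU owner
        job_id: str            # which training run the work belongs to
        update_hash: str       # hash of the submitted gradient update, for auditing
        gpu_seconds: float     # how much work was done
        payout_tokens: float   # what the contributor earned for it

    ledger: list[ComputeContribution] = []
    ledger.append(ComputeContribution(
        contributor="7xKQ...placeholder",
        job_id="run-15b",
        update_hash="b3f9...placeholder",
        gpu_seconds=3600.0,
        payout_tokens=1.25,
    ))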

They’ve reportedly already pulled it off, training a 15-billion-parameter model this way and live-streaming the process to prove it wasn’t smoke and mirrors.

2 Other players

As it turns out, Nous isn’t the only crew having a go. This whole idea of a decentralized “compute commons” has been brewing for a while.

In fact, the crypto world has been trying to build GPU marketplaces for years, even before the current AI boom. Projects like the Golem Network (one of the original Ethereum ICOs) and the Render Network (RNDR) pioneered the idea of letting people rent out their idle GPU power for tasks like CGI rendering. They laid the groundwork and proved a decentralized market was possible, even if they weren’t tackling the specific beast of LLM training.

Today, you’ve got a whole ecosystem of projects attacking this from different angles:

  • FLock.io is all about “Federated Learning,” a privacy-first approach where the model trains on your local data without that data ever leaving your machine (there’s a toy FedAvg sketch after this list).
  • Gensyn and Prime Intellect are also building blockchain-based protocols to create a global, trustless marketplace for AI compute.
  • Meanwhile, outfits like Together AI are hammering away at the software side, releasing open-source tools that make distributed training more efficient for everyone.
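Since “federated learning” is doing a lot of work in that first bullet, here’s a minimal sketch of the textbook version (FedAvg, in the spirit of the Google paper cited in the next section), not FLock.io’s actual protocol. Each client trains on data it never uploads; only model weights travel.

    import numpy as np

    def local_step(weights, X, y, lr=0.1, epochs=5):
        # Train a linear model on one client's private data; the data never leaves.
        w = weights.copy()
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
            w -= lr * grad
        return w

    def federated_averaging(global_w, clients, rounds=10):
        # Classic FedAvg: clients send back weights, the server averages them.
        for _ in range(rounds):
            local_models = [local_step(global_w, X, y) for X, y in clients]
            sizes = [len(y) for _, y in clients]
            # Weighted average, so clients with more data count for more.
            global_w = np.average(local_models, axis=0, weights=sizes)
        return global_w

    # Toy usage: three "clients" holding private slices of the same linear problem.
    rng = np.random.default_rng(1)
    true_w = np.array([2.0, -1.0, 0.5])
    clients = []
    for n in (50, 80, 120):
        X = rng.normal(size=(n, 3))
        clients.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

    w = federated_averaging(np.zeros(3), clients)
    print(np.round(w, 2))   # lands close to [2.0, -1.0, 0.5]

The server only ever sees weights, never rows of anyone’s data, which is the entire privacy argument.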

3 Nuts and bolts

If you want to go down the rabbit hole on how this stuff actually works, here are a few starting points.

  1. Federated Learning (“Don’t Move the Data”): The foundational paper from Google researchers (McMahan et al., 2017) is a good place to start to understand the core privacy-preserving concepts that many of these projects build on.

  2. Communication Compression. The core problem in distributed training is communication overhead. Vogels, Karimireddy, and Jaggi (2019), the PowerSGD paper, is a classic example of the kind of algorithmic trick used to dramatically reduce it. DisTrO from Nous is a modern evolution of these ideas.

  3. For a look at the software frameworks that make large-scale training manageable (even in a centralized data center), the blog post from Hugging Face and Microsoft on DeepSpeed is excellent. It explains concepts like model sharding in a clear way (there’s a toy sketch of the sharding idea after this list).

    Rajbhandari, S., et al. (2021). DeepSpeed: Extreme-scale model training for everyone. Microsoft Research Blog.
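The sharding idea from that DeepSpeed post is easy to picture in miniature. The snippet below is not the DeepSpeed API, just a toy, Adam-style illustration (bias correction omitted) of the core ZeRO trick: keep the full parameters everywhere, but split the optimizer state across workers so each one stores and updates only its own slice.

    import numpy as np

    N_WORKERS = 4
    params = np.random.default_rng(2).normal(size=1_000_000).astype(np.float32)
    shards = np.array_split(np.arange(params.size), N_WORKERS)

    # Adam keeps two extra buffers per parameter; shard them instead of replicating.
    m = [np.zeros(len(s), dtype=np.float32) for s in shards]  # first moment, per worker
    v = [np.zeros(len(s), dtype=np.float32) for s in shards]  # second moment, per worker

    def sharded_adam_step(grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        for rank, idx in enumerate(shards):
            g = grad[idx]                      # each worker only touches its slice
            m[rank] = b1 * m[rank] + (1 - b1) * g
            v[rank] = b2 * v[rank] + (1 - b2) * g * g
            params[idx] -= lr * m[rank] / (np.sqrt(v[rank]) + eps)
        # In a real system an all-gather would broadcast the updated slices so
        # every worker ends the step with identical full parameters.

    sharded_adam_step(np.random.default_rng(3).normal(size=params.size).astype(np.float32))
    full = 2 * params.nbytes            # replicated Adam buffers, in bytes
    print(f"optimizer state per worker: {full / N_WORKERS / 1e6:.0f} MB instead of {full / 1e6:.0f} MB")

Scale that saving up to the optimizer state of a 15-billion-parameter model and you can see why sharding is what lets these runs fit on ordinary hardware at all.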

4 References