DIY Generative AI
Community resource and epistemological infrastructure
2024-06-15 — 2026-05-11
In Which the Supply Chain of Open Generative Models Is Decomposed Into Constituent Layers, From Volunteer-Scraped Datasets to Inference Code Running on Consumer Laptops, and the Remarkable Facility With Which Communities Are Hacking on the Whole Stack Is Noted.
Despite the formidable barriers — technical know-how, expensive infrastructure, etc. — communities are figuring out how to DIY large generative models. Since I grew up on cypherpunk DIY myself, I can’t help being captivated.
1 Folk History of Open-Source Generative AI
The drive to democratize generative AI is a response to a field initially dominated by a few large, well-funded technology companies. Early, powerful models were often kept proprietary, accessible only through paid APIs. But hackers wanna hack.
Building a large neural network takes more than a shell account and a copy of Phrack. The ingredients have their own supply chains, and the open-source AI story is largely about which ones the community can source, share, or substitute for, and which still bottleneck on a few well-funded labs. The word “open” gets stretched across most of them indiscriminately, so it pays to keep them apart.
- Compute
- Big GPUs in the right configuration. The hardest piece to substitute, and the one most tightly coupled to capital and geopolitics. Communities pool, rent, beg, or scavenge, but mostly we wait for someone else’s training run to finish and then download the weights and tweak them a bit.
- Data
- Text, images, video, code at terabyte-to-petabyte scale, mostly scraped from the open web. LAION and EleutherAI’s The Pile are the canonical open-data efforts; Common Crawl is the underlying substrate for almost everyone, open and closed. Almost nobody publishes their final training mix — too much copyright exposure, too much competitive advantage.
- Algorithms
- Used to be the most-published ingredient. Foundational techniques — transformer, diffusion, MoE, attention variants, RLHF basics — got papers, often early. That has changed. As the field commercialised, the US frontier labs (OpenAI, Anthropic, Google) stopped publishing the specifics of what makes their latest models work, and we have no way of telling from outside how much of the algorithmic moat is novelty vs. better execution of known ideas vs. quietly-better data. The Chinese open-weights labs seem to be publishing more methodology than the closed Western ones now, which AFAICT inverts the previous decade’s pattern. Reading the paper, when there is one, is necessary but nowhere near sufficient.
- Training code and recipes
- Knowing the algorithm and being able to run it at scale on real hardware are different things, because training big models is hard. A handful of labs (AI2 with OLMo, BigScience with BLOOM, EleutherAI with Pythia, increasingly DeepSeek and HuggingFace’s SmolLM) publish the full recipe — data mix, hyperparameters, scaling decisions, the lot. Most release only the weights, leaving the kitchen door closed. “Open weights” is not “open source”.
- Trained weights
- The visible deliverable. Llama, Qwen, DeepSeek, Mistral, Flux dev, SDXL — what many people mean when they say “open-source AI”, though the term conflates several distinct things. A weight file is a frozen artefact; given enough compute and the right framework, anyone can run it, but reproducing or even understanding it from scratch is a different problem.
- Inference code
- Once weights exist, the code that runs them is by now commodity. llama.cpp, vLLM, ollama, ComfyUI, DiffusionBee are fast, portable, hackable, often community-maintained, and frequently better than whatever the model’s authors shipped. A laptop runs a multibillion-parameter model with code that nobody at the original training lab wrote. This is the layer where DIY has most thoroughly won.
- Expertise
- The scarcest ingredient and the one that does not transfer with a download. Knowing which loss curve to trust, which dataset deduplication is load-bearing, when a model is broken vs. just undertrained, when to throw out a run and when to push through. Concentrated in a few labs, leaking slowly through papers, blog posts, fired employees, and the occasional indiscreet Discord.
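To make the laptop claim in the inference-code item concrete, here is a back-of-envelope sketch in Python. It is not taken from any of the tools named above. The function names and the 1.2x overhead multiplier are assumptions for illustration; the only solid part is the arithmetic that weights occupy roughly parameters times bits-per-weight divided by eight, in bytes.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (2**30 bytes)."""
    return n_params * bits_per_weight / 8 / 2**30


def fits_in_memory(n_params: float, bits_per_weight: float,
                   ram_gib: float, overhead: float = 1.2) -> bool:
    """Rough check of whether a model fits in a given amount of memory.

    `overhead` is a guessed multiplier covering KV cache, activations and
    runtime buffers. It is an assumption, not a measured figure, and real
    overhead depends on context length and on the inference runtime.
    """
    needed = weight_footprint_gib(n_params, bits_per_weight) * overhead
    return needed <= ram_gib


if __name__ == "__main__":
    # A hypothetical 7-billion-parameter model at three common precisions.
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = weight_footprint_gib(7e9, bits)
        print(f"7B params at {label}: ~{size:.1f} GiB of weights")
    print("fits in 16 GiB at 4-bit:", fits_in_memory(7e9, 4, ram_gib=16))
```

At 16 bits a 7-billion-parameter model needs about 13 GiB for weights alone, while 4-bit quantization brings that down to a little over 3 GiB. That gap is a large part of why quantized weights plus community inference code put these models within reach of consumer hardware.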
1.1 Image generation
When models like DALL-E first appeared, they lived in a corporate walled garden. The 2022 release of Stable Diffusion opened things up — a potent, open-weights model that let anyone with a decent gaming PC generate high-quality images. Compute came from Stability AI; the architecture and training methodology were published as papers; the dataset came from LAION, a non-profit that had done the hard work of scraping and aligning image-text pairs at scale; inference clients sprang up from the community in short order. All four ingredients in public, even if the training recipe was seemingly less tidy than the model card implied.
In 2024 a chunk of the original Stable Diffusion researchers walked out of Stability AI over disagreements about open-source commitments and commercialization strategy, regrouped as Black Forest Labs, and shipped the Flux family — a transformer-based model released as a “dev” version (open-weights, non-commercial) and a “pro” version (paid API). Their editing variant, FLUX.1 Kontext, is AFAICT the first open-weights edit model that does not feel like a toy. The fault line keeps reappearing in slightly different forms: corporate models (Flux pro, DALL-E 3, Midjourney) offer plug-and-play polish but lock advanced features behind APIs; community-tuned ones (SDXL, the CivitAI menagerie) offer customization, autonomy, and the ability to coax out things the mainstream models’ guardrails won’t — idiosyncratic styles, violence, sexual content, nazis. Black Forest hedges: open weights for the dev release, paid API for the pro version.
1.2 Language model pirates
When GPT-3 was kept under lock and key, a scrappy crew of volunteer researchers on a Discord server formed EleutherAI in 2020. Their explicit goal was to replicate the model and give it away. They bootstrapped their own massive dataset, The Pile, released GPT-Neo, GPT-J, and GPT-NeoX-20B in turn, and demonstrated that a distributed collective could take on the giants — at least in the lighter weight classes. Meta later started open-weights releases of much more powerful models, beginning with LLaMA (initially leaked rather than released; Meta has since seemingly embraced the leak as strategy), shifting the open-weights frontier into the multi-billion-parameter range. AFAICT EleutherAI got the open-LLM ball rolling, Meta made it big, Mistral added a European actor — and then the centre of gravity shifted again.
1.3 Chinese open-weights
In 2024, Chinese labs decided there was no particular reason to follow the old script, in which the frontier stays closed and open weights catch up a year or two later. Alibaba’s Qwen series, DeepSeek-V2/V3 and the R1 reasoning model, Moonshot’s Kimi, Zhipu’s GLM, 01.AI’s Yi, and a steady drip of smaller releases from Tencent (Hunyuan), ByteDance, MiniMax, StepFun, and others arrived with permissive licences, fluent in English and Chinese, at scales and quality matching or beating closed Western models on many benchmarks. Maybe. Given US export controls on high-end GPUs, aggressive open-sourcing has been widely read as a baller geopolitical play as much as a research strategy: recruit downstream users, distribute the costs of fine-tuning and tooling, build legitimacy outside the export-control-and-API regime, commoditize the layer above whoever is selling the chips.
DeepSeek’s R1 in January 2025 upset the apple cart: an open-weights reasoning model competitive with OpenAI’s o1, released with a paper explaining the RL training recipe in unusual detail. It cratered Nvidia’s market cap for a week and sent every Western lab into a hurried “we are also open” press cycle. Whether or not the architectural and training-cost claims fully replicate [1], the availability changed expectations. “Open” no longer means “a year behind”.
These labs differ in how much they reveal upstream of the weights. DeepSeek seems to publish more methodology and partial training detail than most; the rest release weights, a model card, and not much more. Training data: essentially never. Same conflation as everywhere else — but the closed-model players seemingly cannot use “we’re so far ahead it doesn’t matter” as a rebuttal any more.
1.4 Sovereign and public-good initiatives
There is a smaller but interesting cluster of state-funded open-weights efforts that follows neither the US-commercial nor the Chinese-commercial model: non-superpower governments that decided this was strategically worth doing themselves. Switzerland’s Apertus (ETH, EPFL, and the Swiss National Supercomputing Centre, September 2025) actually publishes the whole recipe (weights, training data, methods), built for the public good with serious effort on Swiss German, Romansh, and other long-underserved languages. Singapore’s SEA-LION (via AI Singapore) does the equivalent for 11 Southeast Asian languages, including Thai, Burmese, Lao, Khmer, Tamil, and Filipino, which the global frontier models handle poorly. The UAE’s Falcon family from TII fits the same pattern, and France’s Jean Zay supercomputer was behind BigScience’s BLOOM (mentioned above; see BigScience below). These projects usually trail the commercial frontier but encode different priorities (multilingualism, regional language coverage, public-interest licensing) that the commercial labs do not optimise for.
The same logic scales down to mutual-aid-sized collectives, where what the Chinese open-weights wave actually buys us is the ability to host the supply chain locally. I’ve been costing out a 50-person Australian compute club that would run such models on our own hardware — sovereign AI as friendly-society infrastructure rather than national-strategic project. Follow along at danmackinlay/SOV.
1.5 Hacking on a shoestring
Buzzword: Jugaad, a kind of frugal, resourceful, get-it-done innovation that thrives when there is no billion-dollar data centre to work with. Make do, hack together what’s at hand. Popular in community Discords, where builders DIY for genAI. Especially hot with respect to globalizing AI, breaking it free from geographically concentrated power. The poster child for Jugaad might be Masakhane, a grassroots, pan-African collective building NLP models for African languages that Big Tech ignores. Their name literally means “we build together,” and they operate as a distributed network of researchers — foundational work without a centralised HQ. The same spirit shows up in applied AI like Kenyan startup Tenakata, which uses AI on basic mobile-phone data to create credit scores for smallholder farmers with no formal banking history. Not a new foundation model; a clever, resource-light hack at a pressing problem. Innovation by necessity, not excess.
1.6 Fandoms
A reliable driver for weird, obsession-heavy projects is… nerds who like weird stuff. Fandoms are tried-and-tested distributed innovation communities, and that holds for AI too.
First, the My Little Pony fandom. When Friendship is Magic ended in 2019, a segment of the fanbase decided they just weren’t done with it. This led to the grassroots Pony Preservation Project, which originated on 4chan’s /mlp/ board with a clear mission to meticulously collect, clean, and align voice data from every episode. Not just archiving — building a high-quality dataset specifically for AI training. That dataset became a key component for 15.ai, a landmark text-to-speech platform that launched in 2020 and let anyone generate shockingly accurate dialogue in the characters’ voices. That led to much fun and equally much drama. The result was an explosion of fan-made content, including fully AI-voiced animations, effectively continuing the show through sheer force of will and technical ingenuity.
Second, communities like AI HUB. AI HUB became the go-to Discord community for the ethically murky world of cloning the voices of actual musicians and other public figures. The community doesn’t just use voice cloning models; it’s a hotbed for sharing and developing the tools themselves — pre-trained models of specific singers, new techniques for separating vocals from tracks to make clean training data, training pipelines, sometimes even algorithms. A decentralised R&D and distribution network that lowers the barrier to entry for a controversial and technically demanding application of AI.
Porn is probably in the same category, but I cannot even.
Fan projects frequently test the edges of copyright law, and fans often have smaller budgets than copyright holders or large tech companies, so these projects either get mired in drama or keep a low profile.
1.7 A mandatory AI safety qualm
As with all DIY, this stuff is definitely dangerous. Insofar as empowering people with more tools makes people more powerful, democratizing it puts it in the hands of nice, disempowered people and terrorists. Except these machines might themselves eventually autonomously behave nicely, or terroristically.
There is also a danger in letting powerful models be in the hands of only a small number of geopolitical actors.
Our risk model should take these into account — and now we are having a conversation about broader risks and d/Acc.
2 The Players
A motley crew of non-profits, academic-adjacent labs, grassroots collectives, and a few companies playing the open-source game.
2.1 Hugging Face
If this scene has a distro outlet, it must be Hugging Face. They started as a chatbot company and pivoted into being a central repository for much of open-source AI — models, datasets, and the transformers library that lets us actually use the stuff without pulling our hair out. AFAICT every other organization on this list, from EleutherAI to DeepSeek, hosts at least some models there. A Schelling point of sorts.
2.2 CivitAI
Where Hugging Face is general-purpose, CivitAI is a specialised marketplace for the diffusion-model fine-tuning community. It hosts the LoRAs, textual inversions, full fine-tunes, and aesthetic gradients that power the long tail of niche styles — anime, photorealism, particular painters, particular fictional characters, particular kinks. Visual browsing with prompt-and-output previews; simple download; an API that the local diffusion clients (DiffusionBee, Draw Things, ComfyUI, …) plug into. More permissive content policy than Hugging Face, which is part of the appeal and part of why some people are wary. Pony Diffusion V6 XL — an SDXL fine-tune trained on character art from booru-style databases, with no RLHF passes applied — is a concrete example of the community making explicit choices against the alignment priorities of production models. Its name is unrelated to the MLP voice-synthesis projects elsewhere on this page; the downstream ecosystem of hundreds of derivative fine-tunes has become the main habitat for image generation that handles negative affect, dramatic poses, and creature design outside the cheery defaults. See the project about page.
2.3 Allen Institute for AI (AI2)
Bankrolled by the late Microsoft co-founder Paul Allen. AI2 is a non-profit that tackles long-term research that corporations apparently won’t and universities apparently can’t. They built foundational tools like AllenNLP for language research and the Semantic Scholar search engine. With their OLMo model they went further than most and released not just the weights but the training data and code as well — not just the cooked meal, the whole recipe.
2.4 EleutherAI
A grassroots collective born on Discord around 2020 that replicated GPT-3-class models and gave them away. They scraped the web to assemble The Pile dataset when, AFAICT, nobody else was doing it openly at that scale. Without their proof-of-concept that a distributed volunteer effort could produce credible LLMs, much of what followed seems unlikely to have happened — although counterfactuals are hard.
2.5 LAION
A German non-profit of data hoarders for the people, with a simple but monumental mission: create and release the massive datasets that AI models need to learn from. LAION-5B (5.8 billion image-text pairs) is their best-known release. Models need food; LAION provides a public buffet. Without their work, no Stable Diffusion. They sit upstream of much of the open image-generation pipeline.
2.6 BigScience
Not a permanent group but a year-long “Woodstock for AI researchers” (this punk metaphor is getting overwrought, sorry). Coordinated by Hugging Face, it brought over 1,000 researchers together in a global jam session to build BLOOM, a massive multilingual language model. A proof of concept for a different way of doing research — open, global, collaborative — and a counter-narrative to the idea that only secretive corporate labs could build state-of-the-art models. AFAICT nobody has repeated it at the same scale since.
2.7 Chinese open-weights labs
As of mid-2026 the roster includes DeepSeek on coding, math, and reasoning — R1 in January 2025 was the moment this stopped being a niche story. Qwen (Alibaba) does broad-purpose multilingual and is apparently the most-downloaded open-weights family on HuggingFace by some margin. Kimi (Moonshot) covers long-context and agentic tasks. GLM (Zhipu, out of Tsinghua’s KEG lab) covers bilingual chat. Yi (01.AI, Kai-Fu Lee’s outfit) does efficient general-purpose models. Plus a long tail of smaller releases from Tencent, ByteDance, MiniMax, StepFun, and others. The pecking order shifts every quarter.
These labs have shifted the global default for what an open-weights model costs and what it can do, and AFAICT shifted who the protagonists of the open-source story are. The answer in 2026 reads less like “American hobbyist hackers” or “European research consortia” than “Chinese commercial labs with strategic reasons to give weights away” — export-control geopolitics, market structure (no domestic OpenAI-equivalent to undercut), and the familiar move of commoditizing the layer above whoever is selling the chips and the API. Releases are mostly weights-plus-model-card; DeepSeek seems to be the outlier in publishing methodology; training data is essentially never released by anyone.
2.8 DAIR Institute
Founded by Dr. Timnit Gebru after her high-profile exit from Google. An independent research institute, claiming “We publish interdisciplinary work uncovering and mitigating the harms of current AI systems, and research, tools and frameworks for the technological future we should build instead.” It aims to be a conscience of the scene, providing critique and ethical frameworks that try to keep the community from rebuilding the same exploitative systems it claims to be escaping.
2.9 Agora
A hyper-grassroots collective of engineers and creators organised on Discord, seemingly driven by individuals like Kye Gomez. Focus on the bleeding edge — multi-agent systems, real-time learning, augmenting AI reasoning — and radically open in the share-before-it-works mode, in service of accelerating collaboration at the fringes. I bet there are other projects like Agora; it has the distinction of being the one I ran into.
2.10 Oumi
Where others provide finished models, Oumi is building a platform and Python library to streamline the entire process of rolling our own — curating data, fine-tuning, deploying. If Hugging Face is where we get finished models, Oumi seems to want to give us the tools to build our own from scratch — democratizing the process of creation, not just the artefacts.
3 Democratizing training on consumer hardware
4 Incoming
Connor Leahy on EleutherAI describes many scenius-type dynamics making these early teams go.
SCHEME - Stories from the Strange Frontier really wants to document all this:
Our goal is to platform voices and perspectives that would otherwise get lost in the ether, slop or drudge of AI discourse.
We are looking for an open-minded orientation to the future, in all its strange complexity.
Kimi K2: Open Agentic Intelligence is an open-source agentic AI model
Kimi K2 is our latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. But it goes further — meticulously optimized for agentic tasks, Kimi K2 does not just answer; it acts.
I sensed anxiety and frustration at NeurIPS’24 – Kyunghyun Cho
LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation | LMSYS Org
cf Stability, RWKV
Footnotes
1. AFAICT some did, some did not, and the question is still live.
