DIY Generative AI

Community resource and epistemological infrastructure

2024-06-15 — 2026-05-11

In Which the Supply Chain of Open Generative Models Is Decomposed Into Constituent Layers, From Volunteer-Scraped Datasets to Inference Code Running on Consumer Laptops and the Remarkable Facility With Which Communities Are Hacking on the Whole Stack Is Noted.

agents

bounded compute

collective knowledge

distributed

economics

edge computing

extended self

faster pussycat

incentive mechanisms

innovation

language

machine learning

neural nets

NLP

resilient tech

slop

technology

Despite the formidable barriers — technical know-how, expensive infrastructure, etc. — communities are figuring out how to DIY large generative models. Since I grew up on cypherpunk DIY myself, I can’t help being captivated.

1 Folk History of Open-Source Generative AI

The drive to democratize generative AI is a response to a field initially dominated by a few large, well-funded technology companies who probably had scale as their moat, and likely wanted to maximise the perceived depth of that moat. Early, powerful models were often kept proprietary, accessible only through paid APIs. However, hackers wanna hack.

Building a large neural network takes more than a shell account and a copy of Phrack. In fact, the ingredients are the very peak of industrialised, capital-intensive, geopolitically fraught technology. The open-source AI story is largely about which of those ingredients the community can scrounge, share, or substitute for. Let us disambiguate the parts.

Compute: Big GPUs in the right configuration. The hardest piece to substitute, and the one most tightly coupled to capital and geopolitics. That said, it is one that is still relatively easy to buy if you have the cash. Communities pool, rent, beg, or scavenge, but mostly we wait for someone else’s training run to finish and then download the weights and tweak them a bit.
Data: Text, images, video, code at terabyte-to-petabyte scale, mostly scraped from the open web. LAION and EleutherAI’s The Pile are the archetypal open-data efforts. Most text data likely starts from Common Crawl. Almost nobody publishes their final training mix — too much copyright exposure, too much competitive advantage.
Algorithms: Used to be the most-published ingredient. Foundational techniques — transformer, diffusion, MoE, attention variants, RLHF basics — were discussed in papers, often early. That has changed. As the field commercialised, the US frontier labs (OpenAI, Anthropic, Google) stopped publishing the specifics of what makes their latest models work, and we have no way of telling from outside how much of the algorithmic moat is novelty vs better execution of known ideas vs better data curation. The Chinese open-weights labs seem to be publishing more methodology than the closed Western ones now. Reading the paper, when there is one, is necessary but nowhere near sufficient to know what the hell is going on.
Training code and recipes: Knowing the algorithm and being able to run it at scale on real hardware are different things, because training big models is hard. A handful of labs (AI2 with OLMo, BigScience with BLOOM, EleutherAI with Pythia, increasingly DeepSeek and HuggingFace’s SmolLM) publish the full recipe — data mix, hyperparameters, scaling decisions, the lot. Most release only the weights, leaving the kitchen door closed. “Open weights” is not “open source”.
Trained weights: The visible deliverable. Llama, Qwen, DeepSeek, Mistral, Flux dev, SDXL: these are what many people mean when they say “open models”. A weight file is a frozen artefact; given enough compute and the right framework, anyone can run it, but reproducing or even understanding it from scratch is not at all easy.
Expertise: The scarcest ingredient and the one that does not transfer with a download. Knowing which loss curve to trust, which dataset deduplication is load-bearing, when a model is broken vs just undertrained, when to throw out a run and when to push through. Concentrated in a few labs, leaking slowly through papers, blog posts, fired employees, and the occasional indiscreet Discord.
Inference code: Once weights exist, the code that runs them is by now commodity. llama.cpp, vLLM, ollama, ComfyUI, DiffusionBee — fast, portable, hackable, often community-maintained, and frequently better than whatever the model’s authors shipped. A laptop runs a multibillion-parameter model with code that nobody at the original training lab wrote. This is the layer where DIY has most thoroughly won.
Agent harnesses: The software that wraps the model and lets it do things in the world, from simple question-answering to complex multi-step reasoning and tool use. Also actively researched in open-source.
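
The claim that a laptop can run a multibillion-parameter model comes down to simple arithmetic on weight storage. Here is a back-of-envelope sketch, assuming a hypothetical 7-billion-parameter model and counting only the weights; it ignores the KV cache, activations, and the per-block metadata that real quantization formats add, so treat the results as a lower bound rather than a spec.

```python
# Back-of-envelope estimate of weight storage only.
# Assumption: every parameter is stored at the same precision, and we
# ignore KV cache, activations, and quantization metadata overhead.

GIB = 1024 ** 3


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GiB at a given precision."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / GIB


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_memory_gib(n_params, bits):.1f} GiB")
```

On these assumptions the weights take roughly 13 GiB at 16 bits and a little over 3 GiB at 4 bits, which is the gap between "needs a datacentre card" and "fits in a laptop's RAM".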

1.1 Image generation

When models like DALL-E first appeared, they lived in a corporate walled garden. The 2022 release of Stable Diffusion opened the field — it was a potent, open-weights model that let anyone with a decent gaming PC generate high-quality images. Compute came from Stability AI; the architecture and training methodology were published as papers; the dataset came from LAION, a non-profit that had done the hard work of scraping and aligning image-text pairs at scale; inference clients sprang up from the community in short order. All four ingredients in public (although I suspect the training story and the data story might have been selectively curated).

And then we got the other thing that open-source communities excel in: Drama!

In 2024 a chunk of the original Stable Diffusion researchers walked out of Stability AI over disagreements about open-source commitments and commercialization strategy, regrouped as Black Forest Labs, and shipped the Flux family — a transformer-based model released as a “dev” version (open-weights, non-commercial) and a “pro” version (paid API). Their editing variant, FLUX.1 Kontext, was AFAICT the first open-weights edit model of professional capability.

The open/commercial friction is ongoing: corporate models (Flux pro, DALL-E 3, Midjourney) offer plug-and-play polish but lock advanced features behind APIs and are not very customizable; community-tuned ones (SDXL, the CivitAI menagerie) offer customization, autonomy, and the ability to coax out things the mainstream models’ guardrails won’t — idiosyncratic styles, violence, sexual content, nazis.

1.2 Language model pirates

When GPT-3 was kept under lock and key, a scrappy crew of volunteer researchers on a Discord server formed EleutherAI in 2020. Their explicit goal was to replicate the model and give it away. They bootstrapped their own massive dataset, The Pile, released GPT-Neo, GPT-J, and GPT-NeoX-20B in turn, and demonstrated that a distributed collective could take on the giants (at least, if a giant gave them some free compute).

Meta later started open-weights releases of more powerful models, beginning with LLaMA (initially leaked rather than released; Meta has since seemingly embraced the leak as strategy), shifting the open-weights frontier into the multi-billion-parameter range. AFAICT EleutherAI got the open-LLM ball rolling, Meta made it big, Mistral added a European actor — and then…

1.3 Chinese open-weights

In 2024, Chinese labs decided there was no particular reason to follow the script. Alibaba’s Qwen series, DeepSeek-V2/V3 and the R1 reasoning model, Moonshot’s Kimi, Zhipu’s GLM, 01.AI’s Yi — and a steady drip of smaller releases from Tencent (Hunyuan), ByteDance, MiniMax, StepFun, and others — arrived with permissive licences, fluent in English and Chinese, at scales and quality matching or beating closed Western models on many benchmarks, maybe.

Given US export controls on high-end GPUs, aggressive open-sourcing has been widely read as a baller geopolitical play as much as a research strategy — recruit downstream users, distribute the costs of fine-tuning and tooling, build legitimacy outside the export-control-and-API regime, commoditize the layer above whoever is selling the chips.

DeepSeek’s R1 in January 2025 was the first to upset the apple cart — an open-weights reasoning model competitive with OpenAI’s o1, released with a paper explaining the RL training recipe in unusual detail. It cratered Nvidia’s market cap for a week. Whether the architectural and training-cost claims fully replicate or not¹, the availability changed expectations. “Open” no longer means “a year behind”.

These labs differ in how much they reveal upstream of the weights. DeepSeek seems to publish more methodology and partial training detail than most; the rest release weights, a model card, and not much more. Training data: essentially never. Same conflation as everywhere else — but the closed-model players seemingly cannot use “we’re so far ahead it doesn’t matter” as a rebuttal any more.

There are roadblocks to western use. Chinese regulation requires these models to deflect on politically sensitive topics, so the weights arrive pre-aligned to Beijing. That deflection is, at least for now, vulnerable to abliteration, so it is relatively easy for the community to remove it — uncensored DeepSeek and Qwen variants are a minor cottage industry.
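
For intuition about why that deflection is removable, here is a toy sketch of the idea behind abliteration as it is commonly described in the community: estimate a single "deflection direction" as the difference between mean activations on prompts that trigger deflection and prompts that do not, then project that direction out. The 3-dimensional "activations" below are made-up numbers, and all function names are hypothetical. Real abliteration operates on a model's residual stream or weight matrices across layers, which this does not attempt.

```python
import math

# Toy illustration only: plain Python lists stand in for activation vectors.


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def deflection_direction(deflect_acts, comply_acts):
    """Unit vector along the difference of the two mean activations."""
    mu_d = mean_vector(deflect_acts)
    mu_c = mean_vector(comply_acts)
    return unit([d - c for d, c in zip(mu_d, mu_c)])


def ablate(activation, direction):
    """Remove the component of `activation` along unit vector `direction`."""
    scale = dot(activation, direction)
    return [a - scale * d for a, d in zip(activation, direction)]


# Made-up activations: the first coordinate separates the two groups.
deflect = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
comply = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]

r = deflection_direction(deflect, comply)
cleaned = ablate([1.5, 0.3, -0.4], r)
print(cleaned)  # the component along r has been removed
```

If the behaviour really is mediated by something close to a single direction, removing it is cheap, which would be consistent with how quickly uncensored variants appear.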

1.4 Sovereign and public-good initiatives

Buzzword: “Middle powers”.

A small-but-interesting cluster of state-funded open-weights efforts exists. Distinct from the US-commercial and Chinese-commercial models, these come from non-superpower governments that decided sovereign AI capacity was strategically worth doing publicly. Switzerland’s Apertus (ETH, EPFL, and the Swiss National Supercomputing Centre) publishes its whole training recipe (weights, training data, methods) and is built for the public good, with serious effort on Swiss German, Romansh, and other long-underserved languages. Singapore’s SEA-LION does something similar for 11 Southeast Asian languages, including Thai, Burmese, Lao, Khmer, Tamil, Tagalog, and others the global frontier models handle poorly. The UAE’s Falcon family fits the same pattern, and France’s Jean Zay supercomputer was behind BigScience’s BLOOM. These projects usually trail the commercial frontier but encode different priorities: multilingualism, regional language coverage, public-interest licensing.

1.5 Hacking on a shoestring

Buzzword: Jugaad, a kind of frugal, resourceful, get-it-done innovation that thrives in the absence of capital-rich environments by making do with what’s at hand, and being unafraid to make things that will fall apart relatively quickly.

Especially hot with respect to globalizing AI, breaking it free from geographically concentrated power. One poster child for Jugaad is Masakhane (“we build together”), a grassroots, pan-African collective building NLP models for African languages insufficiently easy/profitable to be served by the commercial labs.

1.6 Community projects

The Chinese open-weights wave buys us the ability to host some inference in our own geographic region. I’ve been costing out a 50-person Australian compute club that would run such models on our own hardware. Follow along at danmackinlay/SOV.

1.7 Fandoms

A reliable driver for weird, obsession-heavy projects is… nerds who like weird stuff. Fandoms are tried-and-tested distributed innovation communities, and even for AI this is a thing.

First data point: the My Little Pony fandom. When Friendship is Magic ended in 2019, a segment of the fanbase decided they weren’t done with it yet. This led to the Pony Preservation Project, which originated on 4chan’s /mlp/ board with a mission to meticulously collect, clean, and align voice data from every episode, building a high-quality dataset specifically for AI training. That dataset became a key component for 15.ai, a landmark text-to-speech platform that launched in 2020 and let anyone speak pony voices. That led to much fun and equally much drama. The result was an explosion of fan-made content, including fully AI-voiced animations, effectively continuing the show through sheer force of will and technical ingenuity.

Second data point: AI HUB. AI HUB became the go-to Discord community for the ethically murky world of cloning the voices of actual musicians and other public figures. The community was a hotbed for sharing and developing voice-cloning techniques themselves: pre-trained models of specific singers, techniques for separating vocals from tracks to make clean training data, training pipelines, sometimes even algorithms. It was a volunteer-run decentralised R&D and distribution network that lowered the barrier to entry for a controversial and technically demanding application of AI.

I use the past tense: it’s not quite as active as it was in 2024, having dispersed a little under the bullwhip of the copyright lawyers. Fan projects frequently test the edges of copyright law, and fans often have smaller budgets than copyright holders or large tech companies, so these projects either get mired in drama or keep a low profile.

Porn is probably in the same category, but I bet they have more money behind them.

1.8 Mandatory AI safety qualm

As with all DIY, this stuff is dangerous. Insofar as empowering people with more tools makes people more powerful, democratizing it puts it in the hands both of nice, disempowered people and of terrorists. Except these machines might themselves eventually autonomously behave nicely, or terroristically.

OTOH, there is also a danger in letting powerful models be in the hands of only a small number of geopolitical actors. YMMV

2 The Players

A motley crew of non-profits, academic-adjacent labs, grassroots collectives, and a few companies playing the open-source game.

2.1 Hugging Face

A central repository for much of open-source AI — models, datasets, and the transformers library that lets us actually use the stuff without pulling our hair out. AFAICT every other organization on this list, from EleutherAI to DeepSeek, hosts at least some models there.

2.2 CivitAI

CivitAI is a specialised marketplace for the diffusion-model fine-tuning community. It hosts the LoRAs, textual inversions, and full fine-tunes that specialise image generation on niche styles: anime, photorealism, particular painters, particular fictional characters, particular kinks. Visual browsing with prompt-and-output previews; simple download; an API that the local diffusion clients (DiffusionBee, Draw Things, ComfyUI, …) plug into. More permissive content policy than Hugging Face. Pony Diffusion V6 XL, an SDXL fine-tune trained on character art from booru-style databases with no RLHF passes applied, is a concrete example of the community making explicit choices against the alignment priorities of production models. The downstream ecosystem of hundreds of derivative fine-tunes is fertile with dark, surreal and thirsty image generation. See the project about page.
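
Part of why LoRAs circulate so freely is that they are small. A LoRA leaves the base weight matrix W alone and learns a low-rank update, so the merged weight is W + (alpha / r) · B · A, with B of shape d_out × r and A of shape r × d_in. The sketch below counts parameters for one hypothetical 4096 × 4096 layer and shows the merge on a tiny made-up example; the dimensions, rank, and helper names are illustrative assumptions, not taken from any particular model on CivitAI.

```python
# Minimal LoRA arithmetic in pure Python (no ML framework needed).


def lora_param_counts(d_in, d_out, rank):
    """Return (full fine-tune params, LoRA adapter params) for one matrix."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, adapter


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A as a new matrix."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical layer size and rank, chosen only for illustration.
full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(full, adapter, full // adapter)

# Tiny made-up merge: 2x2 base weight, rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # d_out x r
a = [[0.5, 0.5]]     # r x d_in
merged = merge_lora(w, b, a, alpha=2, rank=1)
print(merged)
```

For that one layer the adapter carries 128 times fewer parameters than a full fine-tune, which is the difference between sharing a file of megabytes and re-uploading the whole model.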

2.3 Allen Institute for AI (AI2)

Bankrolled by the late Microsoft co-founder Paul Allen. AI2 is a non-profit that tackles long-term research that corporations apparently won’t and universities seemingly can’t. They built foundational tools like AllenNLP for language research and the Semantic Scholar search engine. With their OLMo model they released not just the weights but the training data and code as well.

2.4 EleutherAI

A grassroots collective born on Discord around 2020 that replicated GPT-3-class models and gave them away. They scraped the web to assemble The Pile dataset when, AFAICT, nobody else was doing it openly at that scale. They continue to produce interesting research infrastructure and weird models, and publish interesting results.

2.5 LAION

A German non-profit of data-hoarders-for-the-people, with a mission to create and release the massive datasets that AI models need to learn from. LAION-5B (5.8 billion image-text pairs) is their best-known release. LAION has leveled up the image gen data sets. Without their work, there would be no Stable Diffusion etc. They are upstream of much of the open image-generation pipeline.

2.6 BigScience

Coordinated by Hugging Face, it brought over 1,000 researchers together in a global jam session to build BLOOM, a massive multilingual language model. It aimed to be a proof of concept for a different way of doing research — open, global, collaborative — and a counter-narrative to the idea that only secretive corporate labs could build state-of-the-art models. AFAICT nobody has repeated it at the same scale since.

2.7 Chinese open-weights labs

As of mid-2026 the roster included DeepSeek on coding, math, and reasoning. Qwen (Alibaba) does broad-purpose multilingual and is apparently the most-downloaded open-weights family on Hugging Face by some margin. Kimi (Moonshot) covers long-context and agentic tasks. GLM (Zhipu, out of Tsinghua’s KEG lab) covers bilingual chat. Yi (01.AI, Kai-Fu Lee’s outfit) does efficient general-purpose models. After that comes a long tail of smaller releases from Tencent, ByteDance, MiniMax, StepFun, and others. The pecking order shifts every quarter.

These various labs have made themselves the protagonists of open-source AI drama. In 2026, plot beats were played by Chinese commercial labs with strategic reasons to give weights away. Interesting incentive, market, and geopolitical structures are in play: export-control geopolitics and market structure (no domestic OpenAI-equivalent to undercut). Releases are mostly weights-plus-model-card; DeepSeek seems to be the outlier in publishing methodology; training data is essentially never released by anyone.

2.8 DAIR Institute

Founded by Dr. Timnit Gebru after her high-profile exit from Google. An independent research institute. “We publish interdisciplinary work uncovering and mitigating the harms of current AI systems, and research, tools and frameworks for the technological future we should build instead.” It aspires to be a ‘conscience’ of the scene, in the sense of speaking up about present harms.

2.9 Agora

A hyper-grassroots collective of engineers and creators organised on Discord. It focuses on the bleeding-edge hip things (multi-agent systems, real-time learning, augmenting AI reasoning) and is radically open in the share-before-it-works mold. There are surely other projects like Agora; it has the distinction of being the one I encountered.

2.10 Oumi

Oumi is a Python library/platform with conventions for the data-prep and training pipelines needed to DIY models from scratch.

Democratizing training on consumer hardware

See distributed NN training.

3 Incoming

Building a Solidarity Ecosystem for AI
Current AI – Open Source AI Gap Map
Connor Leahy on EleutherAI describes many scenius-type dynamics making these early teams go.
Stable Diffusion is a really big deal — Willison’s 2022 take; interesting now as a prediction document.
Porean Stablediffusion drama roundup
The D/acc 2035 Scenario Essay – AI Pathways
SCHEME - Stories from the Strange Frontier really want to document this all:

Our goal is to platform voices and perspectives that would otherwise get lost in the ether, slop or drudge of AI discourse.

We are looking for an open-minded orientation to the future, in all its strange complexity.
Kimi K2: Open Agentic Intelligence is an open-source agentic AI model

Kimi K2 is our latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. But it goes further — meticulously optimized for agentic tasks, Kimi K2 does not just answer; it acts.
I sensed anxiety and frustration at NeurIPS‘24 – Kyunghyun Cho
The E/Jugaad Manifesto - by Techno-Dharma
Effective Jugaad: An Ideology for Navigating Complexity and Uncertainty in the 21st Century | mostwrong.github.io
How to train your own ChatGPT Alpaca style, part one
LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation | LMSYS Org
cf Stability, RWKV
Sky-T1: Train your own O1 preview model within $450

4 References

McMahan, Moore, Ramage, et al. 2023. “Communication-Efficient Learning of Deep Networks from Decentralized Data.”

Raschka. 2026. Build a Reasoning Model (from Scratch).

Seger, Ovadya, Siddarth, et al. 2023. “Democratising AI: Multiple Meanings, Goals, and Methods.” In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society. AIES ’23.

Footnotes

¹ AFAICT some did, some did not.