Sovereign compute for small collectives: technical implementation guide
Hardware, software, de-censoring, and the path from cloud audition to owned infrastructure
2026-03-22 — 2026-05-21
Wherein the Specifications of a Shared Inference Server Are Laid Out, Its Legal Ownership Is Settled, and the Excision of Political Guardrails From Open-Weight Models Is Treated as Routine.
This is the technical companion to my sovereign LLM post, which makes the institutional and geopolitical case for small collectives owning their own AI inference hardware. This post lays out the concrete implementation pathway: what to buy, what software to run, how to remove CCP guardrails from open-weight models, and what performance to expect.
There was a lot of research to make this work, so it is heavily AI-assisted. Be wary of hallucinations.
I’ll walk through a worked example targeting Alibaba’s Qwen3–235B-A22B served on an NVIDIA DGX Station GB300, since that’s the sweet spot I identified in the companion post. But most of this generalises to other models and hardware.
Sounds interesting? Get in touch.
1 The box
1.1 Specifications
| Component | Specification |
|---|---|
| GPU | 1× NVIDIA GB300 Grace Blackwell Ultra Superchip |
| GPU memory (HBM3e) | 252 GB @ 7.1 TB/s bandwidth |
| CPU memory (LPDDR5X) | 496 GB @ 396 GB/s bandwidth |
| Total unified memory | 784 GB |
| AI compute | ~20 PFLOPs (FP4) |
| CPU-GPU interconnect | 900 GB/s NVLink-C2C |
| Networking | 800 Gb/s CX8 SuperNIC |
| Power | 1600W TDP |
| Form factor | Desktop tower |
1.2 Buying one in Australia
The DGX Station ships through NVIDIA’s partner network. Australian vendors include XENON Systems (NVIDIA Elite Partner for ANZ, handles the full DGX line and enterprise support), Dell Australia, MSI (its XpertStation WS300 is the DGX Station reference design, listed at $85,000 USD), and MMT, Australia’s NVIDIA distributor for configure-to-order options.
Expect $135,000–$195,000 AUD landed depending on configuration, plus GST, shipping, and a UPS — we do not want 1600W of AI compute on raw mains power.
1.3 Power and cooling
1600W sustained is about the draw of a large space heater. Standard Australian outlets are 10 A at 240 V (2400 W), fine on a dedicated circuit; an electrician installing a 20 A circuit (4800 W) is a routine job if we want headroom. We’ll want decent ventilation if the machine lives in a small room. At Australian retail rates (~$0.30/kWh), running 24/7 costs about ~$350 AUD/month.
2 What the box rations
Everything below is a way of spending the resources of one box, so it helps to name them first. The DGX Station has three scarce quantities, and almost every design choice trades one against another.
Fast memory — 252 GB of HBM3e. It holds the model weights and the live conversations. Weights are a fixed cost paid once; whatever is left over after the weights sets how many people can talk to the machine at the same time.
Memory bandwidth — 7.1 TB/s out of that HBM. Generating a token means reading weights out of memory, so bandwidth, not compute, decides how fast the machine writes.
Compute — about 20 PFLOPs at FP4. This mostly bites while ingesting a prompt, where the work is one big matrix multiply rather than a memory sweep.
Mixture-of-experts shrinks the compute and bandwidth bill per token. Quantization shrinks the weights so they fit in fast memory. Sparse attention shrinks the conversations so more of them fit. Each technique below stretches one of those three.
3 Models that fit
3.1 Mixture of experts
The baseline I cost throughout is Alibaba’s Qwen3–235B-A22B, a mixture-of-experts model: 128 expert feed-forward networks per layer, of which 8 fire for any given token. It carries 235B parameters in total but only puts ~22B to work per token. Total parameters set the memory footprint; active parameters set the compute and bandwidth cost. That split is what makes self-hosting tractable — we pay 235B to store the model and 22B to run it, and the result is competitive with dense models of 70–100B parameters. The attention is grouped-query: 64 query heads, 4 KV heads, 128 head dimension, 94 transformer layers. Those last numbers come back when we count KV cache.
3.2 Quantization
The weights have to fit in 252 GB of HBM, with plenty left for conversations. Quantization sets how much room is left over.
| Quantization | Model size in memory | Fits in 252GB HBM? | HBM left for KV cache | Quality impact |
|---|---|---|---|---|
| BF16 (full) | ~470 GB | No | N/A | Baseline |
| FP8 | ~235 GB | Barely | ~17 GB (unusable) | Minimal |
| AWQ 4-bit | ~124 GB | Comfortable | ~128 GB | Modest |
| GPTQ Int4 | ~124 GB | Comfortable | ~128 GB | Similar to AWQ |
AWQ 4-bit is the pick. It puts the model in ~124 GB and leaves ~128 GB free, and that free space is the conversation budget — the thing that decides how many people the box serves at once. FP8 looks better on paper but leaves ~17 GB, which forces constant offload to the slow tier and makes multi-user serving fall over. MoE models tolerate 4-bit well, since only 8 of 128 experts touch any token. Pre-quantized weights are on HuggingFace (e.g. QuantTrio/Qwen3–235B-A22B-Instruct-2507-AWQ), or we can quantize our own with AutoAWQ.
3.3 Sparse attention
A conversation’s context lives in the KV cache, and in a conventional transformer that cache grows linearly with length — every token of history costs memory, in the same scarce HBM as the weights, for as long as the conversation lives. Sparse attention breaks that link. DeepSeek’s V4 (April 2026, weights) uses an attention design DeepSeek calls DSA: it compresses the context before storing it, so a full million-token session costs roughly 9.6 GB of KV cache instead of the tens of GB a conventional model of the same class needs — about a 90% cut, per the DeepSeek tech report. It trims attention compute too, to roughly 27% of the previous generation’s at 1M context. For a box whose tightest constraint is long conversations, this relaxes the binding limit more than anything else; the concurrency maths below shows by how much.
4 Concurrency
Fast memory holds weights plus conversations:
fast memory = weights (paid once) + KV cache (paid per user, per token)
concurrent users ≈ (memory left after weights) ÷ (KV cost of one conversation)
Qwen3–235B’s grouped-query attention (4 KV heads, 128 head dimension, 94 layers) costs, per token of context:
\[\text{KV per token} = 2 \times n_\text{kv\_heads} \times d_\text{head} \times n_\text{layers} \times \text{bytes} = 2 \times 4 \times 128 \times 94 \times 2 \approx 192\text{ KB (FP16)}\]
Quantizing the KV cache to FP8 (supported by vLLM and SGLang with little quality loss) halves that to ~96 KB/token. With ~128 GB free after the 4-bit weights, the box holds about 1.4M tokens of context at once. Divide by session length to turn tokens into people:
| Session length | Concurrent users |
|---|---|
| 8K tokens | ~175 |
| 32K tokens | ~44 |
| 128K tokens | ~11 |
| 1M tokens | ~1.4 |
Concurrency falls off a cliff as conversations lengthen. A 50-person collective with 5–15 active members sits comfortably at 8K–32K and falls over at 1M, where one member running a single million-token agentic session fills the box. The comfortable numbers assume short conversations — but long-context agentic work, the thing the collective most wants, drives sessions toward the bottom row, exactly where the box is tightest.
Sparse attention is the way out of that corner. On DeepSeek V4-Flash a full 1M-token session costs ~9.6 GB of KV rather than the tens of GB above, and its ~90 GB of 4-bit weights leave ~160 GB for cache. So the bottom row moves from ~1.4 users to ~16, and at 128K and below the per-session cost is small enough that KV stops being the binding constraint at all:
conventional model V4-Flash + DSA
@ 1M-token session : ≈ 1.4 users → ≈ 16 users
@ 128K-token session: ≈ 11 users → KV no longer the limit
The box does not become infinite. Sixteen million-token sessions fitting in memory is not the same as the box generating tokens fast enough for sixteen heavy users at once — DSA’s compute cut helps, but the binding constraint moves from memory to throughput rather than vanishing. And these are illustrative figures: they reuse the conventional numbers above and DeepSeek’s published 1M figure rather than measured intermediate points, which is what phase 2 of the audition is for. Treat them as order-of-magnitude.
4.1 The warm tier
The 496 GB of LPDDR5X CPU memory works as a warm tier for KV cache. At 396 GB/s it’s roughly 18× slower than HBM — fine for sessions that have been idle a few seconds. That’s room for another ~2.6M tokens (FP8 KV) of sleeping conversations the box can resume without re-prefilling them. Both vLLM via LMCache and SGLang (native) support this tiering.
5 Throughput
Decode — generating one token at a time — is bandwidth-bound: each token reads the active expert weights out of HBM.
\[\text{active expert weights at 4-bit} \approx 22\text{B} \times 0.5\text{ bytes} = 11\text{ GB}\]
\[\text{theoretical max} \approx \frac{7.1\text{ TB/s}}{11\text{ GB}} \approx 645\text{ tok/s}\]
After attention, KV traffic, and memory-controller overhead:
| Scenario | Estimated throughput | Notes |
|---|---|---|
| Single user, single stream | 200–400 tok/s | Excellent interactive experience |
| 8 concurrent users (batched) | 100–200 tok/s per user | Continuous batching amortizes overhead |
| 32 concurrent users | 30–60 tok/s per user | Still responsive for agentic use |
| Prefill (prompt processing) | 2,000–5,000 tok/s | Compute-bound; GB300 excels here |
For agentic workloads these read better than they look. Agentic traffic is bursty: a short tool call (50–200 tokens), a wait for the tool to run, then the result arrives as a fresh prompt. Much of the time goes to prefill (fast) rather than sustained decode (slow), and the gaps let other users’ requests interleave.
6 Serving software
6.1 Inference engine
vLLM and SGLang are the two mature open-source engines; both serve Qwen3–235B with MoE-aware parallelism. vLLM is the default. The features that matter here:
- Continuous batching with PagedAttention — multiplexes requests without wasting GPU memory
- Chunked prefill — stops long prompts blocking other users’ decode steps
- Automatic prefix caching — a shared system prompt (likely in a collective) is computed once and reused
- KV offloading via LMCache — the GPU → CPU → disk warm tier above
- Expert parallelism for MoE
- OpenAI-compatible API — a drop-in replacement for commercial endpoints
A launch command for this configuration:
SGLang earns its place if multi-turn conversation dominates: its radix-tree KV cache discovers shared prefixes across turns automatically, for ~10% more throughput than vLLM on that workload with no manual tuning.
For DeepSeek V4 specifically, pin vLLM to a known-good commit, not a release tag. There are multiple reports of V4 working and then breaking across vLLM commits (NVIDIA developer forum, May 2026), which sharpens the never-run-:latest discipline rather than adding a new rule.
6.2 Orchestration
For a single box serving a small collective, vLLM or SGLang alone may be enough. If we want production-grade routing, NVIDIA Dynamo adds KV-cache-aware request routing (send a request to the GPU that already holds its context), dynamic batching, and prefill/decode disaggregation, and it’s built for the DGX stack. llm-d is the Kubernetes-native alternative, with SLA-aware load balancing, hierarchical KV offload, and scale-to-zero — more relevant once we run several machines or expose the service over a network.
6.3 NIM and portability
The path of least resistance is NVIDIA NIM, which packs the whole stack — engine, caching, API, monitoring — into one container behind an OpenAI-compatible API. Anything that speaks the OpenAI API (LangChain, AutoGen, Claude Code’s API mode) works unchanged. For a collective without a dedicated sysadmin that’s probably the right default; the trade-off is less flexibility than a hand-rolled vLLM stack.
The reason to standardise on NIM goes beyond convenience. NIM and NGC containers run identically on DGX Cloud, the major hyperscalers, and our own DGX Station. Build the workflows against that layer and we can develop on rented cloud GPUs, deploy to our own hardware with no code changes, fail back to cloud if the box dies or demand spikes, and migrate between cloud providers without rewriting anything. This is one of the few places vendor lock-in works in our favour: NVIDIA has every incentive to make its software run everywhere its hardware does.
7 Removing the guardrails
The companion post makes the case at a high level; here are the mechanics. Chinese open-weight models ship with CCP-aligned guardrails — hard refusals on Taiwan, Tiananmen, Xinjiang and the rest, plus softer framing biases. Removing them is a solved problem with two tiers of effort.
7.1 Abliteration
Tools: Heretic (fully automatic) or llm-abliteration/DECCP (sharded processing for large models).
Abliteration computes the model’s “refusal direction” in its residual stream — by contrasting activations on harmful versus harmless prompts — then orthogonalises the relevant weight matrices against it. This is linear algebra on the static weights, not training. Rent an 8×H100 node (RunPod or Lambda Labs, ~$20–25 USD/hr), load the model sharded, run the pipeline, save the weights: 2–4 hours, ~$50–100 USD. Projected abliteration only removes the mechanistically specific refusal component and preserves more general helpfulness — use it over the vanilla variant.
Refusal rates collapse to near zero. What it doesn’t fix is soft steering: the model will now answer about Taiwan but may still frame the answer with CCP-aligned assumptions. Evaluate before and after against a test set of sensitive prompts in both English and Chinese; the Shisa.AI Qwen2 censorship analysis is a good taxonomy of affected topics.
7.2 Preference fine-tuning
If abliteration alone isn’t enough — particularly for Chinese-language use, or for neutral framing on geopolitics — go to QLoRA with Direct Preference Optimization. Build 1,000–5,000 preference pairs where the preferred answer is neutral and factual and the dispreferred one is CCP-aligned, across Taiwan, Tiananmen, Xinjiang, Tibet, the South China Sea, Hong Kong, party history, and economic-data reliability. Community datasets exist for smaller Qwen models and adapt to the 235B variant.
Practical configuration:
- Framework: Unsloth (Qwen3 MoE fine-tuning, up to 12× speedup)
- LoRA rank 16–64 on q, k, v, o, gate, up, down projections
- Don’t fine-tune the MoE router — Unsloth disables this by default; destabilising expert routing causes cascading quality loss
- BF16 with LoRA is recommended over QLoRA for Qwen3 MoE (4-bit training hurts MoE architectures more)
- 4–8× H100 80GB with DeepSpeed ZeRO-3, ~20–50 hours
| Item | Cost (AUD) |
|---|---|
| GPU rental (200–400 H100-hours @ $3–4/hr) | $800–$1,600 |
| Dataset creation (DIY or contracted) | $500–$2,000 |
| Evaluation and iteration | $200–$400 |
| Total | $1,500–$4,000 |
Merge the LoRA adapters back into the base weights with Unsloth, re-quantize to AWQ 4-bit, and the merged model serves at full speed — zero adapter overhead at inference.
A full fine-tune ($5,000–$30,000+) only makes sense if we’re also doing domain adaptation (Australian legal, medical, government) and want to fold de-censoring into a larger run. Overkill for guardrail removal alone.
7.3 Newer models lag the tooling
De-censoring maturity trails the architecture. As of May 2026 the Heretic issue for V4-Flash (#310) is open and unresolved; a community-abliterated V4-Flash exists only as a GGUF for llama.cpp, so a vLLM-servable FP8 de-censored build is still work, not a download. The position itself is unchanged whichever model we pick: open weights settle whether we’re allowed to run the model, not what guardrails were baked in at the alignment stage. V4 just changes which model sits under the knife.
8 Other candidates
Qwen3–235B is the current sweet spot, but it’s not the only model that fits, and the field moves fast.
DeepSeek V4 comes in two open-weight variants. V4-Flash (284B total / 13B active) fits the box and enters the bake-off as a same-compute-class candidate, with stronger published agentic numbers, native vLLM deepseek_v4 support, and the DSA sparse attention that does so much for concurrency. Not an automatic swap for Qwen, but a serious third option. V4-Pro (1.6T / 49B active) does not fit — ~648 GB even at Q6 — so its role is the audition question: here’s the best open model money can rent, is the gap to the one we own acceptable?
Kimi K2 (1T total / 32B active) is agentic-tuned and competitive with Claude Sonnet 4 and GPT-4o on tool-calling and coding. It only fits with aggressive CPU offload (~500 GB at 4-bit, into the 784 GB unified memory), which drops single-user decode to 30–80 tok/s — usable for latency-tolerant agentic work, not ideal — and its de-censoring tooling lags.
Meta’s Llama 4 carries Western rather than CCP guardrails, so it skips the de-censoring step entirely. The largest open Llamas (including the 400B Maverick MoE, 17B active) don’t quite match Qwen3–235B on benchmarks as of early 2026, but that’s the lower-friction option if de-censoring is more hassle than we want.
A hybrid is also on the table: a small fast model (DeepSeek-R1-Distill-Qwen-32B or similar) handles routine queries, the big model is reserved for hard reasoning, and aggregate throughput rises at no extra hardware cost.
9 When smaller hardware makes sense
The DGX Station above is the sweet spot for a 25–50 person collective with serious agentic and reasoning workloads. Several adjacent shapes have different answers, and the model picks change at each tier.
9.1 The small collective (10–15 people)
A $160k box does not pay back at 10 members. The lower tier is a single H100 80 GB on a workstation chassis — Lambda Tensorbook, MSI WS-class, or a custom build with one Hopper PCIe card. Hardware spend lands at roughly $30k–$45k AUD all-in versus $135k–$195k for the DGX.
Models that fit comfortably at AWQ 4-bit on 80 GB:
| Model | ~Weights | Indication |
|---|---|---|
| Qwen3.5-35B-A3B | ~18 GB | MoE daily driver, 3B active. Quality close enough to Qwen3-235B for most everyday use; throughput much higher because only 3B fire per token. |
| Hermes-4.3-36B | ~20 GB | Agentic-tuned Nous release, December 2025. The right pick if the workload is tool-using rather than pure reasoning. |
| DeepSeek-R1-Distill-Qwen-32B | ~19 GB | Math-and-reasoning distill. Single user can run alongside one of the above on the same H100. |
Concurrency budget on an 80 GB H100: about 60 GB of KV-cache headroom, so roughly 12 users at 32K context or ~3 users at 128K — below the DGX figures, but a useful sanity check for a small group. The proposition is roughly “near-frontier quality at 25% of the hardware cost” — appropriate where sovereign compute is a hedge rather than a primary work tool.
9.2 The consumer-GPU pilot (5–10 people, or proof of concept)
Below the H100 tier, a single RTX 4090 24 GB or 5090 32 GB in a workstation is the entry point. Hardware spend ~$5k–$8k AUD all-in.
At 24 GB the practical model list shrinks to 14B-class at 4-bit:
| Model | ~Weights | Indication |
|---|---|---|
| Hermes-4-14B | ~8 GB | Agentic, function-calling. The smaller-tier baseline. |
| DeepSeek-R1-0528-Qwen3-8B | ~5 GB | Math champion of its weight class — AIME-2024 86%, matches Qwen3-235B-thinking on math benchmarks. |
| Phi-4-Reasoning-Plus-14B | ~8 GB | Different reasoning style; useful alongside the DeepSeek distill. |
The use case is a pilot — small group, exploratory workload, willingness to live with smaller models — or a permanent setup for groups whose work is dominated by short-context interactive use. Concurrency falls hard at 24 GB (~3 users at 32K), so this tier rewards short sessions and a retrieval layer over brute-force long context.
9.3 Fast-path / slow-path hybrid
The hybrid pattern mentioned in Other candidates is more useful than its placement suggests. The shape: run two models in parallel.
- Fast path on a small GPU: Qwen3.5-35B-A3B or DeepSeek-R1-0528-Qwen3-8B for routine chat, code completion, simple tool calls. Sub-second latency, low cost per query.
- Slow path on the big GPU: Qwen3-235B or V4-Flash for hard reasoning, long-context agentic, multi-step research.
A router — a small classifier model, or an explicit member choice via /model in the harness — sends each request to the appropriate backend. For agentic workloads where most steps are simple but a few need frontier-class reasoning, this is materially cheaper than running everything on the big model.
The pattern lets a collective grow from a single small box to a heterogeneous fleet without throwing away earlier investment — the H100 station that pilots phase 1 becomes the fast path in phase 3.
9.4 Per-member workstations as fallback
A natural shape for resilient infrastructure: each member’s own machine runs a small model as fallback when the colo endpoint is unavailable. Their harness (pi, Hermes, Aider) auto-fails over from the collective backend to a local Ollama or Osaurus endpoint. Smaller and slower, but it works on the day the cables are cut.
For Mac-equipped members, running LLMs locally on a Mac covers the per-member setup. The model picks overlap with this list — Hermes-4-14B and DeepSeek-R1-0528-Qwen3-8B are the obvious shared-tier choices.
9.5 When small is wrong
Three workloads make a single H100 painful, even with the right model picks:
- Long-context agentic flows (>128K context routinely). KV-cache pressure builds fast on 80 GB. Either disaggregate (small for short-context, big for long) or skip the H100 tier.
- Many simultaneous users (>15 concurrent active sessions). The DGX’s 252 GB HBM is the difference, and there is no cheap substitute.
- Frontier-quality coding tasks. Qwen3-35B does not match Qwen3-235B on hard SWE-bench-style problems; a small box does not paper over that.
If any of these dominates the workload, the small-hardware path is a stepping stone to the DGX tier, not the destination.
10 Auditioning on rented hardware
Test the full stack on rented compute before committing to a box.
Phase 1 — quick test (1–2 days, ~$100–$200 AUD). Rent a single H100 80GB on RunPod ($2–3 USD/hr), deploy the smaller Qwen3–30B-A3B with vLLM, and check our actual workloads — agentic tool-calling, document processing, code — against the API.
Phase 2 — full-scale test (1–2 weeks, ~$500–$1,000 AUD). Rent 4–8× H100s, deploy Qwen3–235B at AWQ 4-bit, run abliteration on it, then simulate 5–15 concurrent users with realistic agentic load and measure throughput, latency, and KV pressure. This is where the order-of-magnitude figures above get replaced with numbers for our own workload.
Phase 3 — DGX Cloud validation (optional, ~$1,000–$2,000 AUD). Run the DGX Cloud / NIM stack via a hyperscaler marketplace so the migration to physical hardware is seamless.
Phase 4 — buy and deploy. Order through an Australian vendor (lead times are months, not weeks), finalise the de-censored weights on cloud while waiting, then on arrival transfer weights, stand up vLLM or NIM, configure access control, monitor for a week, and open it to the collective.
11 What it costs
| Item | One-time cost (AUD) | Ongoing monthly (AUD) |
|---|---|---|
| DGX Station GB300 | $135,000–$195,000 | — |
| UPS + electrical work | $2,000–$5,000 | — |
| De-censoring (abliteration + DPO) | $1,500–$4,000 | — |
| Cloud audition (phases 1–3) | $800–$3,200 | — |
| Total one-time | ~$140,000–$207,000 | — |
| Electricity (1600W @ 24/7) | — | ~$350 |
| Hardware amortization (3yr) | — | ~$4,500 |
| Internet (business NBN or colo, see risks) | — | $150–$1,000 |
| Total monthly | — | ~$5,000–$5,900 |
| Per member/month (50 members) | — | ~$100–$118 |
Amortised over three years, the one-time cost adds roughly $55–$75/month per member in a 50-person collective, for a total cost of ownership of $155–$193/month per member depending mostly on whether we host at home or in a colo. That’s competitive with commercial API access for serious users — with no rate limits, no per-query metering, and full data sovereignty.
12 Risks
12.1 Hardware failure
One box is a single point of failure. NVIDIA enterprise support through partners like XENON buys next-business-day replacement for $10,000–$20,000/year; failing that, a collective on NIM can fail back to cloud inference seamlessly during downtime. In a serious geopolitical crisis the supply chain might be disrupted indefinitely and support might not deliver, so a spare is worth buying.
12.2 Model obsolescence
A frontier-class model today won’t be in 18 months. The hardware doesn’t share that fate — it runs whatever future models fit in its memory, and 784 GB of unified memory is generous enough to stay relevant across several generations of open weights.
12.3 Network
Australian residential broadband — older HFC and FTTN NBN especially — isn’t reliable enough for a production service: multiple brief outages a day, asymmetric upload (often 20–40 Mbps). Token generation needs only a few KB/s per user, so the problem is connection stability, not throughput. In ascending order of cost and reliability:
- Residential NBN + 5G failover (~$150/month): a dual-WAN router rides out most brief outages. Good enough for a tolerant collective.
- Business-grade NBN (ABB Business, Superloop Business, ~$150–$250/month): better SLA, static IP, priority faults, still on shared infrastructure.
- Quarter-rack colocation (~$500–$1,000/month at Equinix SY1–SY5 or NEXTDC S2): redundant fibre, generator backup, physical security. Loses the under-a-desk feel, gains dependable uptime.
The right choice depends on how many members are remote versus local, and how tolerant the collective is of downtime mid-agentic-workflow.
12.4 Operations
Someone has to administer this. NIM keeps it at “run a Docker container”, updates pull from NGC, and access control is a standard reverse proxy (nginx + OAuth2). Weekend-sysadmin complexity, not a full-time ops team.
12.5 Legal and regulatory
Running modified models for a collective may carry obligations as Australian AI regulation develops. Stay informed, take part in consultation, and document a governance framework. A legal entity (next section) is what holds the asset and caps liability.
13 Legal structure
A compute collective needs a legal entity that can own a $160k asset, collect monthly contributions from members, enter contracts (hosting, internet, support), and limit personal liability. Australian law offers several options. None is perfect; here’s how they compare for this specific use case.
13.1 Option 1: Cooperative under the Co-operatives National Law
A cooperative is purpose-built for exactly this kind of thing: a group of people who jointly own and operate infrastructure for their mutual benefit. Under the Co-operatives National Law (CNL), now harmonised across most states and territories, a cooperative:
- Requires a minimum of 5 active members (our 25–50 is well above this)
- Operates on one-member-one-vote, regardless of contribution level
- Can hold assets, enter contracts, and collect member contributions
- Can issue shares (members buy in) and pay limited returns on them
- Requires members to be “active” — i.e. actually using the co-op’s services, which is exactly what we want
- Cannot distribute profits to members beyond the limited share return, but can reinvest in better hardware, more capacity, etc.
The Co-op Federation publishes a comprehensive manual on formation and governance. Registration is through the state registrar (e.g. NSW Fair Trading).
Pros: Philosophically the best fit — the legal form matches the actual relationship (members own infrastructure they use collectively). National law means consistent rules across states. Members have clear rights.
Cons: More regulatory overhead than an incorporated association. Requires a formal formation meeting, a disclosure statement, and rules that comply with the CNL’s model rules. Annual reporting to the state registrar. If the collective is small and informal, this may feel heavy.
Estimated setup cost: $1,000–$3,000 (registration fees + legal advice on rules).
13.2 Option 2: Incorporated association
An incorporated association is the simplest and cheapest legal structure for a small not-for-profit group in Australia. Registration is at the state or territory level (~$57/year in NSW, similar in other states).
- Can hold assets, enter contracts, sue and be sued
- Members have limited liability (capped at membership fee)
- Cannot distribute profits or assets to members
- Restricted to operating primarily in the home state (interstate operations may require ASIC registration as a Registered Australian Body)
- NSW imposes a $2M gross revenue/assets threshold for registration as an association; our asset value is well under this
Pros: Cheapest, simplest to set up. Low annual compliance. Familiar form — thousands of community groups use this. Model constitutions available from state regulators.
Cons: No share structure, so the “buy-in” mechanism is less natural — member contributions would be fees, not equity. Single-state restriction could matter if the collective spans Sydney and Melbourne. The form is designed for community groups and sporting clubs, not infrastructure-owning collectives; some state registrars may query whether a compute collective fits their intended scope.
Estimated setup cost: $200–$500.
13.3 Option 3: Company limited by guarantee (CLG)
A company limited by guarantee is a federal structure registered with ASIC. Members guarantee a small amount (often $10–$100) in the event of winding up, rather than holding shares.
- Can operate nationally without interstate registration issues
- Higher annual fees (~$1,267/year to ASIC vs ~$57 for an association)
- Subject to more rigorous ASIC reporting requirements
- Can apply for ACNC registration as a charity if the collective has a genuine charitable purpose
Pros: National scope. More credible structure if the collective grows or seeks grants. Clear governance framework under the Corporations Act.
Cons: Overkill for a 50-person collective. Higher compliance cost. The guarantee structure doesn’t naturally map to “members buy shares in shared infrastructure.”
Estimated setup cost: $1,500–$4,000 (ASIC fees + legal advice on constitution).
13.4 What about ACNC registration and DGR status?
Probably not. ACNC registration requires a charitable purpose, and “a group of professionals sharing AI compute” isn’t obviously charitable. If the collective had an explicit community education or digital inclusion mission — e.g. providing AI access to under-resourced community organisations — ACNC registration might be possible, but it would constrain the collective’s operations significantly. DGR endorsement (tax-deductible donations) is even harder to obtain. Let’s not plan around it.
13.5 Recommendation
For a compute collective of 25–50 people in a single city: start as an incorporated association for speed and simplicity, with an explicit plan to convert to a cooperative under the CNL if the model proves viable and the collective wants a more natural ownership structure. The conversion process is well-documented and doesn’t require dissolving the original entity.
If the collective spans multiple states from the start, or if members want a share-based buy-in from day one, go straight to a cooperative.
Either way, let’s get a solicitor to review the constitution/rules before committing $160k of members’ money to a hardware purchase. Budget $2,000–$5,000 for initial legal advice — cheap insurance on a six-figure asset.
14 Next steps
If we’re interested in forming or joining a compute collective, the immediate actions are:
- Gauge interest: find 10–20 people who would commit to ~$150/month for sovereign AI access
- Run a cloud pilot: $200 buys a weekend of testing on rented GPUs
- Choose a legal structure: start as an incorporated association, convert to a cooperative if the model works (see legal structure section above)
- Order hardware: current lead times for DGX Station are “months, not weeks” — starting the order process early is important
- Document everything: the playbook we write becomes the replication kit for the next collective
The companion post makes the case for why. This post, I hope, makes the case for how.
