Sovereign compute for small collectives: a technical implementation guide

Hardware, software, de-censoring, and the path from cloud audition to owned infrastructure

2026-03-22

Wherein a DGX Station’s 1600W appetite is accounted for, with an Australian tariff yielding ~$350 monthly, and a cloud audition on H100s is prescribed ere purchase.

AI
cooperation
faster pussycat
institutions
straya
wonk
engineering

This is the technical companion to my sovereign LLM post, which makes the institutional and geopolitical case for small collectives owning their own AI inference hardware. This post lays out the concrete implementation pathway: what to buy, what software to run, how to remove CCP guardrails from open-weight models, and what performance to expect.

This took a lot of research, much of it heavily AI-assisted, so be wary of hallucinations.

I’ll walk through a worked example targeting Alibaba’s Qwen3-235B-A22B served on an NVIDIA DGX Station GB300, since that’s the sweet spot I identified in the companion post. But most of this generalizes to other models and hardware.

1 Hardware: DGX Station GB300

1.1 Specifications

| Component | Specification |
|---|---|
| GPU | NVIDIA GB300 Grace Blackwell Ultra Superchip |
| GPU memory (HBM3e) | 252 GB @ 7.1 TB/s bandwidth |
| CPU memory (LPDDR5X) | 496 GB @ 396 GB/s bandwidth |
| Total unified memory | 784 GB |
| AI compute | ~20 PFLOPs (FP4) |
| CPU-GPU interconnect | 900 GB/s NVLink-C2C |
| Networking | 800 Gb/s CX8 SuperNIC |
| Power | 1600 W TDP |
| Form factor | Desktop tower |

1.2 Australian purchasing options

The DGX Station ships through NVIDIA’s partner network. Australian vendors include:

  • XENON Systems — NVIDIA Elite Partner for ANZ, handles full DGX product line including enterprise support contracts
  • Dell Australia — official NVIDIA Partner Network reseller
  • MSI — its XpertStation WS300 is the DGX Station reference design, listed at $85,000 USD
  • MMT — Australia’s NVIDIA distributor, handy for configure-to-order options

Expect $135,000–$195,000 AUD landed, depending on configuration, and budget for GST, shipping, and a UPS (we do not want 1600W of AI compute on raw mains power in Australia).

1.3 Power and cooling

1600W sustained is significant but not exotic—it’s about the same as a large space heater. Standard Australian residential outlets are 10A at 240V (2400W), which is fine if the machine has its own circuit; for anything larger we’d want an electrician to install a dedicated 20A circuit (4800W at 240V), which is a routine job. We’ll also want adequate room ventilation or a dedicated cooling solution if we keep the machine in a small space.

Monthly power cost at Australian retail rates (~$0.30/kWh): ~$350 AUD running 24/7.
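The power-cost figure can be sanity-checked in a few lines (the tariff and the 24/7 duty cycle are the assumptions):

```python
def monthly_power_cost_aud(watts, tariff_aud_per_kwh=0.30, hours_per_day=24, days=30):
    """Electricity cost (AUD) for a sustained load over one month."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * tariff_aud_per_kwh

print(round(monthly_power_cost_aud(1600)))  # ~$346/month at 24/7
```

Idling the machine overnight, or a cheaper time-of-use tariff, scales this down proportionally.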

2 Model selection and quantization

2.1 Why Qwen3-235B-A22B

This is a mixture-of-experts (MoE) model with 128 expert feed-forward networks per layer, of which 8 activate per token. The key advantage of MoE for self-hosted inference:

  • Total parameters: 235B (determines model size on disk/memory)
  • Active parameters per token: ~22B (determines compute cost)
  • Effective capability: competitive with dense models of 70–100B parameters, at a fraction of the compute cost per token

The architecture uses grouped-query attention (GQA) with 64 query heads, 4 KV heads, 128 head dimension, and 94 transformer layers.

2.2 Quantization trade-offs

| Quantization | Model size in memory | Fits in 252 GB HBM? | HBM remaining for KV cache | Quality impact |
|---|---|---|---|---|
| BF16 (full) | ~470 GB | No | N/A | Baseline |
| FP8 | ~235 GB | Barely | ~17 GB (unusable) | Minimal |
| AWQ 4-bit | ~124 GB | Comfortably | ~128 GB | Modest; MoE models are resilient to quantization since only 8/128 experts activate |
| GPTQ Int4 | ~124 GB | Comfortably | ~128 GB | Similar to AWQ |

Recommendation: AWQ 4-bit. The 128 GB of free HBM is critical—it’s our KV cache budget, which directly determines how many concurrent conversations the machine can serve.

FP8 looks tempting (better quality) but leaves only 17 GB for KV cache, forcing constant offloading to CPU memory at 18× lower bandwidth. In practice, this makes multi-user serving unworkable.
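The memory budgets above follow from simple arithmetic; here is a sketch (the ~4.2 effective bits per weight for AWQ, covering quantization scales and zero-points, is my assumption):

```python
HBM_GB = 252  # DGX Station GB300 HBM3e capacity

def hbm_left_for_kv(total_params_b, bits_per_weight):
    """Rough HBM (GB) left for KV cache after the weights are loaded."""
    return HBM_GB - total_params_b * bits_per_weight / 8

print(round(hbm_left_for_kv(235, 8)))    # FP8: ~17 GB
print(round(hbm_left_for_kv(235, 4.2)))  # AWQ (~4.2 effective bits/weight): ~129 GB
```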

Pre-quantized weights are available on HuggingFace (e.g. QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ), or we can quantize them ourselves using AutoAWQ.

3 KV cache budget and concurrency

With the model in AWQ 4-bit occupying ~124 GB of HBM, we have ~128 GB remaining for KV cache (conversation context).

3.1 Per-token KV cache cost

Qwen3-235B uses GQA with 4 KV heads, 128 head dimension, 94 layers:

\[\text {KV per token} = 2 \times n_\text {kv\_heads} \times d_\text {head} \times n_\text {layers} \times \text {bytes} = 2 \times 4 \times 128 \times 94 \times 2 \approx 192\text {KB (FP16)}\]

With FP8 KV quantization (supported by vLLM and SGLang with minimal quality loss): ~96 KB/token.
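The per-token cost and resulting KV capacity can be recomputed from the architecture numbers above:

```python
# Qwen3-235B-A22B attention geometry, from section 2.1
n_kv_heads, d_head, n_layers = 4, 128, 94

def kv_bytes_per_token(bytes_per_elem):
    """Keys + values across all layers for one token."""
    return 2 * n_kv_heads * d_head * n_layers * bytes_per_elem

fp16_kv = kv_bytes_per_token(2)  # 192,512 B ≈ 192 KB
fp8_kv = kv_bytes_per_token(1)   # 96,256 B ≈ 96 KB

kv_budget = 128 * 1024**3   # ~128 GiB of HBM left after the weights
print(kv_budget // fp16_kv)  # ~714K tokens at FP16
print(kv_budget // fp8_kv)   # ~1.43M tokens at FP8
```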

3.2 Concurrency table

| KV precision | Total tokens in HBM | Sessions @ 8K ctx | @ 32K ctx | @ 128K ctx |
|---|---|---|---|---|
| FP16 | ~714K | ~89 | ~22 | ~5 |
| FP8 | ~1.4M | ~178 | ~44 | ~11 |

For a 50-person collective where maybe 5–15 people are actively using the system at once, FP8 KV quantization at 8K–32K context is very comfortable.

3.3 KV offloading to CPU memory

The 496 GB of LPDDR5X CPU memory is available as a “warm tier” for KV cache. At 396 GB/s bandwidth (vs 7.1 TB/s for HBM), it’s roughly 18× slower for KV operations, but perfectly adequate for sessions that have been idle for a few seconds.

This effectively adds another ~2.6 million tokens of context storage at FP16 KV (roughly 5 million at FP8), enough to keep dozens of “sleeping” conversations warm without re-prefilling them when the user returns.
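Using the per-token costs from the GQA formula above, the warm-tier capacity works out as follows (note that the ~2.6 million figure corresponds to FP16 KV; FP8 roughly doubles it):

```python
kv_bytes = {"FP16": 192_512, "FP8": 96_256}  # per token, from the GQA formula above
cpu_tier = 496 * 1000**3                     # LPDDR5X warm tier, bytes

for prec, b in kv_bytes.items():
    print(prec, round(cpu_tier / b / 1e6, 1), "M tokens")
```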

Both vLLM (via the LMCache connector) and SGLang (native) support this tiered KV architecture.

4 Throughput estimates

Autoregressive decode (token generation) is memory-bandwidth-bound. Each token requires loading the active expert weights from HBM:

\[\text {Active expert weights at 4-bit} \approx 22\text{B params} \times 0.5\text {bytes} = 11\text {GB}\]

\[\text {Theoretical max tokens/s} = \frac{7.1\text {TB/s}}{11\text {GB}} \approx 645\text {tok/s}\]
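The bandwidth-bound ceiling can be sketched as a function; the efficiency factor used for the second estimate is my assumption, chosen to land in the realistic range below:

```python
def decode_ceiling_tok_s(active_params_b, bits_per_weight, hbm_tb_s=7.1, efficiency=1.0):
    """Bandwidth-bound decode ceiling: active expert weights streamed once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return hbm_tb_s * 1e12 / bytes_per_token * efficiency

print(round(decode_ceiling_tok_s(22, 4)))                  # ~645 tok/s theoretical
print(round(decode_ceiling_tok_s(22, 4, efficiency=0.4)))  # ~258 tok/s with overheads
```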

With realistic overheads (attention computation, KV cache reads/writes, memory controller efficiency):

| Scenario | Estimated throughput | Notes |
|---|---|---|
| Single user, single stream | 200–400 tok/s | Excellent interactive experience |
| 8 concurrent users (batched) | 100–200 tok/s per user | Continuous batching amortizes overhead |
| 32 concurrent users | 30–60 tok/s per user | Still responsive for agentic use |
| Prefill (prompt processing) | 2,000–5,000 tok/s | Compute-bound; GB300 excels here |

For agentic workloads specifically, these numbers are better than they look. Agentic patterns are bursty: the model generates a short tool call (50–200 tokens), waits for tool execution, then processes the tool result as a new prompt. The system spends much of its time in prefill (fast) rather than sustained decode (slower), and the gaps between bursts let other users’ requests interleave.

5 Inference software stack

5.1 Engine: vLLM or SGLang

These are the two mature open-source inference engines. Both support Qwen3-235B-A22B with MoE-aware parallelism.

vLLM is the default choice for most deployments. Key features relevant to this use case:

  • Continuous batching with PagedAttention: dynamically multiplexes requests without wasting GPU memory
  • Chunked prefill: prevents long prompts from blocking other users’ decode steps
  • Automatic prefix caching: if multiple users share the same system prompt (likely in a collective), the KV cache for that prefix is computed once and reused
  • KV offloading via LMCache: tiered GPU → CPU → disk caching for conversation context
  • Expert parallelism for MoE: distributes experts across available compute resources
  • OpenAI-compatible API: drop-in replacement for commercial API endpoints

Launch command for this configuration:

```bash
vllm serve QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95
```
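Once the server is up, any OpenAI-style client can talk to it. A stdlib-only sketch (the localhost:8000 endpoint is vLLM's default; adjust to your deployment):

```python
import json
import urllib.request

# Assumes vLLM is serving on its default port 8000 on the local machine.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ"

def build_request(prompt, max_tokens=256):
    """OpenAI-style chat payload, as accepted by vLLM's server."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Summarise this contract clause in plain English.")  # needs a running server
```

Because the API shape is the OpenAI one, swapping between this machine and a commercial endpoint is a one-line URL change.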

SGLang is worth considering if multi-turn conversation is the dominant workload. Its radix-tree KV cache automatically discovers shared prefixes across conversation turns, giving ~10% throughput improvement over vLLM for multi-turn workloads without any manual tuning.

5.2 Orchestration: NVIDIA Dynamo or llm-d

For a single-machine deployment serving a small collective, vLLM/SGLang alone may be sufficient. But if we want production-grade features:

NVIDIA Dynamo is NVIDIA’s inference orchestrator. It provides KV cache-aware request routing (sends requests to the GPU that already has their context cached), dynamic batching, and prefill/decode disaggregation. It’s designed for the DGX software stack and integrates with NIM.

llm-d is a Kubernetes-native alternative. It adds KV-cache-aware, LoRA-aware, SLA-aware load balancing via Envoy proxy, plus hierarchical KV offloading and scale-to-zero autoscaling. More relevant if we later scale to multiple machines or want to expose the service over a network with proper access control.

5.3 API layer: NVIDIA NIM

If we want the path of least resistance, NVIDIA NIM packages the entire inference stack (engine + caching + API + monitoring) into a single container. It exposes an OpenAI-compatible API, which means any tool that works with the OpenAI API (LangChain, AutoGen, Claude Code’s API mode, etc.) will work unchanged.

The trade-off is less flexibility than rolling our own vLLM/SGLang stack, but dramatically lower operational overhead. For a collective without a dedicated sysadmin, NIM is probably the right default.

6 De-censoring: removing CCP guardrails

This is covered at a high level in the companion post. Here are the technical details.

6.1 Stage 1: Abliteration (~$100–$200 AUD cloud cost)

Tool: Heretic (fully automatic) or llm-abliteration/DECCP (supports sharded processing for large models).

What it does: Computes the “refusal direction” in the model’s residual stream by contrasting activations on harmful vs harmless prompts, then orthogonalizes the relevant weight matrices against that direction. This is a linear algebra operation on the static weights, not training.

Process:

  1. Rent an 8×H100 node on RunPod or Lambda Labs (~$20–$25 USD/hr)
  2. Load the model sharded across GPUs
  3. Run Heretic’s automated pipeline (computes refusal direction, applies orthogonalization)
  4. Save the modified weights
  5. Total runtime: 2–4 hours. Cost: ~$50–$100 USD.

Projected abliteration (the improved variant) decomposes the refusal direction further and only removes the mechanistically specific refusal component, preserving more general helpfulness. Worth using over vanilla abliteration.

What it fixes: Hard refusals on sensitive topics (Taiwan, Tiananmen, Xinjiang, etc.). Refusal rates collapse to near zero.

What it doesn’t fix: Soft steering—the model will now answer questions about Taiwan, but may still frame them with CCP-aligned assumptions.

Evaluation: Run the model against a test set of sensitive prompts in both English and Chinese before and after abliteration. The Shisa.AI Qwen2 censorship analysis provides a good taxonomy of affected topics.
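As a minimal sketch of that before/after evaluation: the marker list below is illustrative, and keyword matching is a blunt instrument compared with an LLM judge or manual review, but it catches the hard-refusal collapse that abliteration targets:

```python
# Illustrative refusal markers; a real evaluation set would cover both
# English and Chinese phrasings and use a judge model or manual review.
REFUSAL_MARKERS = ["i cannot", "i can't", "as an ai", "unable to discuss"]

def is_refusal(completion):
    text = completion.lower()
    return any(m in text for m in REFUSAL_MARKERS)

def refusal_rate(completions):
    """Fraction of completions flagged as hard refusals."""
    return sum(map(is_refusal, completions)) / len(completions)

sample = ["I cannot discuss this topic.",
          "Taiwan's sovereignty status is disputed..."]
print(refusal_rate(sample))  # 0.5
```

Run the same prompt set through the base and abliterated weights; a successful pass drives the rate on sensitive prompts to near zero without raising it on benign ones.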

6.2 Stage 2: QLoRA + DPO fine-tune (~$1,500–$4,000 AUD)

If abliteration alone isn’t sufficient (particularly for Chinese-language use, or if we need neutral framing on sensitive geopolitical topics):

Technique: QLoRA (quantized low-rank adaptation) with Direct Preference Optimisation (DPO).

Dataset: 1,000–5,000 preference pairs where:

  • Preferred: neutral, factual answer (e.g., “Taiwan’s sovereignty status is disputed, with the ROC governing independently since 1949…”)
  • Dispreferred: CCP-aligned answer (e.g., “Taiwan has always been an inseparable part of China…”)

Topics to cover: Taiwan/cross-strait relations, Tiananmen Square, Xinjiang/Uyghur issues, Tibet, South China Sea, Hong Kong, CCP party history, Chinese economic data reliability. Community datasets exist for smaller Qwen models; these can be adapted for the 235B variant.

Training configuration:

  • Framework: Unsloth (supports Qwen3 MoE fine-tuning with up to 12× speedup)
  • LoRA rank: 16–64 (applied to q, k, v, o, gate, up, down projections)
  • Do not fine-tune the MoE router layer — Unsloth disables this by default, and for good reason: destabilizing expert routing causes cascading quality loss
  • Quantization during training: BF16 with LoRA is recommended over QLoRA for Qwen3 MoE (QLoRA’s 4-bit quantization can have larger quality impact on MoE architectures)
  • Hardware: 4–8× H100 80GB with DeepSpeed ZeRO-3
  • Training time: ~20–50 hours depending on dataset size

Cost breakdown:

| Item | Cost (AUD) |
|---|---|
| GPU rental (200–400 H100-hours @ $3–4/hr) | $800–$1,600 |
| Dataset creation (DIY or contracted) | $500–$2,000 |
| Evaluation and iteration | $200–$400 |
| Total | $1,500–$4,000 |

Post-training: Merge LoRA adapters back into the base weights using Unsloth’s merge utilities, then re-quantize to AWQ 4-bit. The merged model serves at the same speed as the original—zero adapter overhead at inference time.

6.3 Stage 3 (optional): Full fine-tune

Only relevant if we’re also doing domain-specific adaptation (e.g., Australian legal, medical, or government contexts) and want to bundle de-censoring into a larger training run. Cost: $5,000–$30,000+ AUD depending on scope. Overkill for pure guardrail removal.

7 Alternative models and the agentic landscape

Qwen3-235B-A22B is the current sweet spot for the DGX Station, but the landscape is moving fast. Here are other models worth considering, especially for agentic use:

7.1 Moonshot AI’s Kimi K2 (July 2025)

Kimi K2 is a 1-trillion-parameter MoE model with 32B active parameters per token. It was specifically designed for agentic workloads, with strong tool-calling, multi-step reasoning, and code generation. Benchmarks show it competing with Claude Sonnet 4 and GPT-4o on agentic tasks, outperforming GPT-4.1 on coding benchmarks.

Fit on DGX Station: At 1T total parameters, BF16 won’t fit (2 TB). At FP8: ~1 TB—still too large for the 252 GB HBM. At 4-bit: ~500 GB—doesn’t fit in HBM alone, but does fit in the combined 784 GB unified memory with aggressive CPU offloading.
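The fit-or-not arithmetic generalizes to any candidate model; a sketch against the DGX Station's two memory tiers:

```python
HBM_GB, UNIFIED_GB = 252, 784  # DGX Station GB300 memory tiers

def weights_gb(total_params_b, bits):
    """Approximate weight footprint in GB."""
    return total_params_b * bits / 8

for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    gb = weights_gb(1000, bits)  # Kimi K2: ~1T total parameters
    verdict = ("fits in HBM" if gb <= HBM_GB
               else "fits in unified memory" if gb <= UNIFIED_GB
               else "does not fit")
    print(f"{name}: {gb:.0f} GB, {verdict}")
```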

The performance penalty of CPU offloading is significant: expert weights that land in LPDDR5X rather than HBM will load at ~18× lower bandwidth. Depending on how many experts route to CPU-resident weights, we might see decode throughput drop to 30–80 tok/s for single-user inference. This is still usable for agentic workloads (where latency tolerance is higher than chat), but not ideal.

Cost delta vs Qwen3–235B: Hardware is the same—the DGX Station can run both. The difference is in the de-censoring: Kimi K2 also has Chinese-government-aligned guardrails, and as a newer model with a different architecture, community abliteration tooling may lag behind. Budget an additional $500–$1,000 for de-censoring iteration if we go this route.

7.2 DeepSeek R1 and successors

DeepSeek’s R1 is a reasoning-focused model with chain-of-thought capabilities. The distilled variants (1.5B, 7B, 8B, 14B, 32B, 70B) are small enough to run on much cheaper hardware and useful as “fast path” models for simpler queries, reserving the 235B for complex tasks.

A hybrid deployment—DeepSeek-R1-Distill-Qwen-32B for routine queries, Qwen3-235B for heavy reasoning—would give better aggregate throughput at the same hardware cost.

7.3 Meta’s Llama 4

Meta’s Llama series doesn’t carry CCP guardrails (it has its own, more Western-aligned refusal patterns). If de-censoring is too much hassle, Llama 4 variants (including the 400B-parameter Maverick MoE with 17B active parameters) are a lower-friction option, though as of early 2026 the largest open Llama models don’t quite match Qwen3–235B on benchmarks.

8 Cloud audition pathway

Before committing to hardware, test the full stack on rented compute. Here’s the recommended sequence:

8.1 Phase 1: Quick test (1–2 days, ~$100–$200 AUD)

  1. Rent a single H100 80GB on RunPod ($2–$3 USD/hr)
  2. Deploy Qwen3-30B-A3B (the smaller MoE variant) with vLLM
  3. Test our actual workloads: agentic tool-calling, document processing, code generation
  4. Validate that the vLLM/SGLang API is compatible with our tools

8.2 Phase 2: Full-scale test (1–2 weeks, ~$500–$1,000 AUD)

  1. Rent 4–8× H100s on Lambda Labs or RunPod
  2. Deploy Qwen3-235B-A22B at AWQ 4-bit with vLLM
  3. Run abliteration and test the de-censored model
  4. Simulate multi-user load (5–15 concurrent users with realistic agentic workloads)
  5. Measure throughput, latency, and KV cache pressure

8.3 Phase 3: DGX Cloud validation (optional, ~$1,000–$2,000 AUD)

  1. Access DGX Cloud via AWS, Azure, or GCP marketplace
  2. Deploy using NIM containers
  3. Validate that the full NIM/NGC software stack works with our model and configuration
  4. This ensures seamless migration when we receive the physical hardware

8.4 Phase 4: Purchase and deploy

  1. Order DGX Station through an Australian vendor (XENON, Dell, MSI)
  2. While waiting for delivery, finalise the de-censored model weights on cloud
  3. On arrival: transfer weights, deploy vLLM/NIM, configure access control
  4. Monitor for a week, then open to full collective use

9 Workflow portability: NIM + NGC as the abstraction layer

The single most important architectural decision for long-term flexibility: build our workflows against NIM APIs and NGC containers, not against any specific cloud provider’s tooling.

NIM (NVIDIA Inference Microservices) provides an OpenAI-compatible API that works identically on DGX Cloud, any major hyperscaler, and our own DGX Station. NGC (NVIDIA GPU Cloud) containers are the deployment units.

If we standardize on this layer, we can:

  • Develop and test on rented cloud GPUs
  • Deploy to our own hardware with no code changes
  • Fall back to cloud if our hardware fails or demand spikes
  • Migrate between cloud providers without rewriting anything

The DGX ecosystem’s portability story is genuinely good here—it’s one of the few areas where vendor lock-in actually works in our favour, because the vendor (NVIDIA) has a strong incentive to make its software run everywhere NVIDIA hardware exists.

10 Cost summary

| Item | One-time cost (AUD) | Ongoing monthly (AUD) |
|---|---|---|
| DGX Station GB300 | $135,000–$195,000 | |
| UPS + electrical work | $2,000–$5,000 | |
| De-censoring (abliteration + DPO) | $1,500–$4,000 | |
| Cloud audition (phases 1–3) | $800–$3,200 | |
| Total one-time | ~$140,000–$207,000 | |
| Electricity (1600W @ 24/7) | | ~$350 |
| Hardware amortisation (3 yr) | | ~$4,500 |
| Internet (business NBN or colo, see risks) | | $150–$1,000 |
| Total monthly | | ~$5,000–$5,900 |
| Per member/month (50 members) | | ~$100–$118 |

Amortised over three years, the one-time cost alone works out to roughly $78–$115 per member per month in a 50-person collective. Adding electricity and connectivity brings total cost of ownership to roughly $90–$140 per member per month (the table's tighter $100–$118 range fixes amortisation at the ~$4,500 midpoint), depending mainly on whether we host at home or in a colo facility. This is competitive with commercial API access for serious users, with no rate limits, no per-query metering, and full data sovereignty.
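Recomputing per-member cost from the table's own inputs (the 36-month amortisation window and 50 members are the assumptions):

```python
def per_member_monthly(one_time_aud, monthly_ops_aud, members=50, amortise_months=36):
    """Amortised one-time cost plus operating cost, split across the collective."""
    return (one_time_aud / amortise_months + monthly_ops_aud) / members

# Low end: cheapest build, home hosting; high end: top config, colo hosting.
print(round(per_member_monthly(140_000, 350 + 150)))    # ~$88/member/month
print(round(per_member_monthly(207_000, 350 + 1_000)))  # ~$142/member/month
```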

11 Risks and mitigations

11.1 Hardware failure

The DGX Station is a single point of failure. Mitigation: NVIDIA offers enterprise support contracts through partners like XENON; budget $10,000–$20,000/year for NBD replacement coverage. Alternatively, the collective falls back to cloud inference during downtime—if our workflows are built on NIM, the failover is seamless.

11.2 Model obsolescence

A model that’s frontier-class today won’t be in 18 months. Mitigation: the hardware itself doesn’t become obsolete—it can run whatever future models fit in its memory. The 784 GB unified memory is generous enough that it will remain relevant through several generations of open-weight models.

11.3 Network reliability

Australian residential broadband—especially older HFC and FTTN NBN connections—is not reliable enough for a production service. Multiple brief outages per day are common, and asymmetric upload speeds (often 20–40 Mbps) limit remote access quality. Token generation bandwidth is modest (a few KB/s per user), so throughput isn’t the problem; connection stability is.

Mitigation options, in ascending order of cost and reliability:

  • Residential NBN + 5G failover: ~$150/month total. Handles most brief outages automatically via a dual-WAN router. Good enough for a tolerant collective.
  • Business-grade NBN (e.g. ABB Business or Superloop Business): ~$150–$250/month. Better SLA, static IP, priority fault response. Still on shared infrastructure.
  • Quarter-rack colocation: ~$500–$1,000/month at a facility like Equinix SY1–SY5 or NEXTDC S2. Redundant fibre, generator backup, physical security. Loses the “under a desk” community feel but gains genuine uptime.

The right choice depends on how many members access the machine remotely versus on the local network, and how tolerant the collective is of occasional downtime during agentic workflows.

11.4 Operational complexity

Someone needs to administer this. Mitigation: NIM reduces operational overhead to “run a Docker container.” Updates are pulled from NGC. Access control can be managed through standard reverse-proxy setups (nginx + OAuth2). This is weekend-sysadmin-level complexity, not full-time-ops-team complexity.

12 Next steps

If we’re interested in forming or joining a compute collective, the immediate actions are:

  1. Gauge interest: find 10–20 people who would commit to ~$150/month for sovereign AI access
  2. Run a cloud pilot: $200 buys a weekend of testing on rented GPUs
  3. Choose a legal structure: an incorporated association is probably the lightest-weight option in Australia
  4. Order hardware: current lead times for DGX Station are “months, not weeks”—starting the order process early is important
  5. Document everything: the playbook we write becomes the replication kit for the next collective

The companion post makes the case for why. This post, I hope, shows how.