Sovereign compute for small collectives: a technical implementation guide

Hardware, software, de-censoring, and the path from cloud audition to owned infrastructure

2026-03-22 — 2026-05-17

Wherein Australian Cooperative Law Is Considered Alongside GPU Memory Budgets, and the Removal of Political-Content Restrictions From Open-Weight Models Is Addressed via Weight-Matrix Orthogonalization.

AI
community project
cooperation
diy
engineering
faster pussycat
institutions
resilient tech
straya
wonk

Follow along while we work this out— danmackinlay/SOV or Sign up for updates.

Figure 1

This is the technical companion to my sovereign LLM post, which makes the institutional and geopolitical case for small collectives owning their own AI inference hardware. This post lays out the concrete implementation pathway: what to buy, what software to run, how to remove CCP guardrails from open-weight models, and what performance to expect.

There was a lot of research to make this work, so it is heavily AI-assisted. Be wary of hallucinations.

I’ll walk through a worked example targeting Alibaba’s Qwen3–235B-A22B served on an NVIDIA DGX Station GB300, since that’s the sweet spot I identified in the companion post. But most of this generalises to other models and hardware.

Sounds interesting? Get in touch.

1 Hardware: DGX Station GB300

1.1 Specifications

Component Specification
GPU NVIDIA GB300 Grace Blackwell Ultra Superchip
GPU memory (HBM3e) 252 GB @ 7.1 TB/s bandwidth
CPU memory (LPDDR5X) 496 GB @ 396 GB/s bandwidth
Total unified memory 784 GB
AI compute ~20 PFLOPs (FP4)
CPU-GPU interconnect 900 GB/s NVLink-C2C
Networking 800 Gb/s CX8 SuperNIC
Power 1600W TDP
Form factor Desktop tower

1.2 Australian purchasing options

The DGX Station ships through NVIDIA’s partner network. Australian vendors include:

  • XENON Systems — NVIDIA Elite Partner for ANZ, handles the full DGX product line, including enterprise support contracts
  • Dell Australia — official NVIDIA Partner Network reseller
  • MSI — its XpertStation WS300 is the DGX Station reference design, listed at $85,000 USD
  • MMT — Australia’s NVIDIA distributor, handy for configure-to-order options

We should expect $135,000–$195,000 AUD landed, depending on configuration. We should budget for GST, shipping, and a UPS (we do not want 1600W of AI compute on raw mains power in Australia).

1.3 Power and cooling

1600W sustained is significant but not exotic—it’s about the same as a large space heater. Standard Australian residential outlets are 10 A at 240 V (2400 W), which is fine as long as it’s on its own circuit; we’d want an electrician to install a dedicated 20 A circuit for anything larger (4800 W at 240 V), which is a routine job. We’ll want adequate room ventilation or a dedicated cooling solution if we keep the machine in a small space.

Monthly power cost at Australian retail rates (~$0.30/kWh): ~$350 AUD running 24/7.

2 Model selection and quantization

2.1 Why Qwen3–235B-A22B

This is a mixture-of-experts (MoE) model with 128 expert feed-forward networks per layer, of which 8 activate per token. The key advantage of MoE for self-hosted inference:

  • Total parameters: 235B (determines model size on disk/memory)
  • Active parameters per token: ~22B (determines compute cost)
  • Effective capability: competitive with dense models of 70–100B parameters, at a fraction of the compute cost per token

The architecture uses grouped-query attention (GQA) with 64 query heads, 4 KV heads, 128 head dimension, and 94 transformer layers.

2.2 Quantization trade-offs

Quantization Model size in memory Fits in 252GB HBM? HBM remaining for KV cache Quality impact
BF16 (full) ~470 GB No N/A Baseline
FP8 ~235 GB Barely ~17 GB (unusable) Minimal
AWQ 4-bit ~124 GB Comfortable ~128 GB Modest; MoE models are resilient to quantization since only 8/128 experts activate
GPTQ Int4 ~124 GB Comfortable ~128 GB Similar to AWQ

Recommendation: AWQ 4-bit. The 128 GB of free HBM is critical—it’s our KV cache budget, which directly determines how many concurrent conversations the machine can serve.

FP8 looks tempting (better quality) but leaves only 17 GB for KV cache, forcing constant offloading to CPU memory at 18× lower bandwidth. In practice, this makes multi-user serving unworkable.

Pre-quantized weights are available on HuggingFace (e.g. ‘QuantTrio/Qwen3–235B-A22B-Instruct-2507-AWQ’) or we can quantize them ourselves using AutoAWQ.

3 KV cache budget and concurrency

With the model in AWQ 4-bit occupying ~124 GB of HBM, we have ~128 GB remaining for KV cache (conversation context).

3.1 Per-token KV cache cost

Qwen3–235B uses GQA with 4 KV heads, 128 head dimension, 94 layers:

\[\text {KV per token} = 2 \times n_\text {kv\_heads} \times d_\text {head} \times n_\text {layers} \times \text {bytes} = 2 \times 4 \times 128 \times 94 \times 2 \approx 192\text {KB (FP16)}\]

With FP8 KV quantization (supported by Vllm and SGLang with minimal quality loss): ~96 KB/token.

3.2 Concurrency table

KV precision Total tokens in HBM Concurrent sessions @ 8K ctx @ 32K ctx @ 128K ctx
FP16 ~714K ~89 ~22 ~5
FP8 ~1.4M ~178 ~44 ~11

For a 50-person collective where maybe 5–15 people are actively using the system at once, FP8 KV quantization at 8K–32K context is very comfortable.

3.3 KV offloading to CPU memory

The 496 GB of LPDDR5× CPU memory is available as a “warm tier” for KV cache. At 396 GB/s bandwidth (vs 7.1 TB/s for HBM), it’s roughly 18× slower for KV operations, but perfectly adequate for sessions that have been idle for a few seconds.

This effectively adds another ~2.6 million tokens of context storage (at FP8 KV), enough to keep dozens of “sleeping” conversations warm without re-prefilling them when the user returns.

Both Vllm (via LMCache connector) and SGLang (native) support this tiered KV architecture.

4 Sparse attention and the concurrency wall

The concurrency numbers above assume a conventional attention mechanism: the KV cache grows linearly with context length and competes with the weights for HBM. A model released in April 2026 breaks that assumption, and it changes the arithmetic the whole sovereign-per-box case rests on.

4.1 The accounting identity

Restating the constraint from the section above as bluntly as possible:

fast GPU memory  =  weights (paid once)  +  KV cache (paid per user, per token)

concurrent users  ≈  (memory left after weights)  ÷  (KV cost of one conversation)

Weights are a fixed cost shared by everyone; the KV cache is per-conversation and per-token, and it sits in the same scarce HBM as the weights. Using the figures already audited above—252 GB HBM, ~124 GB for the model at 4-bit, leaving ~128 GB for KV, at ~96 KB/token (FP8 KV)—the box holds about 1.4M tokens of context at once. Divide by session length to turn tokens into people:

Session length Concurrent users (conventional model)
8K tokens ~175
32K tokens ~44
128K tokens ~11
1M tokens ~1.4

Read top to bottom: concurrency falls off a cliff as conversations lengthen. A 50-person collective with 5–15 active members is comfortable at 8K–32K and falls over at 1M, where a single member running one million-token agentic session fills the box. This is the assumption hiding in the economics: the comfortable numbers assume short conversations, and the collective’s headline use case—long-context agentic work—drives sessions straight at the bottom row.

4.2 The same arithmetic with DeepSeek V4

DeepSeek V4 (April 2026, weights) ships two open-weight variants:

  • V4-Pro—1.6T total / 49B active. Rivals top closed models on agentic coding and reasoning.
  • V4-Flash—284B total / 13B active. Reasoning approaches V4-Pro; multi-tier thinking modes.

The parameter counts are not the point; the attention design is. V4 uses sparse attention (DeepSeek calls it DSA): it compresses context before storing it, so a full million-token conversation costs roughly 9.6 GB of KV cache rather than the tens of GB a conventional model of the same class needs—about a 90% reduction, per the DeepSeek tech report and corroborated by independent write-ups from Together.ai and the Hugging Face tech blog.

V4-Flash changes one input to the calculation above. Same identity, new numbers:

Quantity Value Note
Fast memory (M) 252 GB same box
V4-Flash weights, 4-bit (W) ≈90 GB search-corroborated; conservative
KV budget B = M − W ≈160 GB rounded down for overhead
KV cost of a full 1M-token session ≈9.6 GB DeepSeek published figure
concurrent users, each holding a FULL 1M-token context:
   B ÷ 9.6 GB  =  160 GB ÷ 9.6 GB  ≈  16–17 users

The two worked examples side by side, at the regime the collective cares about:

                        conventional model      V4-Flash + DSA
  @ 1M-token session  :     ≈ 1.4 users    →      ≈ 16 users     (~11× more)
  @ 128K-token session:     ≈ 11 users     →      hundreds*      (KV no longer the limit)

*At 128K each session uses a small fraction of the 9.6 GB figure, so KV memory stops being what limits the box in that regime; the constraint moves elsewhere (see caveats).

The economic case was tightest exactly where the collective most wants to work—long-context agentic—because the KV wall put it there. Sparse attention moves the wall out by roughly an order of magnitude in the memory dimension. The box does not become infinite; the question changes from “we run out of memory after one power user” to “how much throughput does the box have”, which is measurable and fundable.

4.3 Caveats

  1. These are illustrative figures, not measurements. They reuse the conventional numbers audited above and DeepSeek’s published DSA figure; the phase-1 benchmark workstream exists to replace them with measured numbers. Treat the above as order-of-magnitude.
  2. A memory ceiling is not a throughput guarantee. Sixteen million-token sessions fitting in memory does not mean the box can generate tokens fast enough for sixteen heavy users at once. DSA also cuts attention compute—to roughly 27% of the prior generation’s at 1M context, per the tech report—which helps, but the binding constraint moves rather than disappears.
  3. We anchored on the endpoints we can cite—the audited conventional KV figure and DeepSeek’s published 1M figure—rather than inventing V4-Flash’s intermediate per-token head maths.

5 Throughput estimates

Autoregressive decode (token generation) is memory-bandwidth-bound. Each token requires loading the active expert weights from HBM:

\[\text {Active expert weights at 4-bit} \approx 22\text{B params} \times 0.5\text {bytes} = 11\text {GB}\]

\[\text {Theoretical max tokens/s} = \frac{7.1\text {TB/s}}{11\text {GB}} \approx 645\text {tok/s}\]

With realistic overheads (attention computation, KV cache reads/writes, memory controller efficiency):

Scenario Estimated throughput Notes
Single user, single stream 200–400 tok/s Excellent interactive experience
8 concurrent users (batched) 100–200 tok/s per user Continuous batching amortizes overhead
32 concurrent users 30–60 tok/s per user Still responsive for agentic use
Prefill (prompt processing) 2,000–5,000 tok/s Compute-bound; GB300 excels here

For agentic workloads specifically, these numbers are better than they look. Agentic patterns are bursty: the model generates a short tool call (50–200 tokens), waits for tool execution, then processes the tool result as a new prompt. The system spends much of its time in prefill (fast) rather than sustained decode (slower), and the gaps between bursts let other users’ requests interleave.

6 Inference software stack

6.1 Engine: Vllm or SGLang

These are the two mature open-source inference engines. Both support Qwen3–235B-A22B with MoE-aware parallelism.

Vllm is the default choice for most deployments. Key features relevant to this use case:

  • Continuous batching with PagedAttention: dynamically multiplexes requests without wasting GPU memory
  • Chunked prefill: prevents long prompts from blocking other users’ decode steps
  • Automatic prefix caching: if multiple users share the same system prompt (likely in a collective), the KV cache for that prefix is computed once and reused
  • KV offloading via LMCache: tiered GPU → CPU → disk caching for conversation context
  • Expert parallelism for MoE: distributes experts across available compute resources
  • OpenAI-compatible API: drop-in replacement for commercial API endpoints

Launch command for this configuration:

“bash vllm serve QuantTrio/Qwen3–235B-A22B-Instruct-2507-AWQ
–quantization awq
–tensor-parallel-size 1
–max-model-len 32768
–enable-prefix-caching
–kv-cache-dtype fp8
–gpu-memory-utilization 0.95 ‘’’

SGLang is worth considering if multi-turn conversation is the dominant workload. Its radix tree KV cache automatically discovers shared prefixes across conversation turns, giving ~10% throughput improvement over vLLM for multi-turn workloads without any manual tuning.

6.2 Orchestration: NVIDIA Dynamo or llm-d

For a single-machine deployment serving a small collective, vLLM/SGLang alone may be sufficient. But if we want production-grade features:

NVIDIA Dynamo is NVIDIA’s inference orchestrator. It provides KV cache-aware request routing (sends requests to the GPU that already has their context cached), dynamic batching, and prefill/decode disaggregation. It’s designed for the DGX software stack and integrates with NIM.

llm-d is a Kubernetes-native alternative. It adds KV-cache-aware, LoRA-aware, SLA-aware load balancing via Envoy proxy, plus hierarchical KV offloading and scale-to-zero autoscaling. More relevant if we later scale to multiple machines or want to expose the service over a network with proper access control.

6.3 API layer: NVIDIA NIM

If we want the path of least resistance, NVIDIA NIM packages the entire inference stack (engine + caching + API + monitoring) into a single container. It exposes an OpenAI-compatible API, which means any tool that works with the OpenAI API (LangChain, AutoGen, Claude Code’s API mode, etc.) will work unchanged.

The trade-off is less flexibility than rolling our own vLLM/SGLang stack, but dramatically lower operational overhead. For a collective without a dedicated sysadmin, NIM is probably the right default.

7 De-censoring: removing CCP guardrails

This is covered at a high level in the companion post. Here are the technical details.

7.1 Stage 1: Abliteration (~$100–$200 AUD cloud cost)

Tool: Heretic (fully automatic) or llm-abliteration/DECCP (supports sharded processing for large models).

What it does: Computes the “refusal direction” in the model’s residual stream by contrasting activations on harmful versus harmless prompts, then orthogonalizes the relevant weight matrices against that direction. This is a linear algebra operation on the static weights, not training.

Process:

  1. Rent an 8×H100 node on RunPod or Lambda Labs (~$20–$25 USD/hr)
  2. Load the model sharded across GPUs
  3. Run Heretic’s automated pipeline (computes refusal direction, applies orthogonalization)
  4. Save the modified weights
  5. Total runtime: 2–4 hours. Cost: ~$50–$100 USD.

Projected abliteration (the improved variant) decomposes the refusal direction further and only removes the mechanistically specific refusal component, preserving more general helpfulness. Worth using over vanilla abliteration.

What it fixes: Hard refusals on sensitive topics (Taiwan, Tiananmen, Xinjiang, etc.). Refusal rates collapse to near zero.

What it doesn’t fix: Soft steering—the model will now answer questions about Taiwan, but may still frame them with CCP-aligned assumptions.

Evaluation: Run the model against a test set of sensitive prompts in both English and Chinese before and after abliteration. The Shisa.AI Qwen2 censorship analysis provides a good taxonomy of affected topics.

7.2 Stage 2: QLoRA + DPO fine-tune (~$1,500–$4,000 AUD)

If abliteration alone isn’t sufficient (particularly for Chinese-language use, or if we need neutral framing on sensitive geopolitical topics):

Technique: QLoRA (quantized low-rank adaptation) with Direct Preference Optimization (DPO).

Dataset: 1,000–5,000 preference pairs where:

  • Preferred: neutral, factual answer (e.g., “Taiwan’s sovereignty status is disputed, with the ROC governing independently since 1949…”)
  • Dispreferred: CCP-aligned answer (e.g., “Taiwan has always been an inseparable part of China…”)

Topics to cover: Taiwan/cross-strait relations, Tiananmen Square, Xinjiang/Uyghur issues, Tibet, South China Sea, Hong Kong, CCP party history, Chinese economic data reliability. Community datasets exist for smaller Qwen models; these can be adapted for the 235B variant.

Training configuration:

  • Framework: Unsloth (supports Qwen3 MoE fine-tuning with up to 12× speedup)
  • LoRA rank: 16–64 (applied to q, k, v, o, gate, up, down projections)
  • Do not fine-tune the MoE router layer — Unsloth disables this by default, and for good reason: destabilising expert routing causes cascading quality loss
  • Quantization during training: BF16 with LoRA is recommended over QLoRA for Qwen3 MoE (QLoRA’s 4-bit quantization can have a larger quality impact on MoE architectures)
  • Hardware: 4–8× H100 80GB with DeepSpeed ZeRO-3
  • Training time: ~20–50 hours depending on dataset size

Cost breakdown:

Item Cost (AUD)
GPU rental (200–400 H100-hours @ $3–4/hr) $800–$1,600
Dataset creation (DIY or contracted) $500–$2,000
Evaluation and iteration $200–$400
Total $1,500–$4,000

Post-training: Merge LoRA adapters back into the base weights using Unsloth’s merge utilities, then re-quantize to AWQ 4-bit. The merged model serves at the same speed as the original—zero adapter overhead at inference time.

7.3 Stage 3 (optional): Full fine-tune

Only relevant if we’re also doing domain-specific adaptation (e.g., Australian legal, medical, or government contexts) and want to bundle de-censoring into a larger training run. Cost: $5,000–$30,000+ AUD depending on scope. Overkill for pure guardrail removal.

8 Alternative models and the agentic landscape

Qwen3–235B-A22B is the current sweet spot for the DGX Station, but the landscape is moving fast. Here are other models worth considering, especially for agentic use:

8.1 DeepSeek V4 (April 2026)

The architecture and concurrency consequences are worked through in the sparse-attention section above; here is the deployment-specific summary.

V4-Pro (1.6T total / 49B active) is a cloud-rented ceiling reference only. At ≈648 GB even at Q6 it does not usably fit the owned box—the same failure mode as Kimi K2, worse. Its role is the audition question: here is the best open model money can rent versus the one that fits our box; is the gap acceptable?

V4-Flash (284B total / 13B active) does fit, and it enters the model bake-off as a third arm alongside the Qwen options—not an automatic swap, but a same-compute-class candidate with stronger published agentic numbers and native vLLM deepseek_v4 support.

Two operational caveats specific to V4:

  • Pin vLLM to a known-good commit, not a release tag. There are multiple reports of V4 working and then breaking across vLLM commits (NVIDIA developer forum, May 2026). This sharpens the “never run :latest” discipline rather than adding a new rule.
  • Abliteration lags the architecture. Heretic issue #310 (V4-Flash) is open and unresolved as of May 2026; a community-abliterated V4-Flash exists but only as a GGUF for llama.cpp, so producing a vLLM-servable FP8 de-censored build is still work, not a download. Don’t assume Qwen-level abliteration maturity.

The CCP-guardrail position is unchanged: open weights neutralise the dependency problem, not the alignment one. V4 only changes which model sits under the de-censoring knife.

8.2 Moonshot AI’s Kimi K2 (July 2025)

Kimi K2 is a 1-trillion-parameter MoE model with 32B active parameters per token. It was specifically designed for agentic workloads, with strong tool-calling, multi-step reasoning, and code generation. Benchmarks show it competing with Claude Sonnet 4 and GPT-4o on agentic tasks, outperforming GPT-4.1 on coding benchmarks.

Fit on DGX Station: At 1T total parameters, BF16 won’t fit (2 TB). At FP8: ~1 TB—still too large for the 252 GB HBM. At 4-bit: ~500 GB—doesn’t fit in HBM alone, but does fit in the combined 784 GB unified memory with aggressive CPU offloading.

The performance penalty of CPU offloading is significant: expert weights that land in LPDDR5X rather than HBM will load at ~18× lower bandwidth. Depending on how many experts route to CPU-resident weights, we might see decode throughput drop to 30–80 tok/s for single-user inference. This is still usable for agentic workloads (where latency tolerance is higher than in chat), but not ideal.

Cost delta vs Qwen3–235B: Hardware is the same—the DGX Station can run both. The difference is in the de-censoring: Kimi K2 also has Chinese-government-aligned guardrails, and as a newer model with a different architecture, community abliteration tooling may lag behind. Budget an additional $500–$1,000 for de-censoring iteration if we go this route.

8.3 DeepSeek R1 and successors

DeepSeek’s R1 is a reasoning-focused model with chain-of-thought capabilities. The distilled variants (1.5B, 7B, 8B, 14B, 32B, 70B) are small enough to run on much cheaper hardware and useful as “fast path” models for simpler queries, reserving the 235B for complex tasks.

A hybrid deployment—DeepSeek-R1-Distill-Qwen-32B for routine queries, Qwen3–235B for heavy reasoning—would give better aggregate throughput at the same hardware cost.

8.4 Meta’s Llama 4

Meta’s Llama series doesn’t carry CCP guardrails (it has its own, more Western-aligned refusal patterns). If de-censoring is too much hassle, Llama 4 variants (including the 400B-parameter Maverick MoE with 17B active parameters) are a lower-friction option, though as of early 2026 the largest open Llama models don’t quite match Qwen3–235B on benchmarks.

9 Cloud audition pathway

Before committing to hardware, test the full stack on rented compute. Here’s the recommended sequence:

9.1 Phase 1: Quick test (1–2 days, ~$100–$200 AUD)

  1. Rent a single H100 80GB on RunPod ($2–$3 USD/hr)
  2. Deploy Qwen3–30B-A3B (the smaller MoE variant) with vLLM
  3. Test our actual workloads: agentic tool-calling, document processing, code generation
  4. Validate that the vLLM/SGLang API is compatible with our tools

9.2 Phase 2: Full-scale test (1–2 weeks, ~$500–$1,000 AUD)

  1. Rent 4–8× H100s on Lambda Labs or RunPod
  2. Deploy Qwen3–235B-A22B at AWQ 4-bit with vLLM
  3. Run abliteration and test the de-censored model
  4. Simulate multi-user load (5–15 concurrent users with realistic agentic workloads)
  5. Measure throughput, latency, and KV cache pressure

9.3 Phase 3: DGX Cloud validation (optional, ~$1,000–$2,000 AUD)

  1. Access DGX Cloud via AWS, Azure, or GCP marketplace
  2. Deploy using NIM containers
  3. Validate that the full NIM/NGC software stack works with our model and configuration
  4. This ensures seamless migration when we receive the physical hardware

9.4 Phase 4: Purchase and deploy

  1. Order DGX Station through an Australian vendor (XENON, Dell, MSI)
  2. While waiting for delivery, finalise the de-censored model weights on cloud
  3. On arrival: transfer weights, deploy vLLM/NIM, configure access control
  4. Monitor for a week, then open to full collective use

10 Workflow portability: NIM + NGC as the abstraction layer

The single most important architectural decision for long-term flexibility: build our workflows against NIM APIs and NGC containers, not against any specific cloud provider’s tooling.

NIM (NVIDIA Inference Microservices) provides an OpenAI-compatible API that works identically on DGX Cloud, any major hyperscaler, and our own DGX Station. NGC (NVIDIA GPU Cloud) containers are the deployment units.

If we standardise on this layer, we can:

  • Develop and test on rented cloud GPUs
  • Deploy to our own hardware with no code changes
  • Fall back to cloud if our hardware fails or demand spikes
  • Migrate between cloud providers without rewriting anything

The DGX ecosystem’s portability story is genuinely good here—it’s one of the few areas where vendor lock-in actually works in our favour, because the vendor (NVIDIA) has a strong incentive to make its software run everywhere NVIDIA hardware exists.

11 Cost summary

Item One-time cost (AUD) Ongoing monthly (AUD)
DGX Station GB300 $135,000–$195,000
UPS + electrical work $2,000–$5,000
De-censoring (abliteration + DPO) $1,500–$4,000
Cloud audition (phases 1–3) $800–$3,200
Total one-time ~$140,000–$207,000
Electricity (1600W @ 24/7) ~$350
Hardware amortization (3yr) ~$4,500
Internet (business NBN or colo, see risks) $150–$1,000
Total monthly ~$5,000–$5,900
Per member/month (50 members) ~$100–$118

The one-time cost amortized over three years adds roughly $55–$75/month per member in a 50-person collective. Combined with operating costs, total cost of ownership is $155–$193/month per member depending mainly on whether we host at home or in a colo facility. This is competitive with commercial API access for serious users—and with the advantages of no rate limits, no per-query metering, and full data sovereignty.

12 Risks and mitigations

12.1 Hardware failure

The DGX Station is a single point of failure. Mitigation: NVIDIA offers enterprise support contracts through partners like XENON; budget $10,000–$20,000/year for next-business-day replacement coverage. Alternatively, the collective falls back to cloud inference during downtime—if our workflows are built on NIM, the failover is seamless.

12.2 Model obsolescence

A model that’s frontier-class today won’t be in 18 months. Mitigation: the hardware itself doesn’t become obsolete—it can run whatever future models fit in its memory. The 784 GB unified memory is generous enough that it will remain relevant through several generations of open-weight models.

12.3 Network reliability

Australian residential broadband—especially older HFC and FTTN NBN connections—is not reliable enough for a production service. Multiple brief outages per day are common, and asymmetric upload speeds (often 20–40 Mbps) limit remote access quality. Token generation bandwidth is modest (a few KB/s per user), so throughput is not the problem; connection stability is.

Mitigation options, in ascending order of cost and reliability:

  • Residential NBN + 5G failover: ~$150/month total. Handles most brief outages automatically via a dual-WAN router. Good enough for a tolerant collective.
  • Business-grade NBN (e.g. ABB Business or Superloop Business): ~$150–$250/month. Better SLA, static IP, priority fault response. Still on shared infrastructure.
  • Quarter-rack colocation: ~$500–$1,000/month at a facility like Equinix SY1–SY5 or NEXTDC S2. Redundant fibre, generator backup, physical security. Loses the “under a desk” community feel but gains genuine uptime.

The right choice depends on how many members access the machine remotely versus on the local network, and how tolerant the collective is of occasional downtime during agentic workflows.

12.4 Operational complexity

Someone needs to administer this. Mitigation: NIM reduces operational overhead to “run a Docker container.” Updates are pulled from NGC. Access control can be managed through standard reverse proxy setups (nginx + OAuth2). This is weekend-sysadmin-level complexity, not full-time-ops-team complexity.

14 Next steps

If we’re interested in forming or joining a compute collective, the immediate actions are:

  1. Gauge interest: find 10–20 people who would commit to ~$150/month for sovereign AI access
  2. Run a cloud pilot: $200 buys a weekend of testing on rented GPUs
  3. Choose a legal structure: start as an incorporated association, convert to a cooperative if the model works (see legal structure section above)
  4. Order hardware: current lead times for DGX Station are “months, not weeks”—starting the order process early is important
  5. Document everything: the playbook we write becomes the replication kit for the next collective

The companion post makes the case for why. This post, I hope, makes the case for how.