Sovereign compute for small collectives: a technical implementation guide

Hardware, software, de-censoring, and the path from cloud audition to owned infrastructure

2026-03-22

Wherein a DGX Station’s 1600W appetite is accounted for, with an Australian tariff yielding ~$350 monthly, and a cloud audition on H100s is prescribed ere purchase.

AI
cooperation
faster pussycat
institutions
straya
wonk
engineering

This is the technical companion to my sovereign LLM post, which makes the institutional and geopolitical case for small collectives owning their own AI inference hardware. This post lays out the concrete implementation pathway: what to buy, what software to run, how to remove CCP guardrails from open-weight models, and what performance to expect.

This took a lot of research, much of it heavily AI-assisted, so be wary of hallucinations.

I’ll walk through a worked example targeting Alibaba’s Qwen3-235B-A22B served on an NVIDIA DGX Station GB300, since that’s the sweet spot I identified in the companion post. But most of this generalizes to other models and hardware.

1 Hardware: DGX Station GB300

1.1 Specifications

| Component | Specification |
|---|---|
| GPU | NVIDIA GB300 Grace Blackwell Ultra Superchip |
| GPU memory (HBM3e) | 252 GB @ 7.1 TB/s bandwidth |
| CPU memory (LPDDR5X) | 496 GB @ 396 GB/s bandwidth |
| Total unified memory | 784 GB |
| AI compute | ~20 PFLOPs (FP4) |
| CPU-GPU interconnect | 900 GB/s NVLink-C2C |
| Networking | 800 Gb/s CX8 SuperNIC |
| Power | 1600 W TDP |
| Form factor | Desktop tower |

1.2 Australian purchasing options

The DGX Station ships through NVIDIA’s partner network. Australian vendors include:

  • XENON Systems — NVIDIA Elite Partner for ANZ, handles full DGX product line including enterprise support contracts
  • Dell Australia — official NVIDIA Partner Network reseller
  • MSI — its XpertStation WS300 is the DGX Station reference design, listed at $85,000 USD
  • MMT — Australia’s NVIDIA distributor, handy for configure-to-order options

Expect $135,000–$195,000 AUD landed, depending on configuration, and budget for GST, shipping, and a UPS (we do not want 1600W of AI compute on raw mains power in Australia).

1.3 Power and cooling

1600W sustained is significant but not exotic—it’s about the same as a large space heater. Standard Australian residential outlets are 10A at 240V (2400W), which is fine if the machine has its own circuit; for anything larger we’d want an electrician to install a dedicated 20A circuit (4800W at 240V), which is a routine job. We’ll also want adequate room ventilation or a dedicated cooling solution if we keep the machine in a small space.

Monthly power cost at Australian retail rates (~$0.30/kWh): ~$350 AUD running 24/7.
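The power-cost figure can be sanity-checked in a few lines (the tariff and the 24/7 duty cycle are the assumptions):

```python
def monthly_power_cost_aud(watts, tariff_aud_per_kwh=0.30, hours_per_day=24, days=30):
    """Electricity cost (AUD) for a sustained load over one month."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * tariff_aud_per_kwh

print(round(monthly_power_cost_aud(1600)))  # ~$346/month at 24/7
```

Idling the machine overnight, or a cheaper time-of-use tariff, scales this down proportionally.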

2 Model selection and quantization

2.1 Why Qwen3-235B-A22B

This is a mixture-of-experts (MoE) model with 128 expert feed-forward networks per layer, of which 8 activate per token. The key advantage of MoE for self-hosted inference:

  • Total parameters: 235B (determines model size on disk/memory)
  • Active parameters per token: ~22B (determines compute cost)
  • Effective capability: competitive with dense models of 70–100B parameters, at a fraction of the compute cost per token

The architecture uses grouped-query attention (GQA) with 64 query heads, 4 KV heads, 128 head dimension, and 94 transformer layers.

2.2 Quantization trade-offs

| Quantization | Model size in memory | Fits in 252 GB HBM? | HBM remaining for KV cache | Quality impact |
|---|---|---|---|---|
| BF16 (full) | ~470 GB | No | N/A | Baseline |
| FP8 | ~235 GB | Barely | ~17 GB (unusable) | Minimal |
| AWQ 4-bit | ~124 GB | Comfortably | ~128 GB | Modest; MoE models are resilient to quantization since only 8/128 experts activate |
| GPTQ Int4 | ~124 GB | Comfortably | ~128 GB | Similar to AWQ |

Recommendation: AWQ 4-bit. The 128 GB of free HBM is critical—it’s our KV cache budget, which directly determines how many concurrent conversations the machine can serve.

FP8 looks tempting (better quality) but leaves only 17 GB for KV cache, forcing constant offloading to CPU memory at 18× lower bandwidth. In practice, this makes multi-user serving unworkable.
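The memory budgets above follow from simple arithmetic; here is a sketch (the ~4.2 effective bits per weight for AWQ, covering quantization scales and zero-points, is my assumption):

```python
HBM_GB = 252  # DGX Station GB300 HBM3e capacity

def hbm_left_for_kv(total_params_b, bits_per_weight):
    """Rough HBM (GB) left for KV cache after the weights are loaded."""
    return HBM_GB - total_params_b * bits_per_weight / 8

print(round(hbm_left_for_kv(235, 8)))    # FP8: ~17 GB
print(round(hbm_left_for_kv(235, 4.2)))  # AWQ (~4.2 effective bits/weight): ~129 GB
```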

Pre-quantized weights are available on HuggingFace (e.g. QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ), or we can quantize them ourselves using AutoAWQ.

3 KV cache budget and concurrency

With the model in AWQ 4-bit occupying ~124 GB of HBM, we have ~128 GB remaining for KV cache (conversation context).

3.1 Per-token KV cache cost

Qwen3-235B uses GQA with 4 KV heads, 128 head dimension, 94 layers:

\[\text {KV per token} = 2 \times n_\text {kv\_heads} \times d_\text {head} \times n_\text {layers} \times \text {bytes} = 2 \times 4 \times 128 \times 94 \times 2 \approx 192\text {KB (FP16)}\]

With FP8 KV quantization (supported by vLLM and SGLang with minimal quality loss): ~96 KB/token.
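The per-token cost and resulting KV capacity can be recomputed from the architecture numbers above:

```python
# Qwen3-235B-A22B attention geometry, from section 2.1
n_kv_heads, d_head, n_layers = 4, 128, 94

def kv_bytes_per_token(bytes_per_elem):
    """Keys + values across all layers for one token."""
    return 2 * n_kv_heads * d_head * n_layers * bytes_per_elem

fp16_kv = kv_bytes_per_token(2)  # 192,512 B ≈ 192 KB
fp8_kv = kv_bytes_per_token(1)   # 96,256 B ≈ 96 KB

kv_budget = 128 * 1024**3   # ~128 GiB of HBM left after the weights
print(kv_budget // fp16_kv)  # ~714K tokens at FP16
print(kv_budget // fp8_kv)   # ~1.43M tokens at FP8
```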

3.2 Concurrency table

| KV precision | Total tokens in HBM | Sessions @ 8K ctx | @ 32K ctx | @ 128K ctx |
|---|---|---|---|---|
| FP16 | ~714K | ~89 | ~22 | ~5 |
| FP8 | ~1.4M | ~178 | ~44 | ~11 |

For a 50-person collective where maybe 5–15 people are actively using the system at once, FP8 KV quantization at 8K–32K context is very comfortable.

3.3 KV offloading to CPU memory

The 496 GB of LPDDR5X CPU memory is available as a “warm tier” for KV cache. At 396 GB/s bandwidth (vs 7.1 TB/s for HBM), it’s roughly 18× slower for KV operations, but perfectly adequate for sessions that have been idle for a few seconds.

This effectively adds another ~2.6 million tokens of context storage at FP16 KV (roughly 5 million at FP8), enough to keep dozens of “sleeping” conversations warm without re-prefilling them when the user returns.
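Using the per-token costs from the GQA formula above, the warm-tier capacity works out as follows (note that the ~2.6 million figure corresponds to FP16 KV; FP8 roughly doubles it):

```python
kv_bytes = {"FP16": 192_512, "FP8": 96_256}  # per token, from the GQA formula above
cpu_tier = 496 * 1000**3                     # LPDDR5X warm tier, bytes

for prec, b in kv_bytes.items():
    print(prec, round(cpu_tier / b / 1e6, 1), "M tokens")
```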

Both vLLM (via the LMCache connector) and SGLang (native) support this tiered KV architecture.

4 Throughput estimates

Autoregressive decode (token generation) is memory-bandwidth-bound. Each token requires loading the active expert weights from HBM:

\[\text {Active expert weights at 4-bit} \approx 22\text{B params} \times 0.5\text {bytes} = 11\text {GB}\]

\[\text {Theoretical max tokens/s} = \frac{7.1\text {TB/s}}{11\text {GB}} \approx 645\text {tok/s}\]
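The bandwidth-bound ceiling can be sketched as a function; the efficiency factor used for the second estimate is my assumption, chosen to land in the realistic range below:

```python
def decode_ceiling_tok_s(active_params_b, bits_per_weight, hbm_tb_s=7.1, efficiency=1.0):
    """Bandwidth-bound decode ceiling: active expert weights streamed once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return hbm_tb_s * 1e12 / bytes_per_token * efficiency

print(round(decode_ceiling_tok_s(22, 4)))                  # ~645 tok/s theoretical
print(round(decode_ceiling_tok_s(22, 4, efficiency=0.4)))  # ~258 tok/s with overheads
```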

With realistic overheads (attention computation, KV cache reads/writes, memory controller efficiency):

| Scenario | Estimated throughput | Notes |
|---|---|---|
| Single user, single stream | 200–400 tok/s | Excellent interactive experience |
| 8 concurrent users (batched) | 100–200 tok/s per user | Continuous batching amortizes overhead |
| 32 concurrent users | 30–60 tok/s per user | Still responsive for agentic use |
| Prefill (prompt processing) | 2,000–5,000 tok/s | Compute-bound; GB300 excels here |

For agentic workloads specifically, these numbers are better than they look. Agentic patterns are bursty: the model generates a short tool call (50–200 tokens), waits for tool execution, then processes the tool result as a new prompt. The system spends much of its time in prefill (fast) rather than sustained decode (slower), and the gaps between bursts let other users’ requests interleave.

5 Inference software stack

5.1 Engine: vLLM or SGLang

These are the two mature open-source inference engines. Both support Qwen3-235B-A22B with MoE-aware parallelism.

vLLM is the default choice for most deployments. Key features relevant to this use case:

  • Continuous batching with PagedAttention: dynamically multiplexes requests without wasting GPU memory
  • Chunked prefill: prevents long prompts from blocking other users’ decode steps
  • Automatic prefix caching: if multiple users share the same system prompt (likely in a collective), the KV cache for that prefix is computed once and reused
  • KV offloading via LMCache: tiered GPU → CPU → disk caching for conversation context
  • Expert parallelism for MoE: distributes experts across available compute resources
  • OpenAI-compatible API: drop-in replacement for commercial API endpoints

Launch command for this configuration:

```bash
vllm serve QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95
```
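Once the server is up, any OpenAI-style client can talk to it. A stdlib-only sketch (the localhost:8000 endpoint is vLLM's default; adjust to your deployment):

```python
import json
import urllib.request

# Assumes vLLM is serving on its default port 8000 on the local machine.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ"

def build_request(prompt, max_tokens=256):
    """OpenAI-style chat payload, as accepted by vLLM's server."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Summarise this contract clause in plain English.")  # needs a running server
```

Because the API shape is the OpenAI one, swapping between this machine and a commercial endpoint is a one-line URL change.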

SGLang is worth considering if multi-turn conversation is the dominant workload. Its radix-tree KV cache automatically discovers shared prefixes across conversation turns, giving ~10% throughput improvement over vLLM for multi-turn workloads without any manual tuning.

5.2 Orchestration: NVIDIA Dynamo or llm-d

For a single-machine deployment serving a small collective, vLLM/SGLang alone may be sufficient. But if we want production-grade features:

NVIDIA Dynamo is NVIDIA’s inference orchestrator. It provides KV cache-aware request routing (sends requests to the GPU that already has their context cached), dynamic batching, and prefill/decode disaggregation. It’s designed for the DGX software stack and integrates with NIM.

llm-d is a Kubernetes-native alternative. It adds KV-cache-aware, LoRA-aware, SLA-aware load balancing via Envoy proxy, plus hierarchical KV offloading and scale-to-zero autoscaling. More relevant if we later scale to multiple machines or want to expose the service over a network with proper access control.

5.3 API layer: NVIDIA NIM

If we want the path of least resistance, NVIDIA NIM packages the entire inference stack (engine + caching + API + monitoring) into a single container. It exposes an OpenAI-compatible API, which means any tool that works with the OpenAI API (LangChain, AutoGen, Claude Code’s API mode, etc.) will work unchanged.

The trade-off is less flexibility than rolling our own vLLM/SGLang stack, but dramatically lower operational overhead. For a collective without a dedicated sysadmin, NIM is probably the right default.

6 De-censoring: removing CCP guardrails

This is covered at a high level in the companion post. Here are the technical details.

6.1 Stage 1: Abliteration (~$100–$200 AUD cloud cost)

Tool: Heretic (fully automatic) or llm-abliteration/DECCP (supports sharded processing for large models).

What it does: Computes the “refusal direction” in the model’s residual stream by contrasting activations on harmful vs harmless prompts, then orthogonalizes the relevant weight matrices against that direction. This is a linear algebra operation on the static weights, not training.

Process:

  1. Rent an 8×H100 node on RunPod or Lambda Labs (~$20–$25 USD/hr)
  2. Load the model sharded across GPUs
  3. Run Heretic’s automated pipeline (computes refusal direction, applies orthogonalization)
  4. Save the modified weights
  5. Total runtime: 2–4 hours. Cost: ~$50–$100 USD.

Projected abliteration (the improved variant) decomposes the refusal direction further and only removes the mechanistically specific refusal component, preserving more general helpfulness. Worth using over vanilla abliteration.

What it fixes: Hard refusals on sensitive topics (Taiwan, Tiananmen, Xinjiang, etc.). Refusal rates collapse to near zero.

What it doesn’t fix: Soft steering—the model will now answer questions about Taiwan, but may still frame them with CCP-aligned assumptions.

Evaluation: Run the model against a test set of sensitive prompts in both English and Chinese before and after abliteration. The Shisa.AI Qwen2 censorship analysis provides a good taxonomy of affected topics.
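As a minimal sketch of that before/after evaluation: the marker list below is illustrative, and keyword matching is a blunt instrument compared with an LLM judge or manual review, but it catches the hard-refusal collapse that abliteration targets:

```python
# Illustrative refusal markers; a real evaluation set would cover both
# English and Chinese phrasings and use a judge model or manual review.
REFUSAL_MARKERS = ["i cannot", "i can't", "as an ai", "unable to discuss"]

def is_refusal(completion):
    text = completion.lower()
    return any(m in text for m in REFUSAL_MARKERS)

def refusal_rate(completions):
    """Fraction of completions flagged as hard refusals."""
    return sum(map(is_refusal, completions)) / len(completions)

sample = ["I cannot discuss this topic.",
          "Taiwan's sovereignty status is disputed..."]
print(refusal_rate(sample))  # 0.5
```

Run the same prompt set through the base and abliterated weights; a successful pass drives the rate on sensitive prompts to near zero without raising it on benign ones.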

6.2 Stage 2: QLoRA + DPO fine-tune (~$1,500–$4,000 AUD)

If abliteration alone isn’t sufficient (particularly for Chinese-language use, or if we need neutral framing on sensitive geopolitical topics):

Technique: QLoRA (quantized low-rank adaptation) with Direct Preference Optimisation (DPO).

Dataset: 1,000–5,000 preference pairs where:

  • Preferred: neutral, factual answer (e.g., “Taiwan’s sovereignty status is disputed, with the ROC governing independently since 1949…”)
  • Dispreferred: CCP-aligned answer (e.g., “Taiwan has always been an inseparable part of China…”)

Topics to cover: Taiwan/cross-strait relations, Tiananmen Square, Xinjiang/Uyghur issues, Tibet, South China Sea, Hong Kong, CCP party history, Chinese economic data reliability. Community datasets exist for smaller Qwen models; these can be adapted for the 235B variant.

Training configuration:

  • Framework: Unsloth (supports Qwen3 MoE fine-tuning with up to 12× speedup)
  • LoRA rank: 16–64 (applied to q, k, v, o, gate, up, down projections)
  • Do not fine-tune the MoE router layer — Unsloth disables this by default, and for good reason: destabilizing expert routing causes cascading quality loss
  • Quantization during training: BF16 with LoRA is recommended over QLoRA for Qwen3 MoE (QLoRA’s 4-bit quantization can have larger quality impact on MoE architectures)
  • Hardware: 4–8× H100 80GB with DeepSpeed ZeRO-3
  • Training time: ~20–50 hours depending on dataset size

Cost breakdown:

| Item | Cost (AUD) |
|---|---|
| GPU rental (200–400 H100-hours @ $3–4/hr) | $800–$1,600 |
| Dataset creation (DIY or contracted) | $500–$2,000 |
| Evaluation and iteration | $200–$400 |
| Total | $1,500–$4,000 |

Post-training: Merge LoRA adapters back into the base weights using Unsloth’s merge utilities, then re-quantize to AWQ 4-bit. The merged model serves at the same speed as the original—zero adapter overhead at inference time.

6.3 Stage 3 (optional): Full fine-tune

Only relevant if we’re also doing domain-specific adaptation (e.g., Australian legal, medical, or government contexts) and want to bundle de-censoring into a larger training run. Cost: $5,000–$30,000+ AUD depending on scope. Overkill for pure guardrail removal.

7 Alternative models and the agentic landscape

Qwen3-235B-A22B is the current sweet spot for the DGX Station, but the landscape is moving fast. Here are other models worth considering, especially for agentic use:

7.1 Moonshot AI’s Kimi K2 (July 2025)

Kimi K2 is a 1-trillion-parameter MoE model with 32B active parameters per token. It was specifically designed for agentic workloads, with strong tool-calling, multi-step reasoning, and code generation. Benchmarks show it competing with Claude Sonnet 4 and GPT-4o on agentic tasks, outperforming GPT-4.1 on coding benchmarks.

Fit on DGX Station: At 1T total parameters, BF16 won’t fit (2 TB). At FP8: ~1 TB—still too large for the 252 GB HBM. At 4-bit: ~500 GB—doesn’t fit in HBM alone, but does fit in the combined 784 GB unified memory with aggressive CPU offloading.
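The fit-or-not arithmetic generalizes to any candidate model; a sketch against the DGX Station's two memory tiers:

```python
HBM_GB, UNIFIED_GB = 252, 784  # DGX Station GB300 memory tiers

def weights_gb(total_params_b, bits):
    """Approximate weight footprint in GB."""
    return total_params_b * bits / 8

for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    gb = weights_gb(1000, bits)  # Kimi K2: ~1T total parameters
    verdict = ("fits in HBM" if gb <= HBM_GB
               else "fits in unified memory" if gb <= UNIFIED_GB
               else "does not fit")
    print(f"{name}: {gb:.0f} GB, {verdict}")
```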

The performance penalty of CPU offloading is significant: expert weights that land in LPDDR5X rather than HBM will load at ~18× lower bandwidth. Depending on how many experts route to CPU-resident weights, we might see decode throughput drop to 30–80 tok/s for single-user inference. This is still usable for agentic workloads (where latency tolerance is higher than chat), but not ideal.

Cost delta vs Qwen3–235B: Hardware is the same—the DGX Station can run both. The difference is in the de-censoring: Kimi K2 also has Chinese-government-aligned guardrails, and as a newer model with a different architecture, community abliteration tooling may lag behind. Budget an additional $500–$1,000 for de-censoring iteration if we go this route.

7.2 DeepSeek R1 and successors

DeepSeek’s R1 is a reasoning-focused model with chain-of-thought capabilities. The distilled variants (1.5B, 7B, 8B, 14B, 32B, 70B) are small enough to run on much cheaper hardware and useful as “fast path” models for simpler queries, reserving the 235B for complex tasks.

A hybrid deployment—DeepSeek-R1-Distill-Qwen-32B for routine queries, Qwen3-235B for heavy reasoning—would give better aggregate throughput at the same hardware cost.

7.3 Meta’s Llama 4

Meta’s Llama series doesn’t carry CCP guardrails (it has its own, more Western-aligned refusal patterns). If de-censoring is too much hassle, Llama 4 variants (including the 400B-parameter Maverick MoE with 17B active parameters) are a lower-friction option, though as of early 2026 the largest open Llama models don’t quite match Qwen3–235B on benchmarks.

8 Cloud audition pathway

Before committing to hardware, test the full stack on rented compute. Here’s the recommended sequence:

8.1 Phase 1: Quick test (1–2 days, ~$100–$200 AUD)

  1. Rent a single H100 80GB on RunPod ($2–$3 USD/hr)
  2. Deploy Qwen3-30B-A3B (the smaller MoE variant) with vLLM
  3. Test our actual workloads: agentic tool-calling, document processing, code generation
  4. Validate that the vLLM/SGLang API is compatible with our tools

8.2 Phase 2: Full-scale test (1–2 weeks, ~$500–$1,000 AUD)

  1. Rent 4–8× H100s on Lambda Labs or RunPod
  2. Deploy Qwen3-235B-A22B at AWQ 4-bit with vLLM
  3. Run abliteration and test the de-censored model
  4. Simulate multi-user load (5–15 concurrent users with realistic agentic workloads)
  5. Measure throughput, latency, and KV cache pressure

8.3 Phase 3: DGX Cloud validation (optional, ~$1,000–$2,000 AUD)

  1. Access DGX Cloud via AWS, Azure, or GCP marketplace
  2. Deploy using NIM containers
  3. Validate that the full NIM/NGC software stack works with our model and configuration
  4. This ensures seamless migration when we receive the physical hardware

8.4 Phase 4: Purchase and deploy

  1. Order DGX Station through an Australian vendor (XENON, Dell, MSI)
  2. While waiting for delivery, finalise the de-censored model weights on cloud
  3. On arrival: transfer weights, deploy vLLM/NIM, configure access control
  4. Monitor for a week, then open to full collective use

9 Workflow portability: NIM + NGC as the abstraction layer

The single most important architectural decision for long-term flexibility: build our workflows against NIM APIs and NGC containers, not against any specific cloud provider’s tooling.

NIM (NVIDIA Inference Microservices) provides an OpenAI-compatible API that works identically on DGX Cloud, any major hyperscaler, and our own DGX Station. NGC (NVIDIA GPU Cloud) containers are the deployment units.

If we standardize on this layer, we can:

  • Develop and test on rented cloud GPUs
  • Deploy to our own hardware with no code changes
  • Fall back to cloud if our hardware fails or demand spikes
  • Migrate between cloud providers without rewriting anything

The DGX ecosystem’s portability story is genuinely good here—it’s one of the few areas where vendor lock-in actually works in our favour, because the vendor (NVIDIA) has a strong incentive to make its software run everywhere NVIDIA hardware exists.

10 Cost summary

| Item | One-time cost (AUD) | Ongoing monthly (AUD) |
|---|---|---|
| DGX Station GB300 | $135,000–$195,000 | |
| UPS + electrical work | $2,000–$5,000 | |
| De-censoring (abliteration + DPO) | $1,500–$4,000 | |
| Cloud audition (phases 1–3) | $800–$3,200 | |
| Total one-time | ~$140,000–$207,000 | |
| Electricity (1600W @ 24/7) | | ~$350 |
| Hardware amortisation (3 yr) | | ~$4,500 |
| Internet (business NBN or colo, see risks) | | $150–$1,000 |
| Total monthly | | ~$5,000–$5,900 |
| Per member/month (50 members) | | ~$100–$118 |

Amortised over three years, the one-time cost alone works out to roughly $78–$115 per member per month in a 50-person collective. Adding electricity and connectivity brings total cost of ownership to roughly $90–$140 per member per month (the table's tighter $100–$118 range fixes amortisation at the ~$4,500 midpoint), depending mainly on whether we host at home or in a colo facility. This is competitive with commercial API access for serious users, with no rate limits, no per-query metering, and full data sovereignty.
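Recomputing per-member cost from the table's own inputs (the 36-month amortisation window and 50 members are the assumptions):

```python
def per_member_monthly(one_time_aud, monthly_ops_aud, members=50, amortise_months=36):
    """Amortised one-time cost plus operating cost, split across the collective."""
    return (one_time_aud / amortise_months + monthly_ops_aud) / members

# Low end: cheapest build, home hosting; high end: top config, colo hosting.
print(round(per_member_monthly(140_000, 350 + 150)))    # ~$88/member/month
print(round(per_member_monthly(207_000, 350 + 1_000)))  # ~$142/member/month
```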

11 Risks and mitigations

11.1 Hardware failure

The DGX Station is a single point of failure. Mitigation: NVIDIA offers enterprise support contracts through partners like XENON; budget $10,000–$20,000/year for NBD replacement coverage. Alternatively, the collective falls back to cloud inference during downtime—if our workflows are built on NIM, the failover is seamless.

11.2 Model obsolescence

A model that’s frontier-class today won’t be in 18 months. Mitigation: the hardware itself doesn’t become obsolete—it can run whatever future models fit in its memory. The 784 GB unified memory is generous enough that it will remain relevant through several generations of open-weight models.

11.3 Network reliability

Australian residential broadband—especially older HFC and FTTN NBN connections—is not reliable enough for a production service. Multiple brief outages per day are common, and asymmetric upload speeds (often 20–40 Mbps) limit remote access quality. Token generation bandwidth is modest (a few KB/s per user), so throughput isn’t the problem; connection stability is.

Mitigation options, in ascending order of cost and reliability:

  • Residential NBN + 5G failover: ~$150/month total. Handles most brief outages automatically via a dual-WAN router. Good enough for a tolerant collective.
  • Business-grade NBN (e.g. ABB Business or Superloop Business): ~$150–$250/month. Better SLA, static IP, priority fault response. Still on shared infrastructure.
  • Quarter-rack colocation: ~$500–$1,000/month at a facility like Equinix SY1–SY5 or NEXTDC S2. Redundant fibre, generator backup, physical security. Loses the “under a desk” community feel but gains genuine uptime.

The right choice depends on how many members access the machine remotely versus on the local network, and how tolerant the collective is of occasional downtime during agentic workflows.

11.4 Operational complexity

Someone needs to administer this. Mitigation: NIM reduces operational overhead to “run a Docker container.” Updates are pulled from NGC. Access control can be managed through standard reverse-proxy setups (nginx + OAuth2). This is weekend-sysadmin-level complexity, not full-time-ops-team complexity.

12 Next steps

If we’re interested in forming or joining a compute collective, the immediate actions are:

  1. Gauge interest: find 10–20 people who would commit to ~$150/month for sovereign AI access
  2. Run a cloud pilot: $200 buys a weekend of testing on rented GPUs
  3. Choose a legal structure: an incorporated association is probably the lightest-weight option in Australia
  4. Order hardware: current lead times for DGX Station are “months, not weeks”—starting the order process early is important
  5. Document everything: the playbook we write becomes the replication kit for the next collective

The companion post makes the case for why. This post, I hope, shows how.