Sovereign compute for small collectives: technical implementation guide

Hardware, software, de-censoring, and the path from cloud audition to owned infrastructure

2026-03-22 — 2026-05-21

Wherein the Specifications of a Shared Inference Server Are Laid Out, Its Legal Ownership Is Settled, and the Excision of Political Guardrails From Open-Weight Models Is Treated as Routine.

AI
community project
cooperation
diy
engineering
faster pussycat
institutions
resilient tech
straya
wonk

Follow along while we work this out— danmackinlay/SOV or Sign up for updates.

Figure 1

This is the technical companion to my sovereign LLM post, which makes the institutional and geopolitical case for small collectives owning their own AI inference hardware. This post lays out the concrete implementation pathway: what to buy, what software to run, how to remove CCP guardrails from open-weight models, and what performance to expect.

There was a lot of research to make this work, so it is heavily AI-assisted. Be wary of hallucinations.

I’ll walk through a worked example targeting Alibaba’s Qwen3–235B-A22B served on an NVIDIA DGX Station GB300, since that’s the sweet spot I identified in the companion post. But most of this generalises to other models and hardware.

Sounds interesting? Get in touch.

1 The box

1.1 Specifications

Component Specification
GPU NVIDIA GB300 Grace Blackwell Ultra Superchip
GPU memory (HBM3e) 252 GB @ 7.1 TB/s bandwidth
CPU memory (LPDDR5X) 496 GB @ 396 GB/s bandwidth
Total unified memory 784 GB
AI compute ~20 PFLOPs (FP4)
CPU-GPU interconnect 900 GB/s NVLink-C2C
Networking 800 Gb/s CX8 SuperNIC
Power 1600W TDP
Form factor Desktop tower

1.2 Buying one in Australia

The DGX Station ships through NVIDIA’s partner network. Australian vendors include XENON Systems (NVIDIA Elite Partner for ANZ, handles the full DGX line and enterprise support), Dell Australia, MSI (its XpertStation WS300 is the DGX Station reference design, listed at $85,000 USD), and MMT, Australia’s NVIDIA distributor for configure-to-order options.

Expect $135,000–$195,000 AUD landed depending on configuration, plus GST, shipping, and a UPS — we do not want 1600W of AI compute on raw mains power.

1.3 Power and cooling

1600W sustained is about the draw of a large space heater. Standard Australian outlets are 10 A at 240 V (2400 W), fine on a dedicated circuit; an electrician installing a 20 A circuit (4800 W) is a routine job if we want headroom. We’ll want decent ventilation if the machine lives in a small room. At Australian retail rates (~$0.30/kWh), running 24/7 costs about ~$350 AUD/month.

2 What the box rations

Everything below is a way of spending the resources of one box, so it helps to name them first. The DGX Station has three scarce quantities, and almost every design choice trades one against another.

Fast memory — 252 GB of HBM3e. It holds the model weights and the live conversations. Weights are a fixed cost paid once; whatever is left over after the weights sets how many people can talk to the machine at the same time.

Memory bandwidth — 7.1 TB/s out of that HBM. Generating a token means reading weights out of memory, so bandwidth, not compute, decides how fast the machine writes.

Compute — about 20 PFLOPs at FP4. This mostly bites while ingesting a prompt, where the work is one big matrix multiply rather than a memory sweep.

Mixture-of-experts shrinks the compute and bandwidth bill per token. Quantization shrinks the weights so they fit in fast memory. Sparse attention shrinks the conversations so more of them fit. Each technique below stretches one of those three.

3 Models that fit

3.1 Mixture of experts

The baseline I cost throughout is Alibaba’s Qwen3–235B-A22B, a mixture-of-experts model: 128 expert feed-forward networks per layer, of which 8 fire for any given token. It carries 235B parameters in total but only puts ~22B to work per token. Total parameters set the memory footprint; active parameters set the compute and bandwidth cost. That split is what makes self-hosting tractable — we pay 235B to store the model and 22B to run it, and the result is competitive with dense models of 70–100B parameters. The attention is grouped-query: 64 query heads, 4 KV heads, 128 head dimension, 94 transformer layers. Those last numbers come back when we count KV cache.

3.2 Quantization

The weights have to fit in 252 GB of HBM, with plenty left for conversations. Quantization sets how much room is left over.

Quantization Model size in memory Fits in 252GB HBM? HBM left for KV cache Quality impact
BF16 (full) ~470 GB No N/A Baseline
FP8 ~235 GB Barely ~17 GB (unusable) Minimal
AWQ 4-bit ~124 GB Comfortable ~128 GB Modest
GPTQ Int4 ~124 GB Comfortable ~128 GB Similar to AWQ

AWQ 4-bit is the pick. It puts the model in ~124 GB and leaves ~128 GB free, and that free space is the conversation budget — the thing that decides how many people the box serves at once. FP8 looks better on paper but leaves ~17 GB, which forces constant offload to the slow tier and makes multi-user serving fall over. MoE models tolerate 4-bit well, since only 8 of 128 experts touch any token. Pre-quantized weights are on HuggingFace (e.g. QuantTrio/Qwen3–235B-A22B-Instruct-2507-AWQ), or we can quantize our own with AutoAWQ.

3.3 Sparse attention

A conversation’s context lives in the KV cache, and in a conventional transformer that cache grows linearly with length — every token of history costs memory, in the same scarce HBM as the weights, for as long as the conversation lives. Sparse attention breaks that link. DeepSeek’s V4 (April 2026, weights) uses an attention design DeepSeek calls DSA: it compresses the context before storing it, so a full million-token session costs roughly 9.6 GB of KV cache instead of the tens of GB a conventional model of the same class needs — about a 90% cut, per the DeepSeek tech report. It trims attention compute too, to roughly 27% of the previous generation’s at 1M context. For a box whose tightest constraint is long conversations, this relaxes the binding limit more than anything else; the concurrency maths below shows by how much.

4 Concurrency

Fast memory holds weights plus conversations:

fast memory  =  weights (paid once)  +  KV cache (paid per user, per token)

concurrent users  ≈  (memory left after weights)  ÷  (KV cost of one conversation)

Qwen3–235B’s grouped-query attention (4 KV heads, 128 head dimension, 94 layers) costs, per token of context:

\[\text{KV per token} = 2 \times n_\text{kv\_heads} \times d_\text{head} \times n_\text{layers} \times \text{bytes} = 2 \times 4 \times 128 \times 94 \times 2 \approx 192\text{ KB (FP16)}\]

Quantizing the KV cache to FP8 (supported by vLLM and SGLang with little quality loss) halves that to ~96 KB/token. With ~128 GB free after the 4-bit weights, the box holds about 1.4M tokens of context at once. Divide by session length to turn tokens into people:

Session length Concurrent users
8K tokens ~175
32K tokens ~44
128K tokens ~11
1M tokens ~1.4

Concurrency falls off a cliff as conversations lengthen. A 50-person collective with 5–15 active members sits comfortably at 8K–32K and falls over at 1M, where one member running a single million-token agentic session fills the box. The comfortable numbers assume short conversations — but long-context agentic work, the thing the collective most wants, drives sessions toward the bottom row, exactly where the box is tightest.

Sparse attention is the way out of that corner. On DeepSeek V4-Flash a full 1M-token session costs ~9.6 GB of KV rather than the tens of GB above, and its ~90 GB of 4-bit weights leave ~160 GB for cache. So the bottom row moves from ~1.4 users to ~16, and at 128K and below the per-session cost is small enough that KV stops being the binding constraint at all:

                        conventional model      V4-Flash + DSA
  @ 1M-token session  :     ≈ 1.4 users    →      ≈ 16 users
  @ 128K-token session:     ≈ 11 users     →      KV no longer the limit

The box does not become infinite. Sixteen million-token sessions fitting in memory is not the same as the box generating tokens fast enough for sixteen heavy users at once — DSA’s compute cut helps, but the binding constraint moves from memory to throughput rather than vanishing. And these are illustrative figures: they reuse the conventional numbers above and DeepSeek’s published 1M figure rather than measured intermediate points, which is what phase 2 of the audition is for. Treat them as order-of-magnitude.

4.1 The warm tier

The 496 GB of LPDDR5X CPU memory works as a warm tier for KV cache. At 396 GB/s it’s roughly 18× slower than HBM — fine for sessions that have been idle a few seconds. That’s room for another ~2.6M tokens (FP8 KV) of sleeping conversations the box can resume without re-prefilling them. Both vLLM via LMCache and SGLang (native) support this tiering.

5 Throughput

Decode — generating one token at a time — is bandwidth-bound: each token reads the active expert weights out of HBM.

\[\text{active expert weights at 4-bit} \approx 22\text{B} \times 0.5\text{ bytes} = 11\text{ GB}\]

\[\text{theoretical max} \approx \frac{7.1\text{ TB/s}}{11\text{ GB}} \approx 645\text{ tok/s}\]

After attention, KV traffic, and memory-controller overhead:

Scenario Estimated throughput Notes
Single user, single stream 200–400 tok/s Excellent interactive experience
8 concurrent users (batched) 100–200 tok/s per user Continuous batching amortizes overhead
32 concurrent users 30–60 tok/s per user Still responsive for agentic use
Prefill (prompt processing) 2,000–5,000 tok/s Compute-bound; GB300 excels here

For agentic workloads these read better than they look. Agentic traffic is bursty: a short tool call (50–200 tokens), a wait for the tool to run, then the result arrives as a fresh prompt. Much of the time goes to prefill (fast) rather than sustained decode (slow), and the gaps let other users’ requests interleave.

6 Serving software

6.1 Inference engine

vLLM and SGLang are the two mature open-source engines; both serve Qwen3–235B with MoE-aware parallelism. vLLM is the default. The features that matter here:

  • Continuous batching with PagedAttention — multiplexes requests without wasting GPU memory
  • Chunked prefill — stops long prompts blocking other users’ decode steps
  • Automatic prefix caching — a shared system prompt (likely in a collective) is computed once and reused
  • KV offloading via LMCache — the GPU → CPU → disk warm tier above
  • Expert parallelism for MoE
  • OpenAI-compatible API — a drop-in replacement for commercial endpoints

A launch command for this configuration:

vllm serve QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95

SGLang earns its place if multi-turn conversation dominates: its radix-tree KV cache discovers shared prefixes across turns automatically, for ~10% more throughput than vLLM on that workload with no manual tuning.

For DeepSeek V4 specifically, pin vLLM to a known-good commit, not a release tag. There are multiple reports of V4 working and then breaking across vLLM commits (NVIDIA developer forum, May 2026), which sharpens the never-run-:latest discipline rather than adding a new rule.

6.2 Orchestration

For a single box serving a small collective, vLLM or SGLang alone may be enough. If we want production-grade routing, NVIDIA Dynamo adds KV-cache-aware request routing (send a request to the GPU that already holds its context), dynamic batching, and prefill/decode disaggregation, and it’s built for the DGX stack. llm-d is the Kubernetes-native alternative, with SLA-aware load balancing, hierarchical KV offload, and scale-to-zero — more relevant once we run several machines or expose the service over a network.

6.3 NIM and portability

The path of least resistance is NVIDIA NIM, which packs the whole stack — engine, caching, API, monitoring — into one container behind an OpenAI-compatible API. Anything that speaks the OpenAI API (LangChain, AutoGen, Claude Code’s API mode) works unchanged. For a collective without a dedicated sysadmin that’s probably the right default; the trade-off is less flexibility than a hand-rolled vLLM stack.

The reason to standardise on NIM goes beyond convenience. NIM and NGC containers run identically on DGX Cloud, the major hyperscalers, and our own DGX Station. Build the workflows against that layer and we can develop on rented cloud GPUs, deploy to our own hardware with no code changes, fail back to cloud if the box dies or demand spikes, and migrate between cloud providers without rewriting anything. This is one of the few places vendor lock-in works in our favour: NVIDIA has every incentive to make its software run everywhere its hardware does.

7 Removing the guardrails

The companion post makes the case at a high level; here are the mechanics. Chinese open-weight models ship with CCP-aligned guardrails — hard refusals on Taiwan, Tiananmen, Xinjiang and the rest, plus softer framing biases. Removing them is a solved problem with two tiers of effort.

7.1 Abliteration

Tools: Heretic (fully automatic) or llm-abliteration/DECCP (sharded processing for large models).

Abliteration computes the model’s “refusal direction” in its residual stream — by contrasting activations on harmful versus harmless prompts — then orthogonalises the relevant weight matrices against it. This is linear algebra on the static weights, not training. Rent an 8×H100 node (RunPod or Lambda Labs, ~$20–25 USD/hr), load the model sharded, run the pipeline, save the weights: 2–4 hours, ~$50–100 USD. Projected abliteration only removes the mechanistically specific refusal component and preserves more general helpfulness — use it over the vanilla variant.

Refusal rates collapse to near zero. What it doesn’t fix is soft steering: the model will now answer about Taiwan but may still frame the answer with CCP-aligned assumptions. Evaluate before and after against a test set of sensitive prompts in both English and Chinese; the Shisa.AI Qwen2 censorship analysis is a good taxonomy of affected topics.

7.2 Preference fine-tuning

If abliteration alone isn’t enough — particularly for Chinese-language use, or for neutral framing on geopolitics — go to QLoRA with Direct Preference Optimization. Build 1,000–5,000 preference pairs where the preferred answer is neutral and factual and the dispreferred one is CCP-aligned, across Taiwan, Tiananmen, Xinjiang, Tibet, the South China Sea, Hong Kong, party history, and economic-data reliability. Community datasets exist for smaller Qwen models and adapt to the 235B variant.

Practical configuration:

  • Framework: Unsloth (Qwen3 MoE fine-tuning, up to 12× speedup)
  • LoRA rank 16–64 on q, k, v, o, gate, up, down projections
  • Don’t fine-tune the MoE router — Unsloth disables this by default; destabilising expert routing causes cascading quality loss
  • BF16 with LoRA is recommended over QLoRA for Qwen3 MoE (4-bit training hurts MoE architectures more)
  • 4–8× H100 80GB with DeepSpeed ZeRO-3, ~20–50 hours
Item Cost (AUD)
GPU rental (200–400 H100-hours @ $3–4/hr) $800–$1,600
Dataset creation (DIY or contracted) $500–$2,000
Evaluation and iteration $200–$400
Total $1,500–$4,000

Merge the LoRA adapters back into the base weights with Unsloth, re-quantize to AWQ 4-bit, and the merged model serves at full speed — zero adapter overhead at inference.

A full fine-tune ($5,000–$30,000+) only makes sense if we’re also doing domain adaptation (Australian legal, medical, government) and want to fold de-censoring into a larger run. Overkill for guardrail removal alone.

7.3 Newer models lag the tooling

De-censoring maturity trails the architecture. As of May 2026 the Heretic issue for V4-Flash (#310) is open and unresolved; a community-abliterated V4-Flash exists only as a GGUF for llama.cpp, so a vLLM-servable FP8 de-censored build is still work, not a download. The position itself is unchanged whichever model we pick: open weights settle whether we’re allowed to run the model, not what guardrails were baked in at the alignment stage. V4 just changes which model sits under the knife.

8 Other candidates

Qwen3–235B is the current sweet spot, but it’s not the only model that fits, and the field moves fast.

DeepSeek V4 comes in two open-weight variants. V4-Flash (284B total / 13B active) fits the box and enters the bake-off as a same-compute-class candidate, with stronger published agentic numbers, native vLLM deepseek_v4 support, and the DSA sparse attention that does so much for concurrency. Not an automatic swap for Qwen, but a serious third option. V4-Pro (1.6T / 49B active) does not fit — ~648 GB even at Q6 — so its role is the audition question: here’s the best open model money can rent, is the gap to the one we own acceptable?

Kimi K2 (1T total / 32B active) is agentic-tuned and competitive with Claude Sonnet 4 and GPT-4o on tool-calling and coding. It only fits with aggressive CPU offload (~500 GB at 4-bit, into the 784 GB unified memory), which drops single-user decode to 30–80 tok/s — usable for latency-tolerant agentic work, not ideal — and its de-censoring tooling lags.

Meta’s Llama 4 carries Western rather than CCP guardrails, so it skips the de-censoring step entirely. The largest open Llamas (including the 400B Maverick MoE, 17B active) don’t quite match Qwen3–235B on benchmarks as of early 2026, but that’s the lower-friction option if de-censoring is more hassle than we want.

A hybrid is also on the table: a small fast model (DeepSeek-R1-Distill-Qwen-32B or similar) handles routine queries, the big model is reserved for hard reasoning, and aggregate throughput rises at no extra hardware cost.

9 When smaller hardware makes sense

The DGX Station above is the sweet spot for a 25–50 person collective with serious agentic and reasoning workloads. Several adjacent shapes have different answers, and the model picks change at each tier.

9.1 The small collective (10–15 people)

A $160k box does not pay back at 10 members. The lower tier is a single H100 80 GB on a workstation chassis — Lambda Tensorbook, MSI WS-class, or a custom build with one Hopper PCIe card. Hardware spend lands at roughly $30k–$45k AUD all-in versus $135k–$195k for the DGX.

Models that fit comfortably at AWQ 4-bit on 80 GB:

Model ~Weights Indication
Qwen3.5-35B-A3B ~18 GB MoE daily driver, 3B active. Quality close enough to Qwen3-235B for most everyday use; throughput much higher because only 3B fire per token.
Hermes-4.3-36B ~20 GB Agentic-tuned Nous release, December 2025. The right pick if the workload is tool-using rather than pure reasoning.
DeepSeek-R1-Distill-Qwen-32B ~19 GB Math-and-reasoning distill. Single user can run alongside one of the above on the same H100.

Concurrency budget on an 80 GB H100: about 60 GB of KV-cache headroom, so roughly 12 users at 32K context or ~3 users at 128K — below the DGX figures, but a useful sanity check for a small group. The proposition is roughly “near-frontier quality at 25% of the hardware cost” — appropriate where sovereign compute is a hedge rather than a primary work tool.

9.2 The consumer-GPU pilot (5–10 people, or proof of concept)

Below the H100 tier, a single RTX 4090 24 GB or 5090 32 GB in a workstation is the entry point. Hardware spend ~$5k–$8k AUD all-in.

At 24 GB the practical model list shrinks to 14B-class at 4-bit:

Model ~Weights Indication
Hermes-4-14B ~8 GB Agentic, function-calling. The smaller-tier baseline.
DeepSeek-R1-0528-Qwen3-8B ~5 GB Math champion of its weight class — AIME-2024 86%, matches Qwen3-235B-thinking on math benchmarks.
Phi-4-Reasoning-Plus-14B ~8 GB Different reasoning style; useful alongside the DeepSeek distill.

The use case is a pilot — small group, exploratory workload, willingness to live with smaller models — or a permanent setup for groups whose work is dominated by short-context interactive use. Concurrency falls hard at 24 GB (~3 users at 32K), so this tier rewards short sessions and a retrieval layer over brute-force long context.

9.3 Fast-path / slow-path hybrid

The hybrid pattern mentioned in Other candidates is more useful than its placement suggests. The shape: run two models in parallel.

  • Fast path on a small GPU: Qwen3.5-35B-A3B or DeepSeek-R1-0528-Qwen3-8B for routine chat, code completion, simple tool calls. Sub-second latency, low cost per query.
  • Slow path on the big GPU: Qwen3-235B or V4-Flash for hard reasoning, long-context agentic, multi-step research.

A router — a small classifier model, or an explicit member choice via /model in the harness — sends each request to the appropriate backend. For agentic workloads where most steps are simple but a few need frontier-class reasoning, this is materially cheaper than running everything on the big model.

The pattern lets a collective grow from a single small box to a heterogeneous fleet without throwing away earlier investment — the H100 station that pilots phase 1 becomes the fast path in phase 3.

9.4 Per-member workstations as fallback

A natural shape for resilient infrastructure: each member’s own machine runs a small model as fallback when the colo endpoint is unavailable. Their harness (pi, Hermes, Aider) auto-fails over from the collective backend to a local Ollama or Osaurus endpoint. Smaller and slower, but it works on the day the cables are cut.

For Mac-equipped members, running LLMs locally on a Mac covers the per-member setup. The model picks overlap with this list — Hermes-4-14B and DeepSeek-R1-0528-Qwen3-8B are the obvious shared-tier choices.

9.5 When small is wrong

Three workloads make a single H100 painful, even with the right model picks:

  • Long-context agentic flows (>128K context routinely). KV-cache pressure builds fast on 80 GB. Either disaggregate (small for short-context, big for long) or skip the H100 tier.
  • Many simultaneous users (>15 concurrent active sessions). The DGX’s 252 GB HBM is the difference, and there is no cheap substitute.
  • Frontier-quality coding tasks. Qwen3-35B does not match Qwen3-235B on hard SWE-bench-style problems; a small box does not paper over that.

If any of these dominates the workload, the small-hardware path is a stepping stone to the DGX tier, not the destination.

10 Auditioning on rented hardware

Test the full stack on rented compute before committing to a box.

Phase 1 — quick test (1–2 days, ~$100–$200 AUD). Rent a single H100 80GB on RunPod ($2–3 USD/hr), deploy the smaller Qwen3–30B-A3B with vLLM, and check our actual workloads — agentic tool-calling, document processing, code — against the API.

Phase 2 — full-scale test (1–2 weeks, ~$500–$1,000 AUD). Rent 4–8× H100s, deploy Qwen3–235B at AWQ 4-bit, run abliteration on it, then simulate 5–15 concurrent users with realistic agentic load and measure throughput, latency, and KV pressure. This is where the order-of-magnitude figures above get replaced with numbers for our own workload.

Phase 3 — DGX Cloud validation (optional, ~$1,000–$2,000 AUD). Run the DGX Cloud / NIM stack via a hyperscaler marketplace so the migration to physical hardware is seamless.

Phase 4 — buy and deploy. Order through an Australian vendor (lead times are months, not weeks), finalise the de-censored weights on cloud while waiting, then on arrival transfer weights, stand up vLLM or NIM, configure access control, monitor for a week, and open it to the collective.

11 What it costs

Item One-time cost (AUD) Ongoing monthly (AUD)
DGX Station GB300 $135,000–$195,000
UPS + electrical work $2,000–$5,000
De-censoring (abliteration + DPO) $1,500–$4,000
Cloud audition (phases 1–3) $800–$3,200
Total one-time ~$140,000–$207,000
Electricity (1600W @ 24/7) ~$350
Hardware amortization (3yr) ~$4,500
Internet (business NBN or colo, see risks) $150–$1,000
Total monthly ~$5,000–$5,900
Per member/month (50 members) ~$100–$118

Amortised over three years, the one-time cost adds roughly $55–$75/month per member in a 50-person collective, for a total cost of ownership of $155–$193/month per member depending mostly on whether we host at home or in a colo. That’s competitive with commercial API access for serious users — with no rate limits, no per-query metering, and full data sovereignty.

12 Risks

12.1 Hardware failure

One box is a single point of failure. NVIDIA enterprise support through partners like XENON buys next-business-day replacement for $10,000–$20,000/year; failing that, a collective on NIM can fail back to cloud inference seamlessly during downtime. In a serious geopolitical crisis the supply chain might be disrupted indefinitely and support might not deliver, so a spare is worth buying.

12.2 Model obsolescence

A frontier-class model today won’t be in 18 months. The hardware doesn’t share that fate — it runs whatever future models fit in its memory, and 784 GB of unified memory is generous enough to stay relevant across several generations of open weights.

12.3 Network

Australian residential broadband — older HFC and FTTN NBN especially — isn’t reliable enough for a production service: multiple brief outages a day, asymmetric upload (often 20–40 Mbps). Token generation needs only a few KB/s per user, so the problem is connection stability, not throughput. In ascending order of cost and reliability:

  • Residential NBN + 5G failover (~$150/month): a dual-WAN router rides out most brief outages. Good enough for a tolerant collective.
  • Business-grade NBN (ABB Business, Superloop Business, ~$150–$250/month): better SLA, static IP, priority faults, still on shared infrastructure.
  • Quarter-rack colocation (~$500–$1,000/month at Equinix SY1–SY5 or NEXTDC S2): redundant fibre, generator backup, physical security. Loses the under-a-desk feel, gains dependable uptime.

The right choice depends on how many members are remote versus local, and how tolerant the collective is of downtime mid-agentic-workflow.

12.4 Operations

Someone has to administer this. NIM keeps it at “run a Docker container”, updates pull from NGC, and access control is a standard reverse proxy (nginx + OAuth2). Weekend-sysadmin complexity, not a full-time ops team.

14 Next steps

If we’re interested in forming or joining a compute collective, the immediate actions are:

  1. Gauge interest: find 10–20 people who would commit to ~$150/month for sovereign AI access
  2. Run a cloud pilot: $200 buys a weekend of testing on rented GPUs
  3. Choose a legal structure: start as an incorporated association, convert to a cooperative if the model works (see legal structure section above)
  4. Order hardware: current lead times for DGX Station are “months, not weeks” — starting the order process early is important
  5. Document everything: the playbook we write becomes the replication kit for the next collective

The companion post makes the case for why. This post, I hope, makes the case for how.