Fine-tuning danbot
2026-05-26 — 2026-05-30
In Which Before-and-After Prose Pairs Are Harvested From Manual Edits, a LoRA Adapter Is Trained on Qwen3 8B via Together AI, and a Four-Signal Evaluation Harness Is Assembled.
Claude is a bad ghostwriter for me. Not unintelligible, not slow, not even particularly inaccurate — just wrong in the way LinkedIn posts are wrong: smooth, hedging, structurally signposted, breathlessly enthusiastic in flat places. I have tried the standard ladder of prompt-side fixes: a hand-tuned /dan-voice skill with a banned-words list and a structural slop catalogue, a Vale ruleset as a mechanical safety net, in-context examples of my own prose stuffed into the system prompt as few-shot exemplars, multiple rounds of “rewrite this in Dan’s voice” against a frontier model. The output comes out marginally more Dan-shaped each iteration and still smells like a press release. There is a floor on how much taste we can install via prompt engineering, and currently I am still trapped, according to this metaphor, in the taste basement. No matter what I do, I cannot dissuade Claude from poorly chosen mixed metaphors like “standard ladder of prompt-side fixes” or “floor on how much taste we can install via prompt engineering.” The text pudding is laced with “dan-shaped”, and “load-bearing” tuppences.
Look, I’m not claiming my style is perfect. But I am claiming that
- it is mine, and
- when I write something on Monday in my own style, I can read it back on Friday and discern my own intent without excessive strain.
Both of these are untrue of Claude’s output.
If I could train a small specialist model that bakes the style into weights rather than pleading for it via prompt, every AI draft I touch could pass through it and come out sounding like me, or at least, hopefully not substantially worse despite having been interfered with by an LLM.
So! Fine-tuning danbot!
My policy on this blog is that slop should be disclosed, not concealed — see the three robots marker at the top of this post? That means it is slop af.
My policy is not to have no AI. Sometimes it is useful to sketch stuff out with AI. It is not useful to pretend that AI is a human, or specifically that it is me. That does not mean that it should be horrible to read.
1 Goals
- Less onerous clean-up of AI edits
- More comprehensible prose
2 Non-goals
- Full dan impersonation
- Evading slop detectors
3 Methods
The plan is to train a small LoRA adapter on top of a mid-sized open-weights model and treat it as a specialist prose styler: feed it AI-flavoured draft text, get back something closer to my own voice. Four steps:
- Collect real edit pairs. Every time I rewrite an AI draft, capture the (before, after) diff at paragraph level via uv run ai-style-log. Already shipped; see Phase 1 status below.
- Synthesise more pairs. Ask a matrix of models to paraphrase paragraphs of my own prose; the paraphrase is the slop input, the original is the target. The Phase 1 real pairs are the ground truth (they ride my actual editing distribution), so they get duplicated up if they look meaningfully cleaner than the synthetics on a spot-check; the synthetics are the high-volume bulk that gives the LoRA something to fit.
- Train a LoRA. Run LoRA SFT on Qwen3 8B in a managed cloud pipeline (~$10 per training run for the 8B, cheap enough to iterate freely).
- Wrap it in a CLI. uv run ai-style <file> is an explicit invocation: not a hook, not part of ai-preen. It runs against the served adapter, with the option to pull weights down and serve locally via MLX on the Mac if I ever want to escape the cloud.
What could go wrong: the 8B might not have the headroom to learn the style without losing general fluency (then we try the 70B), the synthetic paraphrase slop might not match the distribution of real Claude output I hit in practice (then we mix in more real pairs from the logger), or fine-tuning might just be a dead end and the right answer is a stronger prompt against a frontier model. In the last case we still have a labelled corpus of my own prose that is useful for other things.
3.1 Turnkey training/serving
Two obvious options for cheap LoRA on open weights are Together and Fireworks. At the sizes I care about, training prices are basically interchangeable: Together is $0.48/1M training tokens for ≤16B LoRA SFT, Fireworks is $0.50/1M; at 70B it is $2.90 vs $3.00. The picture diverges further up the catalogue — Fireworks scales roughly linearly while Together’s “specialised” tier for the big base models (DeepSeek, Kimi K2, GLM-5) jumps to $10–$40/1M with a minimum charge. For Qwen3 8B, either is fine.
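The per-token prices make the cost of a run simple arithmetic. Here is a minimal sketch, assuming billing is just tokens seen during training times the quoted rate. Minimum charges and validation tokens are not modelled, and the corpus size in the example is a hypothetical, not a measurement.

```python
# Quoted LoRA SFT prices in USD per 1M training tokens (figures from the text above).
PRICE_PER_M = {
    ("together", "<=16B"): 0.48,
    ("fireworks", "<=16B"): 0.50,
    ("together", "70B"): 2.90,
    ("fireworks", "70B"): 3.00,
}


def training_cost(provider: str, tier: str, dataset_tokens: int, epochs: int = 1) -> float:
    """Estimated USD cost: tokens seen during training times the per-token rate."""
    billed_tokens = dataset_tokens * epochs
    return billed_tokens / 1_000_000 * PRICE_PER_M[(provider, tier)]


if __name__ == "__main__":
    # Hypothetical corpus: 4,000 pairs at ~400 tokens per pair, 3 epochs.
    tokens = 4_000 * 400
    for provider in ("together", "fireworks"):
        print(provider, round(training_cost(provider, "<=16B", tokens, epochs=3), 2))
```

At these sizes the gap between providers is cents per run, which is why the choice gets made on portability and serving rather than training price.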
Together wins on adapter portability. Their docs publish a GET /finetune/{id}/download endpoint that hands back the LoRA adapter as a file. Fireworks is built around the assumption that you serve the fine-tune on Fireworks: their training docs call dedicated deployment “the only supported method for serving fine-tuned models,” and you can apparently get weights out by emailing support, but that is not a workflow I want to be in for a personal-use adapter.
Fireworks wins on serving cost. Together hosts a fine-tune only as a dedicated endpoint at $6.49/hr for an H100, which is fine for a production workload but about $4,673/month (24 hours × 30 days) if I forget to stop it for a personal CLI I use once a day. Their own docs say “stop the endpoint when you’re not using it.” Fireworks dedicated deployments scale to zero via --min-replica-count 0 and a --scale-to-zero-window (default 1h, minimum 5m). The first request after idle returns 503 DEPLOYMENT_SCALING_UP while the deployment wakes up; their docs ship retry-with-backoff snippets in Python, JS and curl. For a personal CLI that fires a handful of times a day, scale-to-zero is the right economic shape and Together’s hourly-billed-while-idle is the wrong one.
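Both halves of that trade-off fit in a few lines. This is a sketch, not Fireworks' own snippet: ScalingUp is a hypothetical exception that the caller's request function raises when it sees the 503, and the retry count and delays are assumptions.

```python
import time

H100_HOURLY_USD = 6.49  # Together dedicated-endpoint price quoted above


def monthly_cost(hours_up_per_day: float, days: int = 30) -> float:
    """USD per month for an hourly-billed endpoint that is up this many hours a day."""
    return H100_HOURLY_USD * hours_up_per_day * days


class ScalingUp(Exception):
    """Hypothetical: raised by the request function on a 503 while the deployment wakes."""


def call_with_backoff(send, max_tries=6, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops raising ScalingUp, doubling the wait each time."""
    for attempt in range(max_tries):
        try:
            return send()
        except ScalingUp:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

monthly_cost(24) is the forgot-to-stop-it figure. The sleep parameter is injectable only so the retry logic can be tested without waiting.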
So the move is to train on Together, download the adapter, upload it to Fireworks, and serve from Fireworks. Fireworks accepts externally-trained LoRA adapter uploads (see their “importing fine-tuned models” docs), which is the bridge. Best of both: Together’s exportable training output plus Fireworks’ idle-aware hosting.
Caveat on Together: their Zero Data Retention is opt-in rather than default, so the org-level toggle has to be set on day one or the training data is fair game for retention. Not a deal-breaker for my own prose, but worth knowing.
3.2 Where Unsloth fits
Unsloth is the obvious DIY alternative — a Python framework around hand-tuned CUDA kernels (the headline “2× faster training with 70% less VRAM”), plus a newer local UI (Unsloth Studio) wrapping it. For a one-shot 8B LoRA the ~$10 saved on Together isn’t worth a day of pipeline-learning, and the Apple Silicon story is still landing — issue #685 has 600+ reactions and is still open, and the docs page contradicts itself between “macOS training is ALL supported” and Apple Silicon under “Coming next.” Where it would actually win is the second round: once the dataset is fixed and I want twenty hyperparameter runs, the marginal cost on a rented GPU approaches free while Together’s per-run pricing starts to bite. On the shortlist for that.
3.3 A note on Tinker
Tinker (Thinking Machines’ managed-training API, launched October 2025) is a comparable alternative to Together — Qwen-heavy catalogue, ~price-parity (~$0.40/1M training tokens at 8B vs Together’s $0.48), adapter download supported via REST. Catalogue covers every base I’d actually want to upgrade to from my Mac local-LLM picks (Qwen3.6-35B-A3B, Nemotron-Cascade-2, Qwen3.5-122B-A10B). I’m defaulting to Together for v1 because it is self-serve and the ecosystem is more mature. Where Tinker actually wins is later: its API exposes forward_backward, sample, and optim_step as low-level primitives, which is the shape needed for RL or DPO loops if I ever preference-tune the styler against a multi-signal reward. Together’s higher-level SFT/DPO API doesn’t expose that control.
3.4 Pangram and the multi-signal eval
Pangram is an AI-text-detector — a transformer-based supervised classifier trained on ~1M human/AI documents with hard-negative mining and active learning (Emi and Spero 2024), not a perplexity heuristic. The strongest independent evidence is (Jabarian and Imas 2025), a Chicago Booth working paper benchmarking it against OriginalityAI, GPTZero, and RoBERTa: Pangram is “the only detector that meets a stringent policy cap (FPR ≤ 0.005) without compromising the ability to accurately detect AI text.” Its strong regime is fully AI-generated single passages; its weak regime is lightly-humanised long-form, exactly where styler output sits (cf. (Soto, Chen, and Andrews 2025) on fingerprint recovery at multi-document scale).
Before paying for it, the open-weights options deserve an audition. Binoculars (cross-perplexity heuristic, MIT, ~70–90% AUROC depending on regime), Ghostbuster (Berkeley, perplexity-feature classifier), RADAR (adversarially-trained, Apache-2.0), and OpenAI’s deprecated RoBERTa-base detector all run for free. The test corpus is small: ~10 paragraphs of styler input, ~10 of styler output, ~10 of pure-Dan prose. If any of them tracks Pangram’s judgements well enough on single-paragraph lightly-humanised text — the regime I care about — the paid spend is wasted and the open-weights model goes straight into the eval harness instead. My honest prior is that none of them will be good enough at this regime, but the audition is one afternoon’s work.
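One way to score the audition is rank agreement: does a free detector order the ~30 paragraphs the same way Pangram does? The sketch below is plain-Python Spearman correlation. Using Spearman as the agreement measure is my assumption here, not something any of the detectors prescribe, and the score lists would come from wherever each detector's output gets logged.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)  # undefined (division by zero) if either list is constant
```

With only ~30 paragraphs the estimate is noisy, so this is a go/no-go signal for the afternoon's audition rather than a benchmark.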
If Pangram does win the audition, the move that makes it cheap is distillation. Spend ~$50 once labelling ~10,000 paragraphs from my corpus plus held-out AI text, then train a small local classifier on those labels (DistilBERT-class, <100ms inference). The local classifier becomes a free reward signal callable from anywhere in the pipeline — eval, inference-time best-of-N candidate selection, possibly a future DPO loop. One ~$50 spend, no more API calls.
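Best-of-N selection against a local classifier is about as simple as reward signals get. In this sketch toy_ai_score is a stand-in I made up so the example runs. The real scorer would be the distilled classifier, and its interface (text in, AI-likelihood out) is an assumption.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], ai_score: Callable[[str], float]) -> str:
    """Return the candidate the scorer rates least AI-like (lowest score)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=ai_score)


# Stand-in scorer for illustration only: counts a few catalogue phrases.
SLOP_MARKERS = ("it's worth noting", "delve", "tapestry")


def toy_ai_score(text: str) -> float:
    lowered = text.lower()
    return float(sum(lowered.count(marker) for marker in SLOP_MARKERS))
```

The styler samples N rewrites of a paragraph and best_of_n keeps one, which is only affordable because the scorer is local and costs nothing per call.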
One thing this isn’t: a tool for evading slop detection. My policy on this blog (see Non-goals) is that AI involvement should be disclosed, not concealed. The reason Pangram is useful here is that the patterns it trips on — uniform sentence rhythm, hedging, signposting, generic vocabulary, structural tics — are almost exactly the patterns my slop catalogue already targets. So a styler that learns to lower its Pangram score has, as a side effect, learned to write less like a press release. The standard reward-hacking worry — model learns to evade the detector without becoming better prose — would require constructing a “passes-Pangram-but-still-bad” output, which turns out to be hard. The score going down is a consequence of writing more like Dan, not the target.
The full eval harness has four signals. Vale catches mechanical slop (banned words, indefinite “you”, -ize spelling); LLM-as-judge catches “is this Dan-shaped” at the voice level; Pangram (or whichever classifier wins the audition above) catches “would an external detector flag this” at the statistical level; my eyeball is the veto channel. Around $80 in setup costs and ~$1 per run.
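The four signals combine as three thresholds plus a veto. This is a sketch of the shape only: the field names, score ranges, and threshold values are all placeholders I picked for illustration, and each number would come from the corresponding tool's output.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    vale_alerts: int       # count of mechanical-slop alerts from Vale
    judge_score: float     # LLM-as-judge "Dan-shaped" score in [0, 1], higher is better
    detector_score: float  # AI-likelihood in [0, 1] from the detector, lower is better
    eyeball_veto: bool     # True if I read it and reject it


def passes(r: EvalResult, max_alerts=0, min_judge=0.7, max_detector=0.5) -> bool:
    """A draft passes only if no signal objects; the eyeball veto overrides everything."""
    if r.eyeball_veto:
        return False
    return (
        r.vale_alerts <= max_alerts
        and r.judge_score >= min_judge
        and r.detector_score <= max_detector
    )
```

Keeping the signals as separate fields rather than one blended score means a regression shows up against the signal that caught it.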
The interesting design choice is the training data. We have around 1.6M words of my own prose already in the repo, sitting in automation: 0 files — that is the target side of the pairs, and it is fixed. The slop side has to be manufactured: take a sample of my paragraphs, ask a model to paraphrase each (“rewrite this passage to be clearer and more polished”), and use the natural AI-voiced output as the input to map back from. Sample on the order of 3–5k pairs at the paragraph level — enough to nudge the model without obliterating an 8B’s general competence. A smaller secondary set is generated by explicitly asking for the worst slop patterns from my own slop catalogue; that covers the long-tail mannerisms the paraphrase pipeline tends to underrepresent.
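Manufacturing the slop side is a small loop. In this sketch paraphrase_fn is a hypothetical callable that wraps whatever model API does the paraphrasing. The record field names and the minimum-length filter are also assumptions.

```python
import json
import random

PARAPHRASE_PROMPT = "Rewrite this passage to be clearer and more polished:\n\n{paragraph}"


def sample_paragraphs(documents, k, min_words=30, seed=0):
    """Pick up to k paragraphs (blank-line separated) with at least min_words words."""
    paragraphs = [p.strip() for doc in documents for p in doc.split("\n\n")]
    eligible = [p for p in paragraphs if len(p.split()) >= min_words]
    rng = random.Random(seed)
    return rng.sample(eligible, min(k, len(eligible)))


def build_pairs(paragraphs, paraphrase_fn):
    """The AI paraphrase becomes the input; my original paragraph is the target."""
    return [
        {"input": paraphrase_fn(PARAPHRASE_PROMPT.format(paragraph=p)), "target": p}
        for p in paragraphs
    ]


def to_jsonl(pairs):
    """Serialise the pairs as one JSON object per line."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)
```

The direction of the mapping is the whole trick: the model is asked to make my prose worse, and the LoRA is trained to run that in reverse.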
3.5 Whose slop?
The obvious version of this is a de-Claude-er: train on Claude’s paraphrases, learn to undo Claude. But that bakes in an assumption I don’t actually want, and thinking it through changed the design.
Slop comes in two layers. There’s a shared core — throat-clearing, “it’s worth noting”, the rule-of-three, “not only X but Y”, signposting, reflexive hedging, bullet sprawl — that every RLHF’d model produces, because they’re trained on overlapping data toward the same inoffensive-helpful target. On top sits a thicker-than-I-initially-thought model-specific layer: GPT’s “delve” and “tapestry”, Claude’s particular hedging cadence, and so on. (Attar et al. 2026) put numbers on this with 284 interpretable linguistic features across 27 LLMs and 10 text domains: most signals turn out to be model- or domain-specific, only lexical richness generalises robustly across both (removing it drops classifier performance by up to 27.45% on XSum out-of-distribution). So most of the slop signal lives in features that vary across models — which strengthens the case for multi-source training rather than weakens it.
The right framing is then a Dan-ifier, not a de-Claude-er: rewrite the input as I’d write it, whatever the source. (Soto, Chen, and Andrews 2025) is the literature’s direct backing for this — generic “evade an AI detector” optimisation fails to erase underlying stylistic fingerprints (style-based detectors stay at 95–97% AUROC even after sophisticated evasion attacks), but the equivalent attack that also mimics a specific human author’s style passes style-based detection in the single-document regime. Target-author training is where the leverage lives; generic anti-AI training doesn’t.
Two wrinkles push me further from single-source training.
First, the styler is itself an open model — Qwen3 8B with a LoRA on top. At generation time its instincts are open-model instincts, so the tics it’s most likely to emit are open-model tics. If I only ever train it to undo Claude, it never learns to clean up after its own family. (Antoun, Sagot, and Seddah 2023) runs the direct test (train a classifier on text from one LLM, evaluate on another) and finds a particularly sharp chat-tuned-vs-base split: classifiers trained on chat-model output struggle on base-model output and vice versa. The styler sits at the chat-tuned end, so its training set needs to include open-model paraphrases (Qwen, Hermes, the models I run locally anyway) precisely so it suppresses the patterns it’s most prone to produce. The strongest version: generate some of the slop by running the base Qwen3 8B over my skeletal notes, which is the most on-distribution “styler’s own slop” there is.
Second, “Claude” isn’t one voice. I use Haiku, Sonnet, and Opus at different points, across revisions, and their slop profiles differ — Haiku is terser, Opus more floridly structured, and each API bump drifts a little. Pinning the training data to one model at one revision would teach the styler a narrower target than the one I actually feed it.
So the slop generator is a matrix, not a single model: mostly Claude (across Haiku/Sonnet/Opus, since that’s my real distribution), plus a deliberate minority of GPT, Gemini, and open-model paraphrases. The target stays 100% me throughout. And the eval holds out a non-Claude slice on purpose — if the styler only de-slops Claude and chokes on a pasted GPT draft, that’s a measurable failure I want to catch before I’m relying on it. This isn’t a novel design choice: (Paneru 2026) build a 25,140-pair AI→human corpus and fine-tune BART and Mistral-7B (QLoRA) for an adjacent purpose, using two generator models rather than one with the verbatim reasoning that “using two generators rather than one reduces the risk that a trained humanizer learns to undo one model’s idiosyncratic habits rather than AI-style writing more broadly,” and span multiple human-target domains for the same reason. They don’t run the single-source vs. multi-source ablation directly, so the quantitative cost of single-source remains open — that’s the small research artefact that falls out of the cross-source eval slice anyway. Their other useful finding I’m stealing for the eval harness: distinguish marker shift magnitude from marker shift accuracy — a humaniser that overshoots the target distribution looks superficially impressive on aggregate scores while landing in the wrong place.
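The matrix and the held-out slice can be sketched as a weighted draw plus a split stratified by source family. The weights and generator labels below are placeholders rather than a measured usage distribution, and the 10% eval fraction is an assumption.

```python
import random

# Hypothetical mixing weights: mostly Claude, a deliberate minority of everything else.
GENERATOR_WEIGHTS = {
    "claude-haiku": 0.25, "claude-sonnet": 0.25, "claude-opus": 0.20,
    "gpt": 0.10, "gemini": 0.10, "open-model": 0.10,
}


def assign_generators(paragraph_ids, weights=GENERATOR_WEIGHTS, seed=0):
    """Pick one slop generator per paragraph, in proportion to the weights."""
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=len(paragraph_ids))
    return dict(zip(paragraph_ids, picks))


def split_eval(assignments, eval_fraction=0.1, seed=0):
    """Hold out a slice of each family so the non-Claude eval set is never empty."""
    rng = random.Random(seed)
    split = {"train": [], "eval_claude": [], "eval_other": []}
    for is_claude, key in ((True, "eval_claude"), (False, "eval_other")):
        group = sorted(i for i, g in assignments.items() if g.startswith("claude") == is_claude)
        rng.shuffle(group)
        n_eval = max(1, int(len(group) * eval_fraction)) if group else 0
        split[key] = group[:n_eval]
        split["train"] += group[n_eval:]
    return split
```

Scoring eval_claude and eval_other separately is what surfaces the chokes-on-a-GPT-draft failure, and it doubles as a rough cross-source comparison.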
Where we are right now: Phase 1 has shipped. There is a manual pair-logger (uv run ai-style-log) wired in such that every time I rewrite an AI draft into my own voice, the (before, after) diff is captured at paragraph level into a JSONL file. So even before we run a single training step, the corpus accumulates with each notebook I clean up. To make the corpus more representative, I have also stripped the automatic slop-removal pass out of the /dan-voice skill: drafts Claude generates in my voice now leak realistic AI tics, which is exactly what we want as training input. Manual cleanup is now via the separate /slop-hunter skill, invoked when I want a clean draft rather than a training-data candidate.
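Paragraph-level capture can be sketched with the standard library's difflib. This is an illustration of the idea and not the actual ai-style-log implementation: the record fields, and the choice to keep only rewritten paragraphs and ignore pure insertions and deletions, are assumptions.

```python
import difflib
import json


def split_paragraphs(text):
    """Blank-line-separated paragraphs, with empty ones dropped."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def changed_pairs(before_text, after_text):
    """Yield (before, after) for each run of paragraphs that was rewritten."""
    before, after = split_paragraphs(before_text), split_paragraphs(after_text)
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # skip unchanged, purely inserted, and purely deleted runs
            yield "\n\n".join(before[i1:i2]), "\n\n".join(after[j1:j2])


def append_pairs(path, before_text, after_text, source="unknown"):
    """Append one JSON object per rewritten span to a JSONL file."""
    with open(path, "a", encoding="utf-8") as fh:
        for b, a in changed_pairs(before_text, after_text):
            record = {"before": b, "after": a, "source": source}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Diffing whole paragraphs rather than characters keeps each record a self-contained training example, at the cost of logging the full paragraph for a one-word edit.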
4 Wiring it into VS Code
Typing uv run ai-style-log open notebook/foo.qmd over and over is friction. VS Code already knows which file is currently focused; we can lean on that via its task system, which substitutes the editor’s path into a shell command at run time.
Drop a .vscode/tasks.json in the workspace:
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "ai-style-log: open current file",
      "type": "shell",
      "command": "uv run ai-style-log open '${relativeFile}'",
      "presentation": { "reveal": "always", "panel": "shared", "clear": true },
      "problemMatcher": []
    },
    {
      "label": "ai-style-log: save current file",
      "type": "shell",
      "command": "uv run ai-style-log save '${relativeFile}'",
      "presentation": { "reveal": "always", "panel": "shared", "clear": true },
      "problemMatcher": []
    },
    {
      "label": "ai-style-log: save --keep-open current file",
      "type": "shell",
      "command": "uv run ai-style-log save --keep-open '${relativeFile}'",
      "presentation": { "reveal": "always", "panel": "shared", "clear": true },
      "problemMatcher": []
    },
    {
      "label": "ai-style-log: drop current file",
      "type": "shell",
      "command": "uv run ai-style-log drop '${relativeFile}'",
      "presentation": { "reveal": "always", "panel": "shared", "clear": true },
      "problemMatcher": []
    },
    {
      "label": "ai-style-log: list",
      "type": "shell",
      "command": "uv run ai-style-log list",
      "presentation": { "reveal": "always", "panel": "shared", "clear": true },
      "problemMatcher": []
    }
  ]
}
Then in (user-level) keybindings.json:
[
  { "key": "cmd+k cmd+o", "command": "workbench.action.tasks.runTask",
    "args": "ai-style-log: open current file"
  },
  { "key": "cmd+k cmd+s", "command": "workbench.action.tasks.runTask",
    "args": "ai-style-log: save current file"
  },
  { "key": "cmd+k cmd+i", "command": "workbench.action.tasks.runTask",
    "args": "ai-style-log: save --keep-open current file"
  },
  { "key": "cmd+k cmd+d", "command": "workbench.action.tasks.runTask",
    "args": "ai-style-log: drop current file"
  },
  { "key": "cmd+k cmd+l", "command": "workbench.action.tasks.runTask",
    "args": "ai-style-log: list"
  }
]
${relativeFile} is the editor’s path relative to the workspace root. Cmd+K is VS Code’s chord prefix, already used for Cmd+K Z and similar, so chord-style bindings do not collide with single-key editing shortcuts. They do shadow a few built-in chords (Cmd+K Cmd+S normally opens the Keyboard Shortcuts editor); user-level bindings take precedence over those defaults. Mnemonics: O for open, S for save, I for interim (keep-open), D for drop, L for list. We could scope these using a clause like "when": "resourceLangId == 'quarto' || resourceLangId == 'markdown'". This would restrict them to qmd or markdown buffers so I cannot accidentally open a session on a Python file. But I like living dangerously.
The workflow becomes: edit a draft, Cmd+K Cmd+O to begin a session, edit more, Cmd+K Cmd+S to save and close. No copying paths, no terminal context switch.
5 It all came to a head
Why am I doing this? Because I recently had the following conversation with Claude:
This document is full over overwrought, rambling, almost schizophrenically-dense confusing prose. I need you to help me tidy it up. I will show you one example of how the text started out (“original”) and how I tidied it up (“tidied”)
original:
The one thing maths rewards that ordinary agentic work does not is what makes the cloud earn its keep. A single solve is irreducibly sequential — each code block depends on the last, and no hardware shortens that chain. The parallelism is across samples and problems: the maj@k draws are independent, a problem set’s entries are independent, and a prover’s Pass@32 is thirty-two independent chains. That is the axis to fan out along, and the binding constraint is almost never the GPU. The three roles parallelize unevenly: the model server already batches many chains against one card (vLLM, SGLang), the orchestrator is just \(k\) async loops, and the laggard is the executor — a hand-rolled loop holds one kernel, a bare lean-repl compiles one proof at a time. So the question a workflow answers is not “how big a GPU” but “how do we run many executors cheaply”.
tidied:
One weird trick to make provers go better is to sample several independent attempts at solving the same problem, and choosing the most popular solution. This is the so called maj@k-trick. Provers have an equivalent one parallel trick, called Pass@k. Either way you can run a lot of these fuckers at once. The LLM tokens are delivered over the network and as such are parallel. So our executor handles \(k\) conversations with the model. The executors may as well be run in parallel too, ideally on \(k\) different machines running whatever tool is needed for that chain.
Do you see what I mean? The first one was full of baffling unclarity, introducing things in needlessly complicated, even incoherent ways, and ultimately after reading it I felt much stupider than before. The second version omits needless bullshit and communicates, in context, what the reader needs to understand. Do you think you can go through the notebook section-by-section and edit each bit so it sound less batshit raving insane? This is a long document, and you do have a tendency for, let us say, prolixity, so, hmm don’t worry about matching my voice or whatever right now, forget all that shit. Just do your best to turn this ululating turd-burger into something humans can read and become elucidated thereby. Delete info that doesn’t help them. Drop useless crap. Don’t forget they can look up things in the attached git repo. Here is not a time for falling in the info latrine while info dumping. Rather, it is a time to stop, take a breath, and consider what we need to say.
Do you think you can do that?
The answer was ofc that Claude thought he could do that. However, Claude managed to stay tidy for about three paragraphs before descending once again into shoggothy litany.
