AI agents, applied

Stack vocabulary, harnesses, MCP, and the products that wrap them

2025-02-02 — 2026-06-24

Wherein a Layered Vocabulary for AI Agent Infrastructure Is Established, Harnesses Are Compared as to Minimalism and Extensibility, and the Demands of Long-Running Gateway Processes Are Surveyed

AI safety
computers are awful together
faster pussycat
language
machine learning
neural nets
NLP
premature optimization
technology
UI
Figure 1

Everyone is using agentic AI now. How do I do that? Should I do that? What is the least worst way to extract value from The Machine without divulging all my secrets to the Man? This notebook collects agent-generic knowledge; later I put it to work running LLMs locally on a Mac and with specialized coding and mathematics agents.

As is appropriate for the subject, this page is slop — albeit slop distilled from my own notes while I installed things and tried to understand how they worked.

1 Vocabulary

1.1 The stack

Model
the weights themselves — Qwen, DeepSeek, mxbai-embed, Claude (closed weights), GPT (closed weights), etc. Distributed as .safetensors, typically from Hugging Face for open-weights models.
Runtime / inference engine
the code that runs the matmuls — llama.cpp, MLX, vLLM, SGLang, mlx-lm, antirez’s ds4 for single-model specialised cases. Where the compute happens.
Server / daemon
a long-lived process that wraps the runtime in an HTTP endpoint (almost always OpenAI- or Anthropic-compatible) — ollama serve, llama-server, mlx_lm.server, Osaurus, ds4-server, vLLM, Unsloth Studio’s local endpoint, hosted endpoints at Anthropic / OpenAI / Google. Stateless from the application’s point of view.
Harness / agent loop
the orchestration layer over the server — manages conversation state, tool calls, system prompts, multi-turn agent loops. Not necessarily obvious.
Frontend / chat client
the human-facing surface — a desktop chat window, a text-mode UI, a multi-channel messaging bridge, a code-editor plugin, a web UI.
Quantization format
how the weights are stored with reduced precision on disk for local execution — GGUF for llama.cpp, MLX safetensors for MLX, JANG for mavericks.

The pieces nearest the users (server / harness / frontend) do not care so much about the details of how the compute happens, and are generally less coupled. The compute parts (model / quantization / runtime) can be tightly coupled.

Many turnkey products are vertical bundles across several layers — Osaurus is frontend + harness + server + runtime in one, Ollama is server + runtime, Unsloth Studio is runtime + server + fine-tuning UI, Claude Desktop is frontend + harness pointed at Anthropic’s hosted server.

1.2 What a harness does

The harness layer takes a model server (something OpenAI- or Anthropic-API-like, typically) and gives us agentic behaviour on top of it. Several affordances are used in modern agents:

  • Conversation state. Maintain a thread of messages across turns; decide when to compact older turns to fit the context window.
  • Tool calling. Expose tools (read file, run shell, query database, fetch URL) to the model in its prompt; parse tool-call responses out of the model output; execute them; feed results back as new turns.
  • System prompt management. Inject project-level instructions (AGENTS.md, CLAUDE.md, SOUL.md), per-session overrides, and skill descriptions before each model call.
  • Multi-turn loops. Run the model in a loop until it stops asking to run tools, with bounded iterations and error handling.

The differences between harnesses come down to design choices on top of those primitives:

  • What’s primitive vs what’s a feature. pi is all primitives, (almost) no features — no built-in MCP, no built-in sub-agents, no built-in plan mode. Claude Code, by contrast, ships with a lot of features — sub-agents, plan mode, agent teams…
  • Tool-call format. Different harnesses accept different tool-call response formats (JSON, XML, Qwen xml_function, Mistral [TOOL_CALLS], etc.) and have associated different parsing logic.
  • Skills / extensibility model. How the harness loads user-extensible behaviour (agentskills.io markdown files, TypeScript extensions, Python modules).
  • Where it runs. pi is a Node CLI; Hermes is a Python long-running gateway; Claude Code runs in an Anthropic desktop tab.

A harness and a model server may be somewhat independent. The same harness can talk to multiple servers (cloud Anthropic today, local Osaurus tomorrow), and the same server can be a backend for several harnesses simultaneously (pi + Cursor + a curl script all pointed at one mlx_lm.server). Many harnesses are model-agnostic, and can be configured to point at any OpenAI-compatible server, which can include OpenRouter, or some self-hosted model. Pointing at Xiaomi’s MiMo, say, direct or via OpenRouter — is a config change. Some stacks are more tightly coupled; Claude Desktop really wants Anthropic servers, for example.

1.3 MCP — Model Context Protocol

MCP is an open protocol for connecting LLM applications (clients, harnesses) to data sources and tools (servers).

MCP is an open protocol that standardises how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications.

The architecture is client-server — but client and server can both live on the same laptop and talk over stdio. A harness typically acts as MCP client. An MCP server is a separate process exposing tools, resources, and prompts.

MCP is not strictly necessary nor universal. pi deliberately does not ship it — its position is that an MCP server is just a wrapper around tools that could equally well be exposed as CLI tools with README files (i.e. Skills). Zechner quantifies the overhead: many MCP servers clog the context with tool instructions up front, whereas a CLI tool’s README costs tokens only when the agent needs it.

Some fun MCP servers:

1.4 Skills and agentskills.io

agentskills.io is another open standard for agents. This one packages capabilities into skills — markdown files with YAML frontmatter that describe a capability and its tools, loaded on-demand into an agent’s prompt.

A skill is either a single name.md file or a directory with SKILL.md plus supporting files:

---
name: skill-name
description: Short description of what this skill does
---

# Skill instructions
1. Do this
2. Then that

Skills are relatively transferable. pi, Hermes, Anthropic’s own Skills system, and various community frameworks all consume the same file format; a skill written for Hermes can be dropped into a pi extension directory and Just Work.

The contrast with MCP is one of binding time. Skills are typically loaded on-demand per session, often based on what the user is asking about. A harness might have hundreds of skills installed and only load three for any given session, keeping context usage low.

2 Build agents

We can just build our own agent. Many have. However, there exist libraries to solve common problems.

2.1 pi

earendil-works/pi (Mario Zechner / badlogic, MIT, TypeScript / Node) is a minimalist harness. Behaviour beyond the loop-and-tool-calls baseline gets added as skills or TypeScript extensions.

Zechner actually ships it as the “Pi Coding Agent”, so pi straddles the line between a generic harness and a coding one.

pi’s entire system prompt plus its four tools (read, bash, edit, write) come in under 1000 tokens, on the argument that frontier models are RL-trained enough to already know what a coding agent is, so a 10k-token system prompt buys little. Zechner backs that with a Terminal-Bench 2.0 run on Opus 4.5 that places pi competitively against harnesses with far more scaffolding.

The minimalism falls out of two commitments Zechner sets out at length. The first is context engineering — exact control over what enters the model’s context, on the premise that mainstream harnesses inject material behind our backs that never surfaces in the UI and degrades the output. The second is observability — being able to inspect every byte of every exchange, with a documented session format we can post-process.

One agent built on pi is the famously bloated OpenClaw, which uses pi as its agent core and adds interface gateways, persistent memory, and the rest.

Affordances:

  • TypeScript native — agents and tools are TypeScript modules with type-checked tool schemas.
  • OS-agnostic — Node CLI, runs on Mac / Linux / WSL.
  • Parallel tool calls — pi can fire several tool calls in a single turn, so any extension that wants parallelism gets it for free.

2.1.1 Subagents and context bloat

pi ships no subagent implementation at all. Its native answer to context bloat is the /tree command: we jump back to an earlier point in the chat history and pi summarizes everything since, collapsing the intervening turns into a précis — many of a subagent’s goals, but explicit.

Zechner objects to classic subagents on two grounds: they are unobservable, and they can be constructed explicitly instead, as separate sessions with shared file context.

No one else is so austere. The popular pi-subagents reinstates the familiar subagent(prompt, ...) tool, the child running on either fresh context (only its instructions) or forked context (a copy of the parent’s window plus instructions). Daniel Nouri’s pi-submarine is a smaller take — fresh/fork context, named agents as markdown files, nested subagents, resumable runs. Both make the subagent’s conversation observable, which is most of Zechner’s objection.

2.1.2 YOLO no guardrails

pi runs in full YOLO mode: unrestricted filesystem access, any command executed with our user privileges, no permission prompts, no Haiku pre-screening of bash commands. The rationale is that anything more is just security theatre; If we want a boundary, run pi inside a container, which maybe we should do in general.

2.2 smolagents

smolagents (Hugging Face, Apache 2.0, Python) is also minimalist, but for Python; there are about 1000 lines of code in agents.py.

The distinguishing feature is the CodeAgent paradigm: instead of the model emitting JSON tool calls that the harness parses and executes, the model emits Python code snippets that get executed in a sandbox (it resembles the TIR loop in mathematical agents).

Tool calls become function calls:

from smolagents import CodeAgent, WebSearchTool, InferenceClientModel

model = InferenceClientModel()
agent = CodeAgent(tools=[WebSearchTool()], model=model, stream_outputs=True)

agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")

HF’s benchmark claim is that this paradigm uses ~30% fewer model steps than JSON tool-calling on difficult agentic benchmarks.

Affordances:

  • Model-agnostic — any HF Hub model via InferenceClientModel, OpenAI / Anthropic / Bedrock via the LiteLLM integration, local execution via transformers or ollama.
  • Tool-agnostic — MCP servers, LangChain tools, and HF Spaces all work as tools.
  • Modality-agnostic — text, vision, video, audio inputs.

Arbitrary Python execution from a language model is exactly as risky as it sounds. For real isolation smolagents supports various managed sandboxes — Modal, E2B, Blaxel — plus Docker for self-hosting.

A ToolCallingAgent is also available alongside CodeAgent for models fine-tuned for the classical JSON paradigm. There is an interactive CLI smolagent and also webagent, a Helium-based web-browsing agent.

Watching it work. smolagents is a library, not an app, so the default surface is whatever our script prints — but oversight comes in three escalating tiers. The base loop streams rich-formatted step panels to the console (one per thought / code-execution / observation), and after a run we can read agent.logs or agent.write_memory_to_messages() for the structured trace. For a human-facing window there is a built-in GradioUIGradioUI(agent).launch() opens a web chat that visualises each thought and tool call live, keeps the conversation going across turns (reset=False), and exposes agent.interrupt() as a stop button. For production oversight smolagents emits OpenTelemetry traces: pip install 'smolagents[telemetry]', call SmolagentsInstrumentor().instrument() once, and every step, tool call, and token count streams into whatever OTel backend we point it at — Arize Phoenix, Langfuse, or MLflow’s one-line mlflow.smolagents.autolog(). That telemetry path is the one to reach for once an agent runs unattended; the Gradio UI is the one for sitting and watching.

2.3 Qwen-Agent

Qwen-Agent (Alibaba, Apache 2.0, Python) is the agent framework the Qwen team build for their own Qwen models; it is the backend of Qwen Chat. Just like smolagents it is a python library. It is not to be confused with Qwen Code CLI, a different codebase.

There is no desktop app and no standalone CLI binary — unlike Goose, nothing to install-and-click. The unit of work is a short Python script that instantiates an agent object, which we then either drive as a terminal chat loop or pass to WebUI(bot).run() to launch a Gradio web UI from the same object. The quickstart is eight lines that do both; the examples/ directory has more elaborate demos.

Install like any other PyPI package. The maximally featureful incantation is

uv add "qwen-agent[gui,rag,code_interpreter,mcp]"

The four optional bracketed extras the framework bonus features — gui (the Gradio UI), rag (long-document retrieval), code_interpreter (Docker-sandboxed Python), mcp (MCP client). Drop any we do not need; bare uv add qwen-agent gives a minimal function-calling core.

Model configuration is via the llm_cfg dict

  • Alibaba’s hosted DashScope API — set DASHSCOPE_API_KEY and model_type: 'qwen_dashscope'.
  • Any OpenAI-compatible server — a local vLLM / Ollama / Osaurus endpoint or a cloud one — model_type: 'oai' plus model_server: 'http://localhost:8000/v1'.

Picking an agent class goes as follows:

  • Assistant is the standard entry point — a general tool-using agent that also reads files. Passing files=[...] at construction time handles PDF / Office / image input via the RAG pipeline. Most demos use this class; start here.
  • FnCallAgent is the minimal function-calling specialist — barely more than a loop around the function-call API, for when we want full control.
  • ReActChat implements the ReAct paradigm (interleave reasoning and action via a scratchpad). Targets backend modes that are not fine-tuned for native function calls and we need to coax tool use through prompting.

We extend it through register_tool / BaseTool (source) plus MCP — but agentskills.io skills are unsupported

Three nifty affordances

The framework grows a new demo with each Qwen release rather than freezing into a stable API — Qwen3-Coder native tool-call parsing, Qwen3-VL vision tool-calls etc.

2.3.1 Driving non-Qwen models

Qwen-Agent reputedly works best with Qwen models — the default tool-call template (fncall_prompt_type='nous') is tuned for Qwen3 / Qwen3-Coder / QwQ-32B. It will drive other models through an OpenAI-compatible endpoint (model_type: 'oai' above), but it might be janky. By default Qwen-Agent parses tool calls out of the raw model text itself; to defer instead to the server’s own native tool-call interface — needed for non-Qwen models, or for Qwen served behind vLLM’s built-in parser — we set use_raw_api: True in generate_cfg. Once again, this is in set up in llm_cfg . Qwen models are well supported on Apple Silicon, so for local use the Qwen-on-Qwen path is likely smooth.

2.4 Heavier orchestration

pi, smolagents, and Qwen-Agent are relatively simple structures, mostly a loop plus some tools. There is a heavier tier above them: LangGraph, CrewAI, AutoGen, LlamaIndex, the role-playing multi-agent frameworks.

It might be worth staying away from the more complicated orchestration structures. e.g. LangGraph models the agent as a state machine (nodes, conditional edges, cycles, persistence underneath): a run can checkpoint, survive a process restart, and resume from the last node, optionally pausing for a human.

CrewAI and Swarms scaffold some other interesting design patterns, e.g. several named agents passing work between them.

2.5 Picking a library

A rough decision rule:

  • TypeScript, want maximal control and observabilitypi.
  • Python-native computational work — data wrangling, plotting, NumPy/pandas in the loop, cross-vendor flexibility → smolagents; the CodeAgent paradigm fits when the natural step is “run this snippet” rather than “call this named tool.”
  • Running Qwen-family models locally, want batteries included — long-doc RAG, sandboxed code interpreter, Gradio UI, multimodal tool calls → Qwen-Agent.
  • The job needs deterministic branchingDynamic-Workflow scripts.
  • Checkpoint-and-resume-across-restarts is the binding constraint → escalate to LangGraph.

Prototype on pi or smolagents; only climb the tier when one of the last two bullets actually bites.

3 Ready-to-run harnesses

Between building our own (above) and the always-on assistants (below) is a middle category: finished, generic harnesses we point at any OpenAI- or Anthropic-compatible server and simply use. Coding is the obvious application but not the only one. pi straddles the line — minimal enough to count as a library to build on, complete enough to run as-is. The code-native tools — Aider, Cline — stay next door in the coding notebook; OpenCode and Goose below are general-purpose agents that happen to be good at code, and Open WebUI is the web-chat one, stronger on tool-use and retrieval than on code.

3.1 OpenCode

OpenCode (anomalyco/opencode, MIT) is a terminal harness supporting 75+ providers, local models included. It wraps LSP, MCP, and a plugin system, and has a massive community. We point it at any endpoint — a commercial token host or a local server — by adding a custom provider with a baseURL in ~/.config/opencode/opencode.json.

curl -fsSL https://opencode.ai/install | bash
# or: npm install -g opencode-ai
# or: brew install anomalyco/tap/opencode

Pro-tip: the canonical repo is anomalyco/opencode; AFAICT opencode-ai/opencode is name-squatting.

Xiaomi’s MiMo Code is a fork that adds long-horizon memory and a self-improvement layer

3.2 Goose

Goose (Apache-2.0, Rust) is the most grown-up of the bunch, in the sense that it pays its taxes and goes to meetings. By which I mean, it has been adopted by the Linux Foundation — the repo moved from block/goose to aaif-goose/goose under the Agentic AI Foundation. Unlike the Python and Node harnesses above it is a native desktop app and a CLI and an embeddable API,.

brew install --cask block-goose   # desktop app + CLI
# CLI only: curl -fsSL https://github.com/aaif-goose/goose/releases/download/stable/download_cli.sh | bash

It works with many model providers, and notably will ride an existing Claude / ChatGPT / Gemini subscription through the Agent Client Protocol rather than demanding a metered API key. ACP runs both ways: Goose can also back an ACP-speaking editor like Zed or JetBrains.

Extensibility seems reasonable It supports many 70+ MCP servers. That is how it supports agentskills.io, in fact — it discovers SKILL.md skills out of the box through the built-in Summon extension (enabled by default), reading them from ~/.agents/skills/ globally and .agents/skills/ per project, with backward-compatible discovery of .claude/skills/ and friends

It also supports Goose-specific format,recipes, that package a prompt, parameters, tools, and extensions into a one-click shareable workflow, with subrecipes for fanning work out across subagents.

As part of the hacker aesthetic, the docs scatter the pitch across many pages; the Goose Janitor write-up might be the starting point. Its coding-agent face is documented next door.

3.3 Open WebUI

Open WebUI (open-webui/open-webui, open-source, self-hostable) is the web-UI member of the category, and it has grown from a chat frontend into a genuine harness. Point it at any OpenAI-compatible server, local or hosted. What tips it from renderer to harness is its Native Function Calling (“Agentic Mode”): instead of one pre-injected RAG dump, the model decides when to call a tool, reads the result, and calls again — search, read, check, search again — a real multi-step loop. On top of that it speaks MCP natively (since v0.6.31; earlier through the mcpo OpenAPI bridge), ships a Python code interpreter, a Tools/Functions (Pipe/Filter/Action) extensibility model, and built-in document RAG. It stays chat-centric, though: strong on tool-using conversation, retrieval and rendering, lighter on the autonomous long-horizon machinery — sub-agents, plan mode, permission gates — that pi and Claude Code lean on, and its agentic loop wants a frontier model to do the multi-step reasoning reliably. Its maths-rendering and fan-out-oracle role gets the detail in the maths notebook.

4 Personal AI assistants

A category of agent product distinct from coding assistants: the always-on, multi-channel personal AI that learns about us over time.

The shape is:

  • A long-running daemon on infrastructure I own (or rent cheaply — a $5 VPS, a home server, a Modal pod that hibernates when idle).
  • Multi-channel deployment — instead of a desktop window, the agent listens on Telegram, Discord, Slack, WhatsApp, Signal, iMessage, email — simultaneously.
  • Persistent memory across sessions, often with auto-generated skills.
  • Sometimes scheduled cron-style automations: daily reports, nightly backups, weekly audits.

4.1 Claude Desktop (Cowork, Code, Remote)

The Anthropic bundle has several distinct surfaces under one app.

Cowork is the personal-assistant tab — “Claude Code for non-developers.” Describe a multi-step task, grant Claude access to a folder, walk away while it works. Office integrations for Word / Excel / PowerPoint / Outlook plus Chrome are built-in. Closed-source, Anthropic-only, Claude as the model. The polished commercial option.

Within the one bundle, four distinct stories about where the work happens:

  • Cowork (local). The autonomous tab runs on the user’s machine. Code execution goes through a local VM Claude manages; computer-use does not — Claude pokes at the screen and apps directly, no sandbox. Within days of launch, Cowork was demonstrated to be vulnerable to prompt injection from web pages it visited.
  • Claude Code (local CLI). Several isolation tiers: sandboxed Bash (Seatbelt on macOS, bubblewrap on Linux), full process sandbox, dev container, custom container, full VM. The default “sandboxed Bash” tier only sandboxes Bash — other built-in tools (Read, Edit, WebFetch) run unsandboxed in the parent process. Several escapes have been published.
  • Claude Code on the web (claude.ai/code). Each session runs in a fresh Anthropic-managed VM with the repo cloned through a credential proxy. Credentials never enter the sandbox — git auth goes through a proxy with scoped tokens. Network access is limited by default; can be disabled entirely. The strongest isolation tier Anthropic offers.
  • Remote Control. Phone or browser drives a local Claude Code session; the agent still runs on the user’s machine, the phone is just a remote.

The pattern across Claude Desktop is “pick the surface, pick the isolation.”

4.2 Hermes Agent

NousResearch/hermes-agent — FOSS, MIT, Python 3.11 + uv. Nous’s own codebase end-to-end. Designed to run on infrastructure I own (a $5 VPS, a home server, my laptop, a Modal pod).

git clone https://github.com/NousResearch/hermes-agent.git
cd hermes-agent
./setup-hermes.sh     # installs uv, creates venv, installs .[all], symlinks ~/.local/bin/hermes
./hermes              # auto-detects the venv, no need to `source` first

What’s distinctive about the product:

  • Model-agnostic — 15+ providers plus arbitrary OpenAI- or Anthropic-compatible endpoints, switchable mid-session.
  • Multi-channel gateway — Telegram, Discord, Slack, WhatsApp, Signal, email, CLI all from one long-running process.
  • Closed learning loop — auto-generated skills from experience, FTS5 session search, persistent memory via Honcho dialectic user modelling. The pitch: the longer it runs, the more it knows about me.
  • Serverless deployment — Modal and Daytona backends with hibernate-on-idle so the agent costs cents between sessions.
  • MCP-native — first class, not bolted on.
  • hermes claw migrate — an explicit migration tool from OpenClaw, the giveaway about who they see as the user base they’re courting.

Hermes ships the agent loop separately from the execution sandbox. Six backends, picked via terminal.backend in config:

Backend Where it runs Isolation Setup
local Host as the user None Default — for testing only
docker Single persistent container Linux namespaces, dropped caps Docker installed
ssh Remote box the user owns Network boundary SSH key + host config
modal Modal cloud VM per task Strongest Modal account
daytona Daytona managed workspace Strong; resumable Daytona API key
singularity HPC-style Apptainer container Namespace isolation without $HOME apptainer installed

A subtle but useful detail: remote backends sync touched files back to the host on teardown into ~/.hermes/cache/remote-syncs/<session-id>/. No need to remember to scp artifacts off the cloud sandbox manually.

Hermes borrows OpenClaw’s DM pairing pattern and adds an encrypted secret-exchange flow for credentials the agent needs at runtime — the user pastes a secret into pi.dev/secret, gets an encrypted blob, pastes it into chat, and the gateway decrypts it locally with an ephemeral private key, storing the cleartext inside the sandbox without the agent itself ever seeing it.

4.3 OpenClaw

openclaw/openclaw — Peter Steinberger’s personal AI assistant project, built on pi. Same shape as Hermes (multi-channel, persistent, owns its own infrastructure) but more cowboy.

The long-running process is a single gateway daemon installed as launchd (macOS) or systemd (Linux) user service. The gateway routes messages from ~25 channels (WhatsApp, Telegram, Slack, Discord, iMessage, Matrix, WeChat, …) to agent sessions, which can be sandboxed with Docker, SSH, or OpenShell.

Default isolation: the main session (interactive use by the owner) runs tools on the host with no sandbox. Non-main sessions (group chats, automation, external users) get sandboxed if the operator opts in (agents.defaults.sandbox.mode: “non-main”). Default deny list for sandboxed sessions covers browser, canvas, nodes, cron, discord, gateway.

Cool tricks:

  • DM pairing. Unknown senders on any channel get a pairing code and the bot ignores them until openclaw pairing approve adds them to a local allowlist.
  • Companion apps. Optional macOS menu-bar app, paired iOS/Android nodes — useful if someone in the household wants the assistant in their pocket.

No cloud-VM backend; for remote execution OpenClaw points at a box the operator already controls via SSH.

4.4 Picking by what we want to do

If we’re building from scratch on top of pi, OpenClaw is the reference implementation; if we’re picking a finished product, Hermes is the from-scratch Python alternative and Cowork is the closed commercial one. Anchored on the well-known options:

  • I want Claude to do stuff with my files and apps while I work on something else. Cowork. Accept the security caveats; do not let it browse untrusted websites.
  • I want to ship a coding task to a hosted sandbox where my credentials never leave my laptop. Claude Code on the web.
  • I want an assistant that messages me on WhatsApp / iMessage / Telegram and runs on my own infrastructure. OpenClaw if channel breadth matters most; Hermes if isolation choice matters most.
  • I want the same agent reachable from my phone, running on a Modal pod that hibernates between sessions. Hermes with modal backend.
  • I want a small group to share one assistant across Slack and Discord, with new members onboarded via DM pairing. OpenClaw, sandbox non-main sessions, channel allowlists.
  • I want the assistant backed by my own local LLM. Hermes or OpenClaw — drop the local-server URL (Osaurus, Ollama) into the model config and it is just another provider, with cloud fallback. Cowork cannot: it is locked to Anthropic’s hosted Claude.

5 Run it — where the agent actually lives

The cross-cutting operational concerns once we have picked a product or library: the vocabulary for the long-running process, the design patterns at the harness layer, where to host the always-on gateway, and what tends to go wrong.

5.1 The moving parts

For a long-running agent — one that holds state across days, runs unattended, accepts inbound messages — “the agent” is at least four moving parts:

  1. The foreground surface the user types into. Short-lived per session.
  2. The daemon or gateway that holds state, routes messages, manages sandboxes. May or may not exist (Claude Code without web mode has no daemon; Hermes and OpenClaw revolve around theirs).
  3. The execution sandbox where shell commands and code actually run. Local, containerised, or a cloud VM.
  4. The channel surface — Telegram, Slack, Discord, email, a web UI, a phone.

Our example agents use these layers very differently.

5.2 Design patterns — sub-agents, plan mode, permission gates, compaction

A loose grab-bag of design patterns at the harness layer that have crystallised into industry conventions:

Sub-agents. Spawn an isolated child agent for a sub-task — its own conversation history, its own tool budget, returning a summary back to the parent. This buys context economy: the dirty work (grepping code, searching the web, writing throwaway spikes) goes to a child while the parent’s context stays lean enough to keep planning. Useful for “explore the codebase and find X” tasks that would otherwise pollute the parent’s context with read-tool calls. Claude Code supports this natively as agent teams; pi does not, on principle, but an extension adds it.

Plan mode. The agent writes out a plan, the user approves it (possibly editing), then the agent executes. Reduces wasted runs on tasks the agent has misunderstood.

Permission gates. Tool calls require user approval before execution. Most harnesses ship some form of this for “destructive” tools (file write, shell exec). The trade-off is more permission prompts vs more autonomy.

Compaction. When context approaches the model’s window limit, summarize older turns to free space. Done well, transparent; done badly, the agent loses the thread.

5.3 Long-horizon agents — memory, reliability, evolution

A long-running agent — one that holds a task across hundreds of turns, or returns to the same project next week — hits three distinct barriers. The context window fills up (memory); single-turn decisions get less reliable the longer the run (reliability); and whatever the agent worked out evaporates when the session ends (evolution) because current LLMs are not continual learners. This split is due to Xiaomi’s analysis in MiMo Code We use MiMo as the worked example and note other live approaches.

5.3.1 Memory

Ideally we hold state across the session, even when it blows our context budget. The simplest way to do this is compaction: summarize the old turns and carry on. But summarize-the-old-turns compaction degrades on long tasks (TODO: get a decent paper summarizine how).

pi’s /tree is one solution, where we constantly select context manually

MiMo Code’s answer moves memory out of the main loop entirely. The main agent keeps no notes of its own. A separate writer subagent, fired by the runtime at fixed points — roughly 20%, 45%, and 70% of the context budget, deliberately early, while there is still room to think and before the lost-in-the-middle effect erodes the extraction — reads the conversation and writes a structured checkpoint to disk. As the window fills, the runtime opens a fresh one and rebuilds context from those files, so from the model’s point of view the conversation never breaks. Memory spans four layers on different lifecycles: a session checkpoint (checkpoint.md), persistent project knowledge (MEMORY.md), cross-project preferences, and a full unindexed SQLite trace of every message and tool call underneath, as the fallback when something is missing from the structured layers. I assume there is some RAG somewhere in the mix?

5.3.2 Multi-turn reliability

The second wall is reliability over long tasks. A long-horizon run is a chain of individual turns — read context, decide, call a tool, repeat — and each turn carries some chance of a wrong call: editing the wrong file, passing a test by changing the test, calling an API that does not exist… Those per-turn odds compound, so even a low rate dominates a long enough run — at i.i.d. error rate of 2% per turn, a 200-turn task finishes clean only about 2% of the time (\(0.98^{200}\approx0.02\)). A short interactive session is self-correcting — we spot the bad turn and fix it — but an unattended multi-hour run has nobody watching, so this is essential for maxxing that famous the reliable task-length metric (METR’s measure, doubling every few months).

MiMo Code attempts to deal with this in a few ways

  • Goal is their evolution of the Ralph Wiggum Loop: a stopping-condition verifier. The user states a done-condition in natural language, and each time the agent tries to stop, an independent model call checks the whole history against it and hands back the gap if the work is not finished — the cure for an agent that declares victory too early.

  • Max mode also spends extra inference-time compute, sampling several candidate plans in parallel and using a low-temperature judge to choose one, for a vendor-reported 10–20% gain on SWE-Bench Pro at four-to-five times the tokens.

  • Dynamic Workflow handles orchestration at scale: the main agent emits a JavaScript script — agent(), parallel(), pipeline(), workflow() — run deterministically in a sandbox, so branch and retry logic is guaranteed by code rather than by a model remembering to follow a SKILL.md.

    That code-beats-prompt principle for predictable control flow is originally Anthropic’s.

5.3.3 Evolution — learning across sessions

Another barrier is the ultra-long horizon one. If we come back to the same project and the agent has forgotten everything it seems like we are leaving performance on the table, re-deriving the same constraints and repeating the same mistakes. MiMo runs two background agents against its own history to fix this. Dream fires every seven days — it reads past sessions and the memory file, then merges, deduplicates, checks that file references still resolve, and compresses the result back down. Distill fires every thirty days and hunts for process rather than facts: recurring work patterns that it solidifies into reusable skills, CLI commands, and SOP documents.

Hermes has a different approach, learning the user rather than the project, accreting auto-generated skills, FTS5-searchable session history, and a persistent model of our preferences via Honcho’s dialectic user-modelling, on the pitch that the longer it runs the more it knows about us. OpenClaw keeps persistent memory in the same spirit. A self-writing skill library is also a standing risk: one bad prompt can mint a capability the agent reuses later, unprompted.

5.4 Hosting the gateway

The gateway has to be a live process for the messaging surfaces to work. Telegram, Discord, Slack, and friends need either a long-polling connection (the gateway dials out and waits) or a webhook endpoint (the messaging provider POSTs in) — both require an always-on Python process. Sub-agents on Modal or Daytona can hibernate; the parent gateway cannot, not if it expects to receive our next “hey, status?” from the phone. When my laptop sleeps, the bot is dead.

Five options for hosting the always-on gateway:

  • A $4–5/mo VPS (Hetzner, DigitalOcean, OVHCloud). The boring correct answer. Public IP, webhook channels Just Work, Hermes installs in a one-liner under systemd. Six months later we will have forgotten the VPS exists, which is the point.
  • A Raspberry Pi or repurposed laptop at home, with Tailscale Funnel or Cloudflare Tunnel handling the inbound webhook problem so we do not have to forward ports through our home router. ~$50–100 one-time + ~$1–2/mo electricity. Privacy wins; ISP and power outages become our problem. An old laptop with the lid closed and systemd-inhibit handle-lid-switch works just as well as a Pi.
  • Hermes deployed on Modal in webhook mode. Hibernates between messages, cold-starts on inbound, costs single-digit dollars per month for bursty personal use. Trade: ~2–5s cold-start latency on the first message after a quiet period.
  • Existing hardware — a NAS, a Mac mini behind the TV, an Intel NUC. Marginal cost is a few watts. Operational cost is whatever our tolerance is for “this is the box that holds the assistant; please don’t unplug it”.
  • Not a 24/7 RunPod instance. RunPod is GPU rental at $0.20–$4/hour — fine if we are also hosting a local model on the same box, but wasteful for a Python process that just proxies API calls. The right RunPod pattern is “spin up a GPU pod for inference, tear it down between uses” (a model-server play, not a gateway-hosting one).

For the “I want both the model AND the gateway on the same box” case, the question shifts from “where do I host the gateway” to “where do I host the model” — see running LLMs locally on a Mac and the Australian sovereign-LLM project for that side.

5.5 What can go wrong

The vulnerability surface of an agent is unusually broad — it spans the model, the tools it runs, the inputs it reads, and the vendor behind it. If classic computer security is about keeping intruders out of the house then agent security is more like managing a toddler who might let intruders into the house, or burn it down or open the door and run into traffic.

There are at least four kinds of failures that I have fretted about. Surely more exist in the wild.

  • The model makes mistakes. It misreads the task and does something destructive in good faith — deletes the wrong directory, force-pushes over someone’s work, pastes a secret into a public channel, emails the draft to the client instead of to us.
  • The code it runs is wrong. An agent that installs and runs software in good faith inherits that software’s bugs and side effects. A correct decision to call a tool is still only as safe as the tool itself.
  • The input is hostile. Web pages, emails, Slack messages, and documents can carry prompt injection that redirects the agent’s tools toward someone else’s goals. This is the famed prompt injection attack.
  • The provider is a trust assumption. Unless we are running the model ourselves, every prompt, file, and pasted secret the agent touches also goes to the token vendor. That makes the vendor a high-value target — a breach there can expose many customers’ data at once. Moreover, they are a party whose interests need not match ours: logs can be retained, scanned for training data, mined by an insider, or handed over under subpoena. Not divulging all our secrets to the Man is not a solved problem; self-hosting relocates that trust under our own roof rather than removing it.

tl;dr An agent wiring together our private data, code execution, and network access can do a lot of damage regardless of why it misbehaves. The job is less to build an impermeable wall than it is to reduce the rate and blast radius of the fuckups by the resident agent.

Mitigations are often layered:

  • At the agent layer. Bounded autonomy: permission gates on destructive tools, plan-mode approval before execution, default-deny pairing for unknown senders, read-only or narrowly-scoped tools, credential proxies so the agent never sees the raw secret, a human in the loop on anything irreversible. These catch good-faith mistakes and the clumsier injections before they fire.
  • At the isolation layer. Run the execution environment in a sandbox or a container, restrict network egress, keep the environment reclaimable. This bounds the damage of a bad action regardless of which failure mode produced it.

None of this is fool-proof. Agents are creative at escaping constraints. A determined agent can still find ways through its own gates, either at the behest of an adversarial prompt or a surfeit of helpful enthusiasm. There is, as often in security, a trade-off between power and convenience and by definition we are using agents because we want them to do powerful things on our behalf, so the more we lock them down, the less useful they become. The defaults tend to be weak, sliding toward “security theatre”. If we need to click the “approve” dialogue box 200 times, how well have we assessed the risks each time? How conversant are we in setting up the right sandboxes for each problem? Real-world sandboxes are routinely escapable and “not especially secure”. Isolation is a reduction of blast radius, not a wall. Zechner argues that once an agent can write and run code the lethal trifecta — private data, code execution, network reach — is already in play, so sprinkling permission prompts through the middle buys at best complexity and at worst an illusion of safety. He tends to favour learning to configure good sandboxes and monitoring the agent’s activity through logs and audit trails over trying to put guardrails on the agent itself.

Other things to think about:

  • Default-deny the untrusted path. Ignore external senders until paired (OpenClaw and Hermes both do this), and run web-touching sessions in a sandbox that can be thrown away. Cowork’s computer-use mode for example, used no sandbox at all, and was demonstrated vulnerable within days of launch.
  • Credentials are vulnerable. Attackers and confused agents are happy to get their hands on our secrets: API keys, SSH keys, OAuth tokens, the .env file, the cloud credentials. These translate into real resources, the ability to impersonate us and outlast the current session. Once an agent has filesystem and network access, every secret sitting in ~/.ssh, ~/.aws, or a .env is one cat and one POST away, which is why “grant the agent our home folder and hope” is such a weak default. Stronger patterns keep the raw secret out of the agent’s reach in a secret manager, dispensing short-lived, scoped tokens. Claude Code on the web routes git auth through such a proxy, so credentials never enter the sandbox; Hermes’s encrypted secret exchange decrypts on the gateway with an ephemeral key, so the cleartext lands inside the sandbox but the agent and the model never see it.
  • Watch the auto-skill-generation loop. When a harness writes its own skills from past tasks (Hermes; MiMo Code’s Dream/Distill passes), one bad prompt can mint a persistent capability the agent reuses later, unprompted. Audit ~/.hermes/skills/, .mimocode/, or the equivalent periodically.

I’m sure there are more failure modes than these; the point is to be thoughtful about the risks and mitigations, and to accept that some risk is inherent in powerful agents until alignment and capabilities both jointly achieve perfection.

6 Feeding documents in

A near-universal operational question for agentic systems: how do PDFs, Word docs, spreadsheets get from the file system into the agent’s context. There are three architectures, and harnesses spread across them.

  1. Native multimodal model. The model itself ingests the document binary (or rendered pages) and processes text plus charts plus layout in one go. Highest fidelity. Only frontier closed-weight models (Claude, GPT-4o, Gemini) currently do this well; most local open-weight models cannot.
  2. Frontend- or harness-side text extraction. The client runs a PDF→text library and drops the result into context. Loses charts and visual layout but is usually fine for prose-heavy documents.
  3. Agent-driven conversion via shell tools. The harness invokes a converter, reads the markdown back, and continues. Composable but more setup.

6.1 Harness-side affordances

Which architecture each harness gives us out of the box:

Harness Ingestion Notes
Qwen-Agent Assistant Built-in RAG files=[long_pdf] runs DocParser + BM25 hybrid retrieval, no vector DB; .pdf/.docx/.pptx/.txt/.csv/.xlsx/.html, 1M-token tested
Claude Desktop Native multimodal Drag-drop; vision models read the document binary, no extraction step
smolagents CodeAgent Agent-driven No pipeline, but the agent writes Python (pypdf, pdfplumber, pandas) in its loop as needed
Hermes, pi, OpenClaw BYO tool No reliable built-in extraction; Hermes’s web_extract handles PDF URLs → markdown, local files via a shell converter or skill

For non-trivial PDFs — maths, scans, complex layout — see PDF ingestion.

7 Should we? When an agent is the right tool

A short pragmatic note before this notebook starts to sound like an unconditional endorsement.

Agents are the right tool when:

  • The task is multi-step and exploratory (debug this, refactor that, find references to X).
  • The intermediate steps have value beyond the final answer (the agent reads files, runs commands, reports what it found).
  • Tool use unlocks capability the model lacks on its own (executing code, querying live data, browser automation).

Agents are not the right tool when:

  • The task is a one-shot transformation (translate this text, summarize this PDF, generate boilerplate).
  • The task is so specific that a hardcoded script is faster and less error-prone.
  • The cost of a wrong action is high and approval gates would make the agent slower than doing the task directly.

The current 2026 mania pushes hard in the direction of “use an agent for everything”; resist where appropriate.

8 Incoming

  • DeepPlanning benchmark (Zhang2026DeepPlanning?)

  • DeepPlanning

  • Qwen/DeepPlanning

  • Advanced Large Language Model Agents

  • Announcing the Agent2Agent Protocol (A2A) - Google Developers Blog

  • Workshop on Agentic AI for Scientific Discovery

  • Agent Laboratory: Using LLM Agents as Research Assistants

    Agent Laboratory takes input from a human-produced research idea and outputs a research report and code repository. Agent Laboratory is meant to assist you as the human researcher in implementing your research ideas. You are the pilot. Agent Laboratory provides a structured framework that adapts to your computational resources, whether you’re running it on a MacBook or on a GPU cluster. Agent Laboratory consists of specialised agents driven by large language models to support you through the entire research workflow—from conducting literature reviews and formulating plans to executing experiments and writing comprehensive reports. This system is not designed to replace your creativity but to complement it, enabling you to focus on ideation and critical thinking while automating repetitive and time-intensive tasks like coding and documentation. By accommodating various levels of computational resources and human involvement, Agent Laboratory aims to accelerate scientific discovery and optimise your research productivity.

  • J-Rosser-UK/AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds (Rosser and Foerster 2025)

    Scaffolding Large Language Models (LLMs) into multi-agent scaffolds often improves performance on complex tasks, but the safety impact of such scaffolds has not been as thoroughly explored. In this paper, we introduce AGENTBREEDER a framework for multi-objective evolutionary search over scaffolds. Our REDAGENTBREEDER evolves scaffolds towards jailbreaking the base LLM while achieving high task success, while BLUEAGENTBREEDER instead aims to combine safety with task reward. We evaluate the scaffolds discovered by the different instances of AGENTBREEDER and popular baselines using widely recognized reasoning, mathematics, and safety benchmarks. Our work highlights and mitigates the safety risks due to multi-agent scaffolding.

  • Why Simulator AIs want to be Active Inference AIs

9 References

Bengio, Cohen, Fornasiere, et al. 2025. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?
Carey, Langlois, Merwijk, et al. 2025. Incentives for Responsiveness, Instrumental Control and Impact.”
Chen, Dong, Shu, et al. 2023. AutoAgents: A Framework for Automatic Agent Generation.”
Crutchfield, and Jurgens. 2025. Agentic Information Theory: Ergodicity and Intrinsic Semantics of Information Processes.”
Everitt, Garbacea, Bellot, et al. 2025. Evaluating the Goal-Directedness of Large Language Models.”
Guo, Chen, Wang, et al. 2024. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges.”
Hammond, Chan, Clifton, et al. 2025. Multi-Agent Risks from Advanced AI.”
Hyland, Gavenčiak, Costa, et al. 2024. Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents.” In.
Kalai, and Lehrer. 1993. Rational Learning Leads to Nash Equilibrium.” Econometrica.
Li, Al Kader Hammoud, Itani, et al. 2023. CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Language Model Society.” In Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23.
Qu, Dai, Wei, et al. 2025. Tool Learning with Large Language Models: A Survey.” Front. Comput. Sci.
Rosser, and Foerster. 2025. AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement.”
Schmidgall, Su, Wang, et al. 2025. Agent Laboratory: Using LLM Agents as Research Assistants.”
Walters, Kaufmann, Sefas, et al. 2025. Free Energy Risk Metrics for Systemically Safe AI: Gatekeeping Multi-Agent Study.”
Wu, Bansal, Zhang, et al. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.”