AI agents, applied

Stack vocabulary, harnesses, MCP, and the products that wrap them

2025-02-02 — 2026-07-02

In Which the Architecture of Agentic AI Systems Is Surveyed, With Attention to the Tradeoffs Between Observability, Isolation, and Context Management Across Several Competing Harness Frameworks.

AI safety

computers are awful together

faster pussycat

language

machine learning

neural nets

NLP

premature optimization

slop

technology

As is appropriate for the subject, this page is slop — albeit slop distilled from my own notes while I installed things and tried to understand how they worked.

Everyone is using “agentic” AI now (claude desktop, openclaw etc). How do I do that? Should I do that? What is the least worst way to extract value from The Machine without divulging all my secrets to The Man? This notebook collects agent-generic knowledge; later I put it to work running LLMs locally on a Mac and with specialized coding and mathematics agents.

1 Vocabulary

1.1 The stack

Model: the weights themselves — Qwen, DeepSeek, mxbai-embed, Claude (closed weights), GPT (closed weights), etc. Distributed as .safetensors or some equivalent format, typically from Hugging Face.
Runtime / inference engine: the code that uses the weights for inference — llama.cpp, MLX, vLLM, SGLang, mlx-lm, antirez’s ds4 for single-model specialised cases. Where the compute happens.
Server / daemon: a long-lived process that wraps the runtime in an HTTP endpoint (almost always OpenAI- or Anthropic-compatible) — ollama serve, llama-server, mlx_lm.server, Osaurus, ds4-server, vLLM, Unsloth Studio’s local endpoint, hosted endpoints at Anthropic / OpenAI / Google. Stateless from the application’s point of view.
Harness / agent loop: the orchestration layer over the server — manages conversation state, tool calls, system prompts, multi-turn agent loops. Not necessarily obvious.
Frontend / chat client: the human-facing surface — a desktop chat window, a text-mode UI, a multi-channel messaging bridge, a code-editor plugin, a web UI.
Quantization format: how the weights are stored with reduced precision on disk for local execution — GGUF for llama.cpp, MLX safetensors for MLX, JANG for mavericks.

These interact in idiosyncratic ways. Some things interchangeable, others not. The compute parts (model / quantization / runtime) can often be tightly coupled. The pieces nearest the users (server / harness / frontend) generally do not care so much about the details of how the compute happens, and tend to be less coupled.

That said, any components might end up tightly coupled if a particular vendor wants to lock me in to their ecosystem. Many turnkey products are vertical bundles across several layers — Osaurus is frontend + harness + server + runtime in one, Ollama is server + runtime, Unsloth Studio is runtime + server + fine-tuning UI, Claude Desktop is frontend + harness pointed at Anthropic’s hosted server.

1.2 What a harness does

The harness layer takes a model server (an HTTP endpoint talking in JSON) and uses a repeated cycle of specially-interpret chats to get agentic behaviour out of it. Several affordances are used in modern agents:

Conversation state. Maintain a thread of messages across turns; decide when to compact older turns to fit the context window.
Tool calling. Expose tools (read file, run shell, query database, fetch URL) to the model in its prompt; parse tool-call responses out of the model output; execute them; feed results back as new turns.
System prompt management. Inject project-level instructions (AGENTS.md, CLAUDE.md, SOUL.md), per-session overrides, and skill descriptions before each model call.
Multi-turn loops. Run the model in a loop until it stops asking to run tools, with bounded iterations and error handling.

Harnesses make different design choices:

Opinionation. pi is all primitives, (almost) no features — no built-in MCP, no built-in sub-agents, no built-in plan mode. Claude Code, by contrast, ships with a lot of features — sub-agents, plan mode, agent teams…
Tool-call format. Different harnesses accept different tool-call response formats (JSON, XML, Qwen xml_function, Mistral [TOOL_CALLS], etc.) and have associated different parsing logic.
Skills / extensibility model. How the harness loads user-extensible behaviour (agentskills.io markdown files, TypeScript extensions, Python modules).
Where it runs. pi is a Node CLI; Hermes is a Python long-running gateway; Claude Code runs in an Anthropic desktop tab.

A harness and a model server may be somewhat independent. The same harness can talk to multiple servers (cloud Anthropic today, local Osaurus tomorrow), and the same server can be a backend for several harnesses simultaneously (pi + Cursor + a curl script all pointed at one mlx_lm.server).

1.3 MCP — Model Context Protocol

MCP is an open protocol for connecting LLM applications (clients, harnesses) to data sources and tools (servers).

MCP is an open protocol that standardises how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications.

The architecture is client-server — but client and server can both live on the same laptop and talk over stdio or localhost. A harness (typically but not exclusively) acts as MCP client. An MCP server is a separate process exposing tools, resources, and prompts.

MCP is not strictly necessary nor universal. pi deliberately does not ship it — its position is that an MCP server is just a wrapper around tools that could equally well be exposed as CLI tools with README files (i.e. Skills). Zechner observes that MCP does clog up the context window more.

Some fun MCP servers:

Code-relevant MCP servers — Git-MCP, claude-code-mcp, venv-mcp-server, XcodeBuildMCP, etc.
punkpeye/awesome-mcp-clients — community curated set.

1.4 Skills and agentskills.io

agentskills.io is another open standard for agents. This one packages capabilities into skills — markdown files with YAML frontmatter that describe a capability and its tools, loaded on-demand into an agent’s prompt.

A skill is either a single name.md file or a directory with SKILL.md plus supporting files:

---
name: skill-name
description: Short description of what this skill does
---

# Skill instructions
1. Do this
2. Then that

Skills are relatively transferable. pi, Hermes, Anthropic’s own Skills system, and various community frameworks all consume the same file format; a skill written for Hermes can be dropped into a pi extension directory and Just Work.

The contrast with MCP is one of binding time. Skills are typically loaded on-demand per session, often based on what the user is asking about. A harness might have hundreds of skills installed but only load three for any given session, keeping context usage low.

2 Build agent harnesses

We can just build our own specialised agent harness. Many have. It is not rocket science. I’ve done it. However, they are a little subtle to get right, and have a long tail of annoying difficulties to solve, so, in general, we should probably not?

However, there exist libraries to solve common problems, and basic harnesses that one can customize.

2.1 pi

earendil-works/pi (Mario Zechner / badlogic, MIT, TypeScript / Node) is a minimalist harness. Behaviour beyond the loop-and-tool-calls baseline gets added as skills or TypeScript extensions.

Zechner ships it as the Pi Coding Agent, so pi overtly straddles the line between a generic harness and a coding one.

pi’s entire system prompt plus its four tools (read, bash, edit, write) come in under 1000 tokens, on the argument that frontier models are RL-trained enough to already know what a coding agent is, so a 10k-token system prompt buys little. Zechner backs that with a Terminal-Bench 2.0 run on Opus 4.5 that places pi competitively against harnesses with far more scaffolding.

The minimalism falls out of two commitments Zechner sets out at length. The first is context engineering — exact control over what enters the model’s context, on the premise that mainstream harnesses inject material behind our backs that never surfaces in the UI and degrades the output. The second is observability — being able to inspect every byte of every exchange, with a documented session format we can post-process.

One agent built on pi is the famously bloated OpenClaw, which uses pi as its agent core and adds interface gateways, persistent memory, and the rest.

Affordances:

TypeScript native — agents and tools are TypeScript modules with type-checked tool schemas.
OS-agnostic — Node CLI, runs on Mac / Linux / WSL.
Parallel tool calls — pi can fire several tool calls in a single turn, so any extension that wants parallelism gets it for free.

2.1.1 Subagents and context bloat

pi ships no subagent implementation at all. Its native answer to context bloat is the /tree command: we jump back to an earlier point in the chat history and pi summarizes everything since, collapsing the intervening turns into a précis — many of a subagent’s goals, but explicit.

Zechner objects to classic subagents on two grounds: they are unobservable, and they could instead be constructed explicitly, as separate sessions with shared file context.

No one else is so austere. The popular pi-subagents reinstates the familiar subagent(prompt, ...) tool, the child running on either fresh context (only its instructions) or forked context (a copy of the parent’s window plus instructions). Daniel Nouri’s pi-submarine is a smaller take — fresh/fork context, named agents as markdown files, nested subagents, resumable runs. Both make the subagent’s conversation observable, which addresses most of Zechner’s objection.

2.1.2 YOLO no guardrails

pi runs in full YOLO mode: unrestricted filesystem access, any command executed with our user privileges, no permission prompts, no Haiku pre-screening of bash commands. The rationale is that anything more is just security theatre; if we want a boundary, run pi inside a container, which maybe we should do in general.

2.2 smolagents

smolagents (Hugging Face, Apache 2.0, Python) is also minimalist, but for Python.

The distinguishing feature is the CodeAgent paradigm: instead of the model emitting JSON tool calls that the harness parses and executes, the model emits Python code snippets that get executed in a sandbox (it resembles the TIR loop in mathematical agents).

Tool calls become function calls:

from smolagents import CodeAgent, WebSearchTool, InferenceClientModel

model = InferenceClientModel()
agent = CodeAgent(tools=[WebSearchTool()], model=model, stream_outputs=True)

agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")

HF’s benchmark claim is that this paradigm uses ~30% fewer model steps than JSON tool-calling on difficult agentic benchmarks.

Affordances:

Model-agnostic — any HF Hub model via InferenceClientModel, OpenAI / Anthropic / Bedrock via the LiteLLM integration, local execution via transformers or ollama.
Tool-agnostic — MCP servers, LangChain tools, and HF Spaces all work as tools.
Modality-agnostic — text, vision, video, audio inputs.

Arbitrary Python execution from a language model is at least as risky as it sounds. For real isolation, smolagents supports various managed sandboxes — Modal, E2B, Blaxel — plus Docker for self-hosting.

A ToolCallingAgent is also available alongside CodeAgent for models fine-tuned for the classical JSON paradigm. There is an interactive CLI smolagent and also webagent, a Helium-based web-browsing agent.

observing smolagents

The base loop streams rich-formatted step panels to the console (one per thought / code-execution / observation), and after a run we can read agent.logs or agent.write_memory_to_messages() for the structured trace. more elaborately, we can add GradioUI. GradioUI(agent).launch() opens a web chat that visualises each thought and tool call live, keeps the conversation going across turns (via reset=False), and exposes agent.interrupt() as a stop button. For production oversight, smolagents emits OpenTelemetry traces: pip install ‘smolagents[telemetry]’, call SmolagentsInstrumentor().instrument() once, and every step, tool call, and token count streams into some backend, e.g. Arize Phoenix, Langfuse, or MLflow’s mlflow.smolagents.autolog().

2.3 Qwen-Agent

Qwen-Agent (Alibaba, Apache 2.0, Python) is the agent framework the Qwen team built for their own Qwen models; it is the backend of Qwen Chat. Just like smolagents, it is a Python library. It is not to be confused with the Qwen Code CLI, a different codebase.

There is no desktop app and no standalone CLI . The unit of work is a short Python script that instantiates an agent object, which we then either drive from a terminal chat loop or pass to WebUI(bot).run() to launch a Gradio web UI from the same object. The quickstart is eight lines that do both; the examples/ directory has more elaborate demos.

Install like any other PyPI package. The maximally featureful incantation is

uv add "qwen-agent[gui,rag,code_interpreter,mcp]"

The four optional bracketed extras are the framework’s bonus features — gui (the Gradio UI), rag (long-document retrieval), code_interpreter (Docker-sandboxed Python), mcp (MCP client). Drop any we do not need; bare uv add qwen-agent gives a minimal function-calling core.

Model configuration is via the llm_cfg dict.

Picking an agent class goes as follows:

Assistant is the standard entry point — a general tool-using agent that also reads files. Passing files=[...] at construction time handles PDF / Office / image input via the RAG pipeline. Most demos use this class; start here.
FnCallAgent is the minimal function-calling specialist — barely more than a loop around the function-call API, for when we want full control.
ReActChat implements the ReAct paradigm (interleaving reasoning and action via a scratchpad). Targets backend modes that are not fine-tuned for native function calls and thus we need to coax tool use through prompting.

We extend it through register_tool / BaseTool (source) plus MCP — but agentskills.io skills are unsupported

Three nifty affordances:

Long-document RAG.. files=[long_pdf] runs a built-in hybrid-retrieval pipeline — no vector DB to wire up, you get this for free.
BrowserQwen. A Chrome extension that controls the browser — reads pages, summarises, navigates etc.
Math TIR. The stack fine-tunes Qwen2.5-Math for tool-integrated reasoning — interleaving natural-language reasoning with Python (SymPy checks, numerical sanity, integration) via code_interpreter, which runs in a Docker sandbox.

The framework grows a new demo with each Qwen release — Qwen3-Coder native tool-call parsing, Qwen3-VL vision tool-calls, etc.

2.3.1 Driving non-Qwen models

Qwen-Agent reputedly works best with Qwen models — the default tool-call template(fncall_prompt_type='nous') is tuned for Qwen3 / Qwen3-Coder / QwQ-32B. It will drive other models through an OpenAI-compatible endpoint (model_type: 'oai' above), but it might be janky. By default Qwen-Agent parses tool calls out of the raw model text itself; to defer instead to the server’s own native tool-call interface — needed for non-Qwen models, or for Qwen served behind vLLM’s built-in parser — we set use_raw_api: True in generate_cfg. This is set up in llm_cfg. Qwen models are well supported on Apple Silicon, so for local use the Qwen-on-Qwen path is likely smooth.

2.4 Heavier orchestration

pi, smolagents, and Qwen-Agent are relatively simple structures, mostly a loop plus some tools. There exists a more industrial category of agent orchestration: LangGraph, CrewAI, AutoGen, LlamaIndex, the role-playing multi-agent frameworks. These do a lot. e.g. LangGraph models the agent as a state machine (nodes, conditional edges, cycles, persistence underneath): a run can checkpoint, survive a process restart, and resume from the last node, optionally pausing for a human.

CrewAI and Swarms scaffold some other interesting design patterns, e.g. several named agents passing work between them.

Too heavy for any use I have but might be interesting for specialist use.

2.5 Picking a library

A rough decision rule:

TypeScript, want maximal control and observability → pi.
Python-native computational work — data wrangling, plotting, NumPy/pandas in the loop, cross-vendor flexibility → smolagents; the CodeAgent paradigm fits when the natural step is “run this snippet” rather than “call this named tool.”
Running Qwen-family models locally, want batteries included — long-doc RAG, sandboxed code interpreter, Gradio UI, multimodal tool calls → Qwen-Agent.
The job needs deterministic branching → Dynamic-Workflow scripts.
Checkpoint-and-resume-across-restarts needed → escalate to LangGraph.

I would prototype on pi or smolagents and keep it simple unless I had reasons to escalate.

3 Ready-to-run harnesses

Between building our own (above) and the always-on assistants (below) is a middle category: finished, generic harnesses we point at any OpenAI- or Anthropic-compatible server and simply use. Coding is an obvious application but not the only one. pi straddles the line — minimal enough to count as a library to build on, complete enough to run as-is. The code-native tools — Aider, Cline — are in the coding notebook; OpenCode and Goose below I’ll admit as general-purpose agents that happen to be good at code.

3.1 OpenCode

OpenCode (anomalyco/opencode, MIT) is a terminal harness supporting 75+ providers, local models included. It wraps LSP, MCP, and a plugin system, and has a massive community. We point it at any endpoint by adding a custom provider with a baseURL in ~/.config/opencode/opencode.json.

curl -fsSL https://opencode.ai/install | bash
# or: npm install -g opencode-ai
# or: brew install anomalyco/tap/opencode

Pro-tip: the canonical repo is anomalyco/opencode; AFAICT opencode-ai/opencode is name-squatting.

Xiaomi’s MiMo Code is a fork that adds long-horizon memory and a self-improvement layer

3.2 Goose

Goose (source at aaif-goose/goose, Apache-2.0, Rust) is a grown-up member of the bunch, in the sense that it pays its taxes and goes to meetings. By which I mean, it has been adopted by the Linux Foundation. Unlike the Python and Node harnesses above, it is a native desktop app and a CLI and an embeddable API.

brew install --cask block-goose   # desktop app + CLI
# CLI only: curl -fsSL https://github.com/aaif-goose/goose/releases/download/stable/download_cli.sh | bash

It works with many model providers, and notably will ride an existing Claude / ChatGPT / Gemini subscription through the Agent Client Protocol rather than via metered API key. The ACP runs both ways: Goose can also back an ACP-speaking editor like Zed or JetBrains.

Extensibility seems reasonable. It supports 70+ MCP servers.

Skills are supported through the built-in Summon extension, reading them from ~/.agents/skills/ globally and .agents/skills/ per project, with backward-compatible discovery of .claude/skills/ and friends.

It also supports a Goose-specific format, recipes, that package a prompt, parameters, tools, and extensions into a one-click shareable workflow, with subrecipes for fanning work out across subagents. 🚧TODO🚧: I am unclear on the difference between a recipe and a skill.

As part of its hacker aesthetic, the docs diffuse the pitch across many pages; the Goose Janitor write-up might be a good central depot.

3.3 Open WebUI

Open WebUI (open-webui/open-webui, open-source, self-hostable) is venerable in chat years, dating all the way back to 2023. If I am not mistaken, it grew out of an attempt to provide a chat UX for any by-the-token chat API in the browser. It may still function as such? At some point the UI grew a genuine agentic harness, speaking MCP natively, executing Python code, and built-in document RAG. The feature set skews toward tool-using conversation over the autonomous long-horizon machinery over the agency-first design of more recent agents. It seems especially well-fitted to being an agentic mathematics frontend, with the best equation rendering out of all the options here.

It shows the signs of a long and chaotic evolution, comprising a messy nest of Node packages, Python scripts, and weird version requirements. I suspect we don’t want to see how this sausage is made.

3.3.1 Setup

Most Open WebUI walkthroughs bundle an extra copy of ollama for shits and giggles. This is usually not what I want, because I already over-engineered too many token serving options, plus fall back to OpenRouter.

Execution as a python app

ENABLE_OLLAMA_API=False WEBUI_AUTH=False DATA_DIR=~/.open-webui \
  uvx --python 3.11 open-webui@latest serve --host 127.0.0.1 --port 8111

or containerized:

docker run -d --name open-webui \
  -p 127.0.0.1:8111:8080 \
  -v open-webui:/app/backend/data \
  -e ENABLE_OLLAMA_API=False \
  -e WEBUI_AUTH=False \
  ghcr.io/open-webui/open-webui:main

Breaking that down:

DATA_DIR is where chats and settings persist; without it uvx can leave them in a cache directory that evaporates. The docker version persists data wherever docker stores -v open-webui:/app/backend/data
Setting WEBUI_AUTH=False/-e WEBUI_AUTH=False at first launch skips login for a single-user laptop install — an ordinary environment variable on the uvx command line, -e WEBUI_AUTH=False on the docker run. Once initialised in single-user mode this is permanent — the choice is permanent per datastore.
ENABLE_OLLAMA_API=False stops the app probing for the Ollama we decided not to run.
The UI is served per default promiscuously on 0.0.0.0:8080, --port 8111 moves it, and --host 127.0.0.1 restricts it to loopback, which is what we want on a laptop we are not deliberately sharing. The container always listens on 8080 inside itself; the host-side port is the left half of -p, so -p 9999:8080 publishes it on 9999.

There is also an alpha desktop app (Electron, AGPL-3.0, macOS / Windows / Linux) that wraps the same web UI in a native window.

Backends are configured in the UI under User Setting → Admin Settings → Connections → OpenAI:

OpenRouter — URL https://openrouter.ai/api/v1 plus an API key. OpenRouter exposes thousands of models, which swamps the model picker and makes page loads crawl; add the handful of model IDs we use to the connection’s Model IDs allowlist and switch on Cache Base Model List (or ENABLE_BASE_MODELS_CACHE=True).
A local server — vllm-mlx or oMLX at http://localhost:8000/v1, Osaurus at http://localhost:1337/v1; API key blank (or the --api-key we launched the server with if we did that) From inside the container, localhost is the container itself — the host’s endpoint is http://host.docker.internal:8000/v1. These servers all implement /v1/models, so the model names auto-detect and populate the picker.

Pro-tip: curl http://localhost:8000/v1/models | jq '.data[].id' shows what the picker will see before we touch the UI.

Fun feature: equation rendering works offline with nothing to configure. Open WebUI bundles KaTeX — the library, the mhchem chemistry extension, the stylesheet, and the maths fonts — into its frontend build.

4 Personal AI assistants

A category of agent product distinct from coding assistants: the always-on, multi-channel personal AI that learns about us over time.

The shape is:

A long-running daemon on infrastructure I own (or rent cheaply — a $5 VPS, a home server, a Modal pod that hibernates when idle).
Multi-channel deployment — instead of a desktop window, the agent listens on Telegram, Discord, Slack, WhatsApp, Signal, iMessage, email — simultaneously.
Persistent memory across sessions, often with auto-generated skills.
Sometimes scheduled cron-style automations: daily reports, nightly backups, weekly audits.

4.1 Claude Desktop (Cowork, Code, Remote)

The Anthropic bundle has several distinct surfaces under one app.

Cowork is the personal-assistant tab — “Claude Code for non-developers.” Describe a multi-step task, grant Claude access to a folder, walk away while it works. Office integrations for Word / Excel / PowerPoint / Outlook plus Chrome are built-in. Closed-source, Anthropic-only, Claude as the model. The polished commercial option.

Within the one bundle, three-and-a-half different execution environments:

Cowork (local). Runs on the user’s machine. Code execution goes through a local VM Claude manages; computer-use does not — it is molesting my real actual screen, using my actual apps. Within days of launch, Cowork was demonstrated to be vulnerable to prompt injection from web pages it visited. Cool.
- Variant: Remote Control. Phone or browser drives a local Claude Code session; the agent still runs on the user’s machine, but the phone is now a remote
Claude Code (local CLI). Several isolation tiers: sandboxed Bash (Seatbelt on macOS, bubblewrap on Linux), full process sandbox, dev container, custom container, full VM. The default “sandboxed Bash” tier only sandboxes Bash — other built-in tools (Read, Edit, WebFetch) run unsandboxed in the parent process. Several exploits have been published for that fella too.
Claude Code on the web (claude.ai/code). Each session runs in a fresh Anthropic-managed VM with the repo cloned through a credential proxy. Git credentials never enter the sandbox — git auth goes through a proxy with scoped tokens (unclear to me how credentials for other services are managed though; access tokens need to meet the code some time). Network access is limited by default; can be disabled entirely. Still, the strongest isolation tier Anthropic offers natively.

4.2 Hermes Agent

NousResearch/hermes-agent — FOSS, MIT, Python 3.11 + uv. Nous’s own codebase end-to-end. Designed to run on infrastructure I own (a $5 VPS, a home server, my laptop, a Modal pod).

git clone https://github.com/NousResearch/hermes-agent.git
cd hermes-agent
./setup-hermes.sh     # installs uv, creates venv, installs .[all], symlinks ~/.local/bin/hermes
./hermes              # auto-detects the venv, no need to `source` first

Distinctive weirdness:

Model-agnostic — 15+ providers plus arbitrary OpenAI- or Anthropic-compatible endpoints, switchable mid-session.
Multi-channel gateway — Telegram, Discord, Slack, WhatsApp, Signal, email, CLI all from one long-running process.
Closed learning loop — auto-generated skills from experience, FTS5 session search, persistent memory via Honcho dialectic user modelling. The pitch: the longer it runs, the more it knows about me.
Serverless deployment — Modal and Daytona backends with hibernate-on-idle so the agent costs cents between sessions.
MCP-native — first class, not bolted on.
hermes claw migrate — an explicit migration tool from OpenClaw, the giveaway about who they see as the user base they’re courting.

Hermes ships the agent loop separately from the execution sandbox. Six backends, picked via terminal.backend in config:

Backend	Where it runs	Isolation	Setup
`local`	Host as the user	None	Default — for testing only
`docker`	Single persistent container	Linux namespaces, dropped caps	Docker installed
`ssh`	Remote box the user owns	Network boundary	SSH key + host config
`modal`	Modal cloud VM per task	Strongest	Modal account
`daytona`	Daytona managed workspace	Strong; resumable	Daytona API key
`singularity`	HPC-style Apptainer container	Namespace isolation without `$HOME`	`apptainer` installed

A subtle but useful detail: remote backends sync touched files back to the host on teardown into ~/.hermes/cache/remote-syncs/<session-id>/. No need to remember to scp artifacts off the cloud sandbox manually.

Hermes borrows OpenClaw’s DM pairing pattern and adds an encrypted secret-exchange flow for credentials the agent needs at runtime — the user pastes a secret into pi.dev/secret, gets an encrypted blob, pastes it into chat, and the gateway decrypts it locally with an ephemeral private key, storing the cleartext inside the sandbox without the agent itself ever seeing it.

4.3 OpenClaw

openclaw/openclaw — Peter Steinberger’s personal AI assistant project, built on pi. Same shape as Hermes (multi-channel, persistent, owns its own infrastructure) but more cowboy.

The long-running process is a single gateway daemon installed as launchd (macOS) or systemd (Linux) user service. The gateway routes messages from ~25 channels (WhatsApp, Telegram, Slack, Discord, iMessage, Matrix, WeChat, …) to agent sessions, which can be sandboxed with Docker, SSH, or OpenShell.

Default isolation: the main session (interactive use by the owner) runs tools on the host with no sandbox. Non-main sessions (group chats, automation, external users) get sandboxed if the operator opts in (agents.defaults.sandbox.mode: “non-main”). Default deny list for sandboxed sessions covers browser, canvas, nodes, cron, discord, gateway.

Cool tricks:

DM pairing. Unknown senders on any channel get a pairing code and the bot ignores them until openclaw pairing approve adds them to a local allowlist.
Companion apps. Optional macOS menu-bar app, paired iOS/Android nodes — useful if someone in the household wants the assistant in their pocket.

No cloud-VM backend; for remote execution OpenClaw points at a box the operator already controls via SSH.

4.4 Picking by what we want to do

If we’re building from scratch on top of pi, OpenClaw is the reference implementation; if we’re picking a finished product, Hermes is the from-scratch Python alternative and Cowork is the closed commercial one. Anchored on the well-known options:

I want Claude to do stuff with my files and apps while I work on something else. Cowork. Accept the security caveats; do not let it browse untrusted websites.
I want to ship a coding task to a hosted sandbox where my credentials never leave my laptop. Claude Code on the web.
I want an assistant that messages me on WhatsApp / iMessage / Telegram and runs on my own infrastructure. OpenClaw if channel breadth matters most; Hermes if isolation choice matters most.
I want the same agent reachable from my phone, running on a Modal pod that hibernates between sessions. Hermes with modal backend.
I want a small group to share one assistant across Slack and Discord, with new members onboarded via DM pairing. OpenClaw, sandbox non-main sessions, channel allowlists.
I want the assistant backed by my own local LLM. Hermes or OpenClaw — drop the local-server URL (Osaurus, Ollama) into the model config and it is just another provider, with cloud fallback. Cowork cannot: it is locked to Anthropic’s hosted Claude.

5 Where the agent lives

5.1 Moving parts

For a long-running agent — one that holds state across days, runs unattended, accepts inbound messages — “the agent” is at least four moving parts:

The foreground surface the user types into. Short-lived per session.
The daemon or gateway that holds state, routes messages, manages sandboxes. May or may not exist (Claude Code without web mode has no daemon; Hermes and OpenClaw revolve around theirs).
The execution sandbox where shell commands and code actually run. Local, containerised, or a cloud VM.
The channel surface — Telegram, Slack, Discord, email, a web UI, a phone.

Our example agents use these layers very differently.

5.2 Design patterns — sub-agents, plan mode, permission gates, compaction

A loose grab-bag of design patterns at the harness layer that have crystallised into industry conventions:

Sub-agents. Spawn an isolated child agent for a sub-task — its own conversation history, its own tool budget, returning a summary back to the parent. This buys context economy: the dirty work (grepping code, searching the web, writing throwaway spikes) goes to a child while the parent’s context stays lean enough to keep planning. Useful for “explore the codebase and find X” tasks that would otherwise pollute the parent’s context with read-tool calls. Claude Code supports this natively as agent teams; pi does not, on principle, but an extension adds it.

Plan mode. The agent writes out a plan, the user approves it (possibly editing), then the agent executes. Reduces wasted runs on tasks the agent has misunderstood.

Permission gates. Tool calls require user approval before execution. Most harnesses ship some form of this for “destructive” tools (file write, shell exec). The trade-off is more permission prompts vs more autonomy.

Compaction. When context approaches the model’s window limit, summarize older turns to free space. Done well, transparent; done badly, the agent loses the thread.

5.3 Long-horizon agents

A long-running agent — one that holds a task across hundreds of turns, or returns to the same project next week — hits three distinct barriers. The context window fills up (memory); single-turn decisions get less reliable the longer the run (reliability); and whatever the agent worked out evaporates when the session ends (evolution) because current LLMs are not continual learners. We owe this division to Xiaomi’s analysis for MiMo Code.

5.3.1 Memory

Ideally we hold state across the session, even when it blows our context budget. The simplest way to do this is compaction: summarize the old turns and carry on. But summarize-the-old-turns compaction degrades on long tasks.

🚧TODO🚧: get a decent paper summarizing how.

pi’s /tree is one solution, where we constantly select context manually

MiMo Code’s answer moves memory out of the main loop entirely. The main agent keeps no notes of its own. A separate writer subagent, fired by the runtime at fixed points — roughly 20%, 45%, and 70% of the context budget, deliberately early, while there is still room to think and before the lost-in-the-middle effect erodes the extraction — reads the conversation and writes a structured checkpoint to disk. As the window fills, the runtime opens a fresh one and rebuilds context from those files, so from the model’s point of view the conversation never breaks. Memory spans four layers on different lifecycles: a session checkpoint (checkpoint.md), persistent project knowledge (MEMORY.md), cross-project preferences, and a full unindexed SQLite trace of every message and tool call underneath, as the fallback when something is missing from the structured layers. I assume there is some RAG somewhere in the mix?

5.3.2 Multi-turn reliability

The second wall is reliability over long tasks. A long-horizon run is a chain of individual turns — read context, decide, call a tool, repeat. Each turn accumulates some additional chance of a wrong call: editing the wrong file, passing a test by changing the test… Those per-turn odds compound, so even a low rate dominates a long enough run — at i.i.d. error rate of 2% per turn, a 200-turn task finishes clean only about 2% of the time ($0.98^{200}\approx0.02$). A short interactive session is easy to correct — we, the humans, spot the bad turn and fix it. But the economic benefit of agents that they aim to sell us on is doing unattended multi-hour runs. In that case, we really want to juice that error rate down low to top out the famous reliable task-length metric.

MiMo Code attempts to deal with this in a few ways

Goal is their evolution of the Ralph Wiggum Loop: a stopping-condition verifier. The user states a done-condition in natural language, and each time the agent tries to stop, an independent model call checks the whole history against it. If the work is not finished, it whinges about any shortcomings and tries again.
Max mode also spends extra inference-time compute, sampling several candidate plans in parallel and using a low-temperature judge to choose one, which achieves a marginal gain at a large cost (possibly worth it for a high-stakes task).
Dynamic Workflow handles orchestration at scale: the main agent emits a JavaScript script — agent(), parallel(), pipeline(), workflow() — run deterministically in a sandbox, so branch and retry logic is guaranteed by code rather than by a model remembering to follow a SKILL.md.

That code-beats-prompt principle for predictable control flow is originally Anthropic’s.

5.3.3 Evolution — learning across sessions

OK, now ultra-long horizons. Not just tasks, but whole projects, or careers. If we come back to the same project regularly and each time the agent has forgotten everything, it feels like we’re leaving performance on the table, re-deriving the same constraints and repeating the same mistakes. MiMo runs two background agents against its own history to fix this. Dream fires every seven days — it reads past sessions and the memory file, then merges, deduplicates, checks that file references still resolve, and compresses the result back down. Distill fires every thirty days and hunts for process rather than facts: recurring work patterns that it solidifies into reusable skills, CLI commands, and SOP documents.

Hermes takes a different approach, learning the user rather than the project, accreting auto-generated skills, FTS5-searchable session history, and a persistent model of our preferences via Honcho’s dialectic user-modelling, on the pitch that the longer it runs the more it knows about us. OpenClaw keeps persistent memory in the same spirit. A self-writing skill library is also a standing risk: one bad prompt can mint a capability the agent reuses later, unprompted.

5.4 Hosting the gateway

The gateway has to be a live process for the messaging surfaces to work. Telegram, Discord, Slack, and friends need either a long-polling connection (the gateway dials out and waits) or a webhook endpoint (the messaging provider POSTs in) — both require an always-on Python process. Sub-agents on Modal or Daytona can hibernate; the parent gateway cannot, not if it expects to receive our next “hey, status?” from the phone. When my laptop sleeps, the bot is dead.

Five options for hosting the always-on gateway:

A $4–5/mo VPS (Hetzner, DigitalOcean, OVHCloud). The boring correct answer. Public IP, webhook channels Just Work, Hermes installs in a one-liner under systemd. Six months later we will have forgotten the VPS exists, which is the point.
A Raspberry Pi or repurposed laptop at home, with Tailscale Funnel or Cloudflare Tunnel handling the inbound webhook problem so we do not have to forward ports through our home router. ~$50–100 one-time + ~$1–2/mo electricity. Privacy wins; ISP and power outages become our problem. An old laptop with the lid closed and systemd-inhibit handle-lid-switch works just as well as a Pi.
Hermes deployed on Modal in webhook mode. Hibernates between messages, cold-starts on inbound, costs single-digit dollars per month for bursty personal use. Trade: ~2–5s cold-start latency on the first message after a quiet period.
Existing hardware — a NAS, a Mac mini behind the TV, an Intel NUC. Marginal cost is a few watts. Operational cost is whatever our tolerance is for “this is the box that holds the assistant; please don’t unplug it”.
Not a 24/7 RunPod instance. RunPod is GPU rental at $0.20–$4/hour — fine if we are also hosting a local model on the same box, but wasteful for a Python process that just proxies API calls. The right RunPod pattern is “spin up a GPU pod for inference, tear it down between uses” (a model-server play, not a gateway-hosting one).

For the “I want both the model AND the gateway on the same box” case, the question shifts from “where do I host the gateway” to “where do I host the model” — see running LLMs locally on a Mac and the Australian sovereign-LLM project for that side.

5.5 What can go wrong

The vulnerability surface of an agent is unusually broad — it spans the model, the tools it runs, the inputs it reads, and the vendor behind it. If classic computer security is about keeping intruders out of the house then agent security is more like managing a toddler who might let intruders in, burn it down, or open the door and run into traffic.

There are at least four kinds of failures that I have fretted about. Surely more exist in the wild.

The model makes mistakes. It misreads the task and does something destructive in good faith — deletes the wrong directory, force-pushes over someone’s work, pastes a secret into a public channel, emails the draft to the client instead of to us.
The code it runs is wrong. An agent that installs and runs software in good faith inherits that software’s bugs and side effects. A correct decision to call a tool is still only as safe as the tool itself.
The input is hostile. Web pages, emails, Slack messages, and documents can carry prompt injection that redirects the agent’s tools toward someone else’s goals. This is the famed prompt injection attack.
The provider is a trust assumption. Unless we are running the model ourselves, every prompt, file, and pasted secret the agent touches also goes to the token vendor. That makes the vendor a high-value target — a breach there can expose many customers’ data at once. Moreover, they are a party whose interests need not match ours: logs can be retained, scanned for training data, mined by an insider, or handed over under subpoena. Not divulging all our secrets to the Man is not a solved problem; self-hosting relocates that trust under our own roof rather than removing it.

tl;dr An agent wiring together our private data, code execution, and network access can do a lot of damage regardless of why it misbehaves. The job is less to build an impermeable wall than to reduce the rate and blast radius of the agent’s fuckups.

Mitigations are often layered:

At the agent layer. Bounded autonomy: permission gates on destructive tools, plan-mode approval before execution, default-deny pairing for unknown senders, read-only or narrowly-scoped tools, credential proxies so the agent never sees the raw secret, a human in the loop on anything irreversible. These catch good-faith mistakes and the clumsier injections before they fire.
At the isolation layer. Run the execution environment in a sandbox or a container, restrict network egress, keep the environment reclaimable. This bounds the damage of a bad action regardless of which failure mode produced it.

None of this is foolproof. Agents are creative at escaping constraints. A determined agent can still find ways through its own gates, either at the behest of an adversarial prompt or a surfeit of helpful enthusiasm. There is, as often in security, a trade-off between power and convenience, and by definition we are using agents because we want them to do powerful things on our behalf, so the more we lock them down, the less useful they become. The defaults tend to be weak, sliding toward “security theatre”. If we need to click the “approve” dialogue box 200 times, how well have we assessed the risks each time? How conversant are we in setting up the right sandboxes for each problem? Real-world sandboxes are routinely escapable and “not especially secure”. Isolation is a reduction of blast radius, not a wall. Zechner argues that once an agent can write and run code the lethal trifecta — private data, code execution, network reach — is already in play, so sprinkling permission prompts through the middle buys at best complexity and at worst an illusion of safety. He tends to favour learning to configure good sandboxes and monitoring the agent’s activity through logs and audit trails over trying to put guardrails on the agent itself.

Other things to think about:

Default-deny the untrusted path. Ignore external senders until paired (OpenClaw and Hermes both do this), and run web-touching sessions in a sandbox that can be thrown away. Cowork’s computer-use mode, for example, used no sandbox at all, and was demonstrated vulnerable within days of launch.
Credentials are vulnerable. Attackers and confused agents are happy to get their hands on our secrets: API keys, SSH keys, OAuth tokens, the .env file, the cloud credentials. These translate into real resources, the ability to impersonate us and outlast the current session. Once an agent has filesystem and network access, every secret in ~/.ssh, ~/.aws, or a .env is one cat and one POST away, which is why “grant the agent our home folder and hope” is such a weak default. Stronger patterns keep the raw secret out of the agent’s reach in a secret manager, dispensing short-lived, scoped tokens. Claude Code on the web routes git auth through such a proxy, so credentials never enter the sandbox; Hermes’s encrypted secret exchange decrypts on the gateway with an ephemeral key, so the cleartext lands inside the sandbox but the agent and the model never see it.
Watch the auto-skill-generation loop. When a harness writes its own skills from past tasks (Hermes; MiMo Code’s Dream/Distill passes), one bad prompt can mint a persistent capability the agent reuses later, unprompted. Audit ~/.hermes/skills/, .mimocode/, or the equivalent periodically.

I’m sure there are more failure modes than these; the point is to be thoughtful about the risks and mitigations, and to accept that some risk is inherent in powerful agents until alignment and capabilities both jointly achieve perfection.

6 Feeding documents in

A near-universal operational question for agentic systems: how do PDFs, Word docs, spreadsheets get from the file system into the agent’s context. There are three architectures, and harnesses spread across them.

Native multimodal model. The model itself ingests the document binary (or rendered pages) and processes text plus charts plus layout in one go. Highest fidelity. Only frontier closed-weight models (Claude, GPT-5.x, Gemini) currently do this well; most local open-weight models cannot.
Frontend- or harness-side text extraction. The client runs a PDF→text library and drops the result into context. Loses charts and visual layout but is usually fine for prose-heavy documents.
Agent-driven conversion via shell tools. The harness invokes a converter, reads the markdown back, and continues. Composable but more setup.

6.1 Harness-side affordances

Which architecture each harness gives us out of the box:

Harness	Ingestion	Notes
Qwen-Agent `Assistant`	Built-in RAG	`files=[long_pdf]` runs DocParser + BM25 hybrid retrieval, no vector DB; `.pdf/.docx/.pptx/.txt/.csv/.xlsx/.html`, 1M-token tested
Claude Desktop	Native multimodal	Drag-drop; vision models read the document binary, no extraction step
smolagents `CodeAgent`	Agent-driven	No pipeline, but the agent writes Python (`pypdf`, `pdfplumber`, `pandas`) in its loop as needed
Hermes, pi, OpenClaw	BYO tool	No reliable built-in extraction; Hermes’s `web_extract` handles PDF URLs → markdown, local files via a shell converter or skill

For non-trivial PDFs — maths, scans, complex layout — see PDF ingestion.

7 Should we? When an agent is the right tool

A short pragmatic note before this notebook starts to sound like an unconditional endorsement.

Agents are the right tool when:

The task is multi-step and exploratory (debug this, refactor that, find references to X).
The intermediate steps have value beyond the final answer (the agent reads files, runs commands, reports what it found).
Tool use unlocks capability the model lacks on its own (executing code, querying live data, browser automation).

Agents are not the right tool when:

The task is a one-shot transformation (translate this text, summarize this PDF, generate boilerplate).
The task is so specific that a hardcoded script is faster and less error-prone.
The cost of a wrong action is high and approval gates would make the agent slower than doing the task directly.

The current 2026 mania pushes hard in the direction of “use an agent for everything”; resist where appropriate.

8 Incoming

DeepPlanning benchmark (Zhang et al. 2026)
DeepPlanning
Qwen/DeepPlanning
Advanced Large Language Model Agents
Announcing the Agent2Agent Protocol (A2A) - Google Developers Blog
J-Rosser-UK/AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds (Rosser and Foerster 2025)

Scaffolding Large Language Models (LLMs) into multi-agent scaffolds often improves performance on complex tasks, but the safety impact of such scaffolds has not been as thoroughly explored. In this paper, we introduce AGENTBREEDER a framework for multi-objective evolutionary search over scaffolds. Our REDAGENTBREEDER evolves scaffolds towards jailbreaking the base LLM while achieving high task success, while BLUEAGENTBREEDER instead aims to combine safety with task reward. We evaluate the scaffolds discovered by the different instances of AGENTBREEDER and popular baselines using widely recognized reasoning, mathematics, and safety benchmarks. Our work highlights and mitigates the safety risks due to multi-agent scaffolding.
Why Simulator AIs want to be Active Inference AIs

9 References

Bengio, Cohen, Fornasiere, et al. 2025. “Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?”

Carey, Langlois, Merwijk, et al. 2025. “Incentives for Responsiveness, Instrumental Control and Impact.”

Chen, Dong, Shu, et al. 2023. “AutoAgents: A Framework for Automatic Agent Generation.”

Crutchfield, and Jurgens. 2025. “Agentic Information Theory: Ergodicity and Intrinsic Semantics of Information Processes.”

Everitt, Garbacea, Bellot, et al. 2025. “Evaluating the Goal-Directedness of Large Language Models.”

Guo, Chen, Wang, et al. 2024. “Large Language Model Based Multi-Agents: A Survey of Progress and Challenges.”

Hammond, Chan, Clifton, et al. 2025. “Multi-Agent Risks from Advanced AI.”

Hyland, Gavenčiak, Costa, et al. 2024. “Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents.” In.

Kalai, and Lehrer. 1993. “Rational Learning Leads to Nash Equilibrium.” Econometrica.

Li, Al Kader Hammoud, Itani, et al. 2023. “CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Language Model Society.” In Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23.

Qu, Dai, Wei, et al. 2025. “Tool Learning with Large Language Models: A Survey.” Front. Comput. Sci.

Rosser, and Foerster. 2025. “AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement.”

Schmidgall, Su, Wang, et al. 2025. “Agent Laboratory: Using LLM Agents as Research Assistants.”

Walters, Kaufmann, Sefas, et al. 2025. “Free Energy Risk Metrics for Systemically Safe AI: Gatekeeping Multi-Agent Study.”

Wu, Bansal, Zhang, et al. 2023. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.”

Zhang, Jiang, Li, et al. 2026. “DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints.”