Production notes and upgrade paths

Which pieces of the Agent Harness Platform tour are teaching stand-ins, where each one plugs into production, and what to harden before you rely on the pattern

Every stage in this tour runs on a laptop. To keep it that way, the example uses stand-ins: Docker for the sandbox, local markdown files for procedures, a self-signed proxy and mock HTTP services for credentials and internal calls. Those stand-ins are good enough to run and read, but they are not the part you would ship.

The part worth keeping is the shape. Each capability sits behind a small, named swap point, so you can replace the laptop-friendly version with something your platform team trusts without rewriting the agent. This page lists the teaching stand-ins, the production pieces they stand in for, and the place where each swap happens.

The Kitaru primitives do not change: durability, wait/resume, and replay are the same concepts you would use in production. The production work is configuring the runtime and storage properly, then hardening the platform around those primitives. Everything else on this page lives in the example's forkable agent_harness_platform/ library, not in Kitaru itself.

This example is a runnable reference architecture, not a hardened platform. As shipped it cannot safely contain code that is actively trying to break out, and it is not a turnkey production system. Treat the sections below as a starting checklist rather than a guarantee.

Stand-ins at a glance

| Capability | Tutorial stand-in | What a fork swaps in | The seam |
| --- | --- | --- | --- |
| Durable execution (Stage 1) | granular_checkpoints=False for readable logs | The KitaruAgent default: granular checkpoints for model requests and tool calls | the granular_checkpoints flag |
| Command sandbox (Stage 2) | DockerSandbox on your laptop | Stronger isolation for risky commands | run(command) -> ExecResult |
| Agent procedures (Stage 3) | LocalSkillSource over local markdown | A reviewed, versioned procedure store | SkillSource.resolve() -> Path |
| Service credentials (Stage 4) | mitmproxy + self-signed CA on a shared Docker network | Network isolation, proxy authorization, and production secret handling | SandboxProxyRule plus the proxy container |
| Structured actions (Stage 5) | Typed handlers calling mocks/server.py | Real internal services with per-action authorization | the exec_service registry |
| Human approval (Stage 6) | Local terminal prompt | Dashboard, CLI, or REST answered by a known operator | kitaru.wait() |

The rest of this page walks each row in tour order.

Durable execution

The tour passes granular_checkpoints=False so each agent turn is one log block you can point at while learning. A real fork removes the flag and takes the KitaruAgent default, where every model request and every tool call gets its own checkpoint and its own cache key. A flow that crashed after the third model request then resumes from that point: the first three requests are served from their checkpoints, and only the calls after the third are executed.

The durability story is identical either way. The only change is how finely work is cached and how finely it shows up on the dashboard. Nothing about Kitaru itself changes here. See Stage 1.
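The effect of per-call cache keys can be illustrated with a toy model. The sketch below is not the Kitaru API: CheckpointStore, run_flow, and the key format are hypothetical names invented here, and a plain dictionary stands in for durable storage. It only shows why a retry after a crash executes the calls that have no saved result yet.

```python
from typing import Callable, Dict, List, Optional


class CheckpointStore:
    """Toy stand-in for durable storage: maps a cache key to a saved result."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.executed: List[str] = []  # keys that actually ran (cache misses)

    def run(self, key: str, fn: Callable[[], str]) -> str:
        # A saved result short-circuits the call; otherwise run and save it.
        if key in self._results:
            return self._results[key]
        result = fn()
        self._results[key] = result
        self.executed.append(key)
        return result


def run_flow(store: CheckpointStore, crash_after: Optional[int] = None) -> List[str]:
    """Five fake model requests, each behind its own cache key."""
    outputs = []
    for i in range(1, 6):
        outputs.append(store.run(f"model_request:{i}", lambda i=i: f"response-{i}"))
        if crash_after == i:
            raise RuntimeError(f"simulated crash after request {i}")
    return outputs


store = CheckpointStore()
try:
    run_flow(store, crash_after=3)  # first attempt dies after the third request
except RuntimeError:
    pass
first_attempt = list(store.executed)

store.executed.clear()
outputs = run_flow(store)  # retry: requests 1-3 come from the store
second_attempt = list(store.executed)
```

With one checkpoint per agent turn instead, the whole turn would share a single key, so a crash partway through the turn would repeat every request in it.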

Sandbox isolation

The exec tool depends on one method: run(command) -> ExecResult. DockerSandbox is the laptop implementation. Docker gives a real filesystem, process, and network boundary, which is enough to contain an agent that runs a bad rm. It is not a wall against code that is actively trying to break out, because the container shares the host's Linux kernel. When the commands matter, swap the backend at that one seam. The options below are common choices, most of which add isolation beyond Docker; the right one depends on your threat model, workload, and platform configuration:

  • Docker is the easy local default and stops accidental damage to the host.
  • gVisor runs each container against a user-space kernel, so syscalls hit gVisor before the host kernel. It keeps the container-shaped experience while adding stronger isolation.
  • Kata Containers wraps each container in a lightweight VM, so a kernel escape lands in the VM rather than on the host.
  • Firecracker gives each run its own minimal microVM with a small attack surface (the technology behind AWS Lambda).
  • Hosted sandboxes such as E2B, Modal, or Daytona run the commands on isolated infrastructure, so nothing executes on your own machine.
  • WebAssembly isolates by default but is a poor fit for arbitrary bash; it suits cases where you can constrain what the agent runs.

The agent's tool wiring does not change when you swap, because the swap happens behind run(command). See Stage 2.
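A minimal sketch of that seam, assuming the tool needs nothing beyond run(command) -> ExecResult. The ExecResult field names, the Sandbox protocol, both backends, and exec_tool are hypothetical stand-ins written for this illustration; the example library's real classes may look different. Neither backend executes anything, so the sketch is safe to run anywhere.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class ExecResult:
    # Field names are assumptions made for this sketch.
    exit_code: int
    stdout: str
    stderr: str = ""


class Sandbox(Protocol):
    """The whole seam: one method that takes a command string."""

    def run(self, command: str) -> ExecResult: ...


class RecordingSandbox:
    """Hypothetical backend that records commands without executing them."""

    def __init__(self) -> None:
        self.commands: List[str] = []

    def run(self, command: str) -> ExecResult:
        self.commands.append(command)
        return ExecResult(exit_code=0, stdout=f"(dry run) {command}")


class RefusingSandbox:
    """Hypothetical backend that refuses every command."""

    def run(self, command: str) -> ExecResult:
        return ExecResult(exit_code=126, stdout="", stderr="execution disabled")


def exec_tool(sandbox: Sandbox, command: str) -> str:
    """Tool-side code depends only on the protocol, never on a concrete backend."""
    result = sandbox.run(command)
    if result.exit_code != 0:
        return f"exit {result.exit_code}: {result.stderr}"
    return result.stdout
```

A gVisor, Kata, Firecracker, or hosted backend would be one more class with the same run method, and exec_tool would not need to change.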

Skill sources

Procedures live behind SkillSource, an interface with a single method, resolve() -> Path. The skill tool only ever sees a directory of markdown; it never learns where that directory came from. The example ships LocalSkillSource for local files. Common fork targets:

  • GitRepoSkillSource clones a versioned skill repo at flow start, so procedures are reviewed through pull requests and shared across teammates and running agents.
  • InlineMarkdownSkillSource bakes the markdown straight into the Profile, which suits one-off agents, tests, or skills generated by another flow.
  • Object storage, Kitaru artifacts, or a container-image bake cover stricter deployment shapes.

Be clear about what Kitaru does and does not do here. Kitaru supplies the durable flow the read-and-act cycle runs inside. The example library supplies Profile.skill_source, SkillSource, LocalSkillSource, profile-gated access to the tool, and the swappable source seam. Kitaru does not ship skill versioning, review, diffs, change history, or any UI that surfaces edits. Those are real concerns, and a SkillSource subclass is where your fork adds them. See Stage 3.
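As a rough sketch of the source seam, the code below models SkillSource as a one-method protocol and gives one possible shape for an inline-markdown source. The implementation is an assumption written for illustration, not the library's code: the constructor argument, the temporary-directory strategy, and the list_skills helper are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Protocol


class SkillSource(Protocol):
    """Anything that can produce a directory of markdown procedures."""

    def resolve(self) -> Path: ...


class InlineMarkdownSkillSource:
    """Hypothetical implementation: writes in-memory markdown to a temp directory."""

    def __init__(self, skills: Dict[str, str]) -> None:
        self._skills = dict(skills)

    def resolve(self) -> Path:
        root = Path(tempfile.mkdtemp(prefix="skills-"))
        for name, body in self._skills.items():
            (root / f"{name}.md").write_text(body, encoding="utf-8")
        return root


def list_skills(source: SkillSource) -> List[str]:
    """Consumer side: sees only a directory, never where it came from."""
    return sorted(path.stem for path in source.resolve().glob("*.md"))


source = InlineMarkdownSkillSource(
    {"restart-service": "# Restart\n1. Drain traffic.\n2. Restart the unit.\n"}
)
skills = list_skills(source)
```

A Git-backed source would differ only inside resolve(), which is also the natural place to pin a commit or verify a signature before returning the path.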

Credential architecture

Stage 4 draws one trust boundary: the credential that authorizes a call lives in the proxy container, and the process running model-chosen commands lives in the worker. The boundary itself is the lesson. The implementation around it is built for a laptop and needs hardening before you lean on it.

What is a stand-in:

  • The sandbox, proxy, and mock all share one Docker network, so the worker can reach the upstream host directly. That direct call fails with 401, because only the proxy can add the Authorization header, but the network path itself is open.
  • The per-run proxy bearer sits in the worker's http_proxy environment variable, so a prompt-injected agent can read it. That bearer only gets a request through the proxy; it does not limit which requests the worker may make.
  • Network reachability is not authentication. This pattern stops raw-token exfiltration, but on its own it is not per-path or per-method authorization.

What a production fork adds:

  • per-role networks, with the worker isolated (optionally under egress policies) and upstreams reachable only from the proxy;
  • host, path, and method allowlists on the proxy, or per-agent rules, so reaching the proxy does not grant arbitrary authenticated calls to an allowed host;
  • credentials mounted as files rather than environment variables;
  • mTLS in place of the per-run basic-auth-as-bearer;
  • the persistent-shell completion signal on a side channel rather than mixed into stdout.

The seam is SandboxProxyRule plus the proxy container. See Stage 4.
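The host, path, and method allowlist can be sketched as a default-deny check. This is an illustration of the idea only, not the SandboxProxyRule API: AllowRule, is_allowed, and the tickets.internal host are hypothetical, and a real proxy would also need to handle percent-encoded paths and other normalization cases that this sketch ignores.

```python
import posixpath
from dataclasses import dataclass
from typing import FrozenSet, Iterable
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AllowRule:
    # Hypothetical rule shape: one host, one path prefix, a set of methods.
    host: str
    path_prefix: str
    methods: FrozenSet[str]


def is_allowed(rules: Iterable[AllowRule], method: str, url: str) -> bool:
    """Default deny: a request passes only if some rule matches all three parts."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Collapse "." and ".." segments so "/tickets/../admin" cannot match "/tickets".
    path = posixpath.normpath(parts.path or "/")
    for rule in rules:
        if host != rule.host:
            continue
        if method.upper() not in rule.methods:
            continue
        prefix = rule.path_prefix.rstrip("/")
        # Match the prefix itself or a path below it, never a sibling like "/ticketsX".
        if path == prefix or path.startswith(prefix + "/"):
            return True
    return False


RULES = [AllowRule("tickets.internal", "/tickets", frozenset({"GET", "POST"}))]
```

With a check like this in front of credential injection, holding the proxy bearer lets the worker make only the listed calls rather than any authenticated request to an allowed host.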

Service boundaries

exec is for shell-shaped work the agent reasons about as command output. exec_service is for structured host-side actions: look up a record, file a ticket, publish a summary, call an internal control plane. The typed boundary is the natural place to decide which agent may call which action. In the tour the handlers hit a local mock, and each handler holds a single credential resolved from a Kitaru secret. A fork:

  • swaps the mock handlers for real internal services;
  • adds per-service authorization at the boundary, so "may this agent call this action" is answered in one place rather than trusted to the model;
  • keeps shell work on exec and structured work on exec_service, instead of letting the model hand-assemble curl for actions that have a defined input and output.

Adding a service takes three pieces: a handler, a Pydantic args and result pair, and an ALL_SERVICES entry. The tool surface and its description update from the profile's allowed_services. See Stage 5.
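The pieces of a service, plus the authorization check at the boundary, might look roughly like the sketch below. It borrows the names ALL_SERVICES and allowed_services from the text, but everything else is an assumption: the Profile shape, the lookup_record handler, and the dispatch function are hypothetical, and standard-library dataclasses stand in for the Pydantic models so the sketch has no dependencies.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


@dataclass(frozen=True)
class LookupArgs:
    record_id: str


@dataclass(frozen=True)
class LookupResult:
    record_id: str
    status: str


def lookup_record(args: LookupArgs) -> LookupResult:
    # Stand-in handler; a fork would call the real internal service here.
    return LookupResult(record_id=args.record_id, status="open")


# Registry entry: action name -> handler.
ALL_SERVICES: Dict[str, Callable[[Any], Any]] = {"lookup_record": lookup_record}


@dataclass(frozen=True)
class Profile:
    # Hypothetical shape: only the fields this sketch needs.
    name: str
    allowed_services: FrozenSet[str]


def exec_service(profile: Profile, action: str, args: Any) -> Any:
    """Single choke point: unknown or unauthorized actions never reach a handler."""
    if action not in ALL_SERVICES:
        raise KeyError(f"unknown service action: {action}")
    if action not in profile.allowed_services:
        raise PermissionError(f"{profile.name} may not call {action}")
    return ALL_SERVICES[action](args)


triage = Profile("triage", frozenset({"lookup_record"}))
reporter = Profile("reporter", frozenset())
result = exec_service(triage, "lookup_record", LookupArgs("rec-1"))
```

Because the check lives in the dispatcher rather than in each handler, adding a service cannot accidentally skip authorization.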

Human approval and operator surface

On a laptop the operator answers ask_question at the same terminal. On a server the same kitaru.wait() record is answered through the dashboard, CLI, or REST API; the non-interactive run in Stage 6 already has that shape, and the durable pause works the same way in production. kitaru.wait() provides the pause/resume mechanism; the operator surface around it is where identity, authentication, authorization, and audit policy are enforced. A real platform also needs to:

  • treat operator input as untrusted and escape it before it reaches anything that interprets bytes, such as an HTML renderer, a shell, or SQL; the example passes it through verbatim;
  • record who approved what when your environment needs an audit trail of approver identity and time;
  • when you drop the teaching-only granular_checkpoints=False, exempt wait-bearing tools so they still run at flow scope, for example tool_checkpoint_config_by_name={"ask_question": False}.

See Stage 6.
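The first two points above can be sketched with the standard library alone. The sample answer, the approver name, and the approvals table are hypothetical. The sketch escapes the same operator answer for three interpreters and stores it with approver identity and time, using a bound SQL parameter instead of string concatenation.

```python
import html
import shlex
import sqlite3
from datetime import datetime, timezone

# A hostile-looking operator answer, used verbatim as test input.
answer = "<script>alert(1)</script>'; DROP TABLE approvals; --"

# HTML: escape before rendering in a dashboard.
safe_html = html.escape(answer)

# Shell: quote so the answer is one inert argument, never shell syntax.
safe_shell_arg = shlex.quote(answer)

# SQL: bind as a parameter so the answer is data, not part of the statement.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE approvals (approver TEXT, approved_at TEXT, answer TEXT)"
)
conn.execute(
    "INSERT INTO approvals (approver, approved_at, answer) VALUES (?, ?, ?)",
    ("alice", datetime.now(timezone.utc).isoformat(), answer),
)
stored_approver, stored_answer = conn.execute(
    "SELECT approver, answer FROM approvals"
).fetchone()
```

Each interpreter needs its own escaping, so the raw answer is best kept unchanged in storage and escaped at the point of use.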

What to harden before you rely on this

If you adopt this pattern for real work, work through these in roughly this order, worst blast radius first:

  1. Isolation for command execution. If the agent can run model-authored or untrusted commands that matter, replace Docker with a stronger backend (gVisor, Kata, Firecracker, or a hosted sandbox). Docker alone is not a hostile-code boundary.
  2. The credential and network boundary. Put the worker on its own network, add host, path, and method allowlists to the proxy, move secrets to files or a secret manager, and prefer mTLS. The shared Docker network and the visible proxy bearer are tutorial shortcuts.
  3. Authorization at the service boundary. Decide which agent may call which exec_service action, and enforce it in one place rather than trusting the model to stay in bounds.
  4. Input that crosses a trust boundary. Treat operator answers and service arguments as untrusted data. Validate and escape them before they reach anything that interprets bytes, such as HTML, SQL, or a shell.
  5. Model-authored commands. Do not rely on escaping to make arbitrary shell commands safe. Use sandboxing, command allowlists, typed service alternatives, and careful action design.
  6. Side-effect idempotency. Design shell and API mutations around replay and cache behavior. Actions such as git push, curl POST, database writes, ticket creation, and webhook posts need operation IDs, idempotency keys, deduplication, or checkpoint boundaries that prevent duplicated or missing effects.
  7. Procedure governance. Add the review, provenance, and version pinning that Kitaru does not ship for skills, on whichever SkillSource you choose.
  8. Checkpoint and log data governance. Treat checkpoint data, tool arguments and results, model messages, wait answers, and logs as potentially sensitive. Add retention rules, access controls, redaction or filtering, and deletion or export rules where your environment requires them.
  9. The platform around all of this. Identity, policy, observability, deployment, and a production secret store are out of scope for the example. They are yours to bring.

Kitaru's job in this list is narrow: once the runtime and storage are configured for your environment, it keeps durable execution, wait/resume, and replay working across process failure. The rest is platform work, and the point of the example is that each item above has an obvious place to land.
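For the side-effect idempotency item, one common approach is to derive a stable key from the run, the step, and the payload, and have the receiving service deduplicate on it. The sketch below assumes that approach; idempotency_key, TicketService, and the run and step identifiers are hypothetical names, and the in-memory service stands in for a real downstream that supports idempotency keys.

```python
import hashlib
import json
from typing import Dict


def idempotency_key(run_id: str, step: str, payload: dict) -> str:
    """Same run, step, and payload always yield the same key, even across replays."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{run_id}:{step}:{body}".encode("utf-8")).hexdigest()


class TicketService:
    """Hypothetical downstream that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, int] = {}
        self.created = 0

    def create_ticket(self, key: str, payload: dict) -> int:
        # A repeated key returns the original ticket instead of creating another.
        if key in self._by_key:
            return self._by_key[key]
        self.created += 1
        self._by_key[key] = self.created
        return self.created


service = TicketService()
payload = {"title": "Disk full on host-7", "severity": "high"}
key = idempotency_key("run-123", "file_ticket", payload)

first_id = service.create_ticket(key, payload)
replayed_id = service.create_ticket(key, payload)  # e.g. a replay after a crash
```

The key depends only on inputs that are identical on replay, which is what lets a retried step reach the service twice without creating two tickets.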
