Your agent runs.
Now make it survive.
Lightweight durable execution for AI agents in Python. Crash recovery, cost tracking, human-in-the-loop, and full lineage — without the distributed systems baggage.
Open source. Free to start. No credit card required.
You built the agent.
Now run it without duct tape.
Agents aren't microservices. They don't need microservice infrastructure.
Too heavy
Built for microservice transactions, not agents. Python as an afterthought. Weeks to set up. You need a distributed systems degree to debug the event history.
Owns your agent
Opinionated about memory, message format, state schema. When you rewrite your agent next month, you rewrite everything.
Locked to one cloud
AWS Step Functions, Azure Durable Functions, Cloudflare Workers. Different APIs, different limits, no portability.
Months of glue
Temporal + LangSmith + custom retry logic + cost tracking scripts + deployment infra. You're building infrastructure, not agents.
Infrastructure that survives
every framework rewrite.
Wrap your existing agent. Kitaru handles durability, cost tracking, replay, and human-in-the-loop underneath.
Crash recovery without replay complexity
Kitaru checkpoints every step output. On failure, your workflow re-executes and skips completed steps via cache hits. No determinism constraints. No replay brittleness. Deploy new code without breaking running agents.
Cost tracking you didn't have to build
Every LLM call is automatically instrumented. Tokens, cost, latency, model version, prompt hash — all queryable from the metadata store. Sort runs by cost. Set per-agent budgets. No Langfuse bolted on.
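The shape of that instrumentation looks roughly like this sketch. All names here (`tracked_llm_call`, the per-token price, the list-backed store) are illustrative assumptions, not Kitaru's API: wrap each LLM call, record tokens, cost, and latency per run, and "sort runs by cost" becomes an ordinary query over the metadata.

```python
import time

# Hypothetical metadata store: one record per instrumented LLM call.
metadata_store: list[dict] = []

def tracked_llm_call(run_id, model, prompt, call_fn):
    """Sketch of automatic instrumentation: time the call and record
    tokens, cost, and latency alongside the run that made it."""
    start = time.perf_counter()
    text, tokens = call_fn(prompt)           # your real LLM client goes here
    metadata_store.append({
        "run_id": run_id,
        "model": model,
        "tokens": tokens,
        "cost_usd": tokens * 3e-6,           # assumed per-token price
        "latency_s": time.perf_counter() - start,
    })
    return text

def fake_llm(prompt):                         # stand-in for a real model
    return f"echo: {prompt}", len(prompt.split())

tracked_llm_call("run-1", "gpt-x", "summarize the incident report", fake_llm)
tracked_llm_call("run-2", "gpt-x", "draft a reply", fake_llm)

# "Sort runs by cost" is a one-line query over the store:
by_cost = sorted(metadata_store, key=lambda r: r["cost_usd"], reverse=True)
```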
Replay from any step. Change the input. Compare.
Your agent made a bad plan? Go back to that step, modify the input, replay from there. Compare both runs side-by-side in the dashboard. Content-addressable, versioned, diffable checkpoints — lineage tracking for agents.
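Content-addressable checkpoints make run diffing cheap, as this hypothetical sketch shows: hash each step's output, replay the same pipeline with a modified input, and compare hashes to see exactly which steps diverged. The function names are illustrative, not Kitaru's API.

```python
import hashlib
import json

def content_key(obj) -> str:
    """Content-addressable key: identical step outputs share a hash."""
    blob = json.dumps(obj, sort_keys=True, default=str).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def run_pipeline(steps, inputs):
    """Run named steps in order, recording each output's content hash."""
    lineage, value = {}, inputs
    for name, fn in steps:
        value = fn(value)
        lineage[name] = content_key(value)
    return lineage

def diff_runs(a, b):
    """Names of steps whose output hashes differ between two runs."""
    return [s for s in a if a[s] != b.get(s)]

steps = [("plan", lambda x: x.split()), ("act", lambda xs: sorted(xs))]
run_a = run_pipeline(steps, "deploy then test")
run_b = run_pipeline(steps, "test then deploy")   # replay with a modified input
print(diff_runs(run_a, run_b))                    # only diverged steps show up
```

Here the modified input changes the plan step's output but both runs converge at the act step, so the diff pinpoints `plan` as the only divergence.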
Human-in-the-loop as a primitive
Not a hack, not a webhook — a first-class primitive. wait_for_input() suspends execution, releases compute, and resumes when a human provides input. Hours or days later.
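The suspend-and-resume shape can be illustrated with a plain Python generator. This is only an analogy, not Kitaru's implementation: `yield` stands in for a hypothetical wait_for_input(), pausing the workflow at the approval point until a human answer arrives; a real engine would persist the suspended state and release compute in between.

```python
def approval_workflow():
    """Sketch of a suspendable workflow: `yield` plays the role of
    wait_for_input(). Execution pauses here until a human answers."""
    fixes = ["sanitize token", "rotate secret", "pin dependency"]
    answer = yield f"Agent wants to apply {len(fixes)} fixes to src/auth.py. Approve?"
    if answer == "yes":
        return "applied"
    return "discarded"

wf = approval_workflow()
question = next(wf)        # runs until the wait point, then suspends
# ... hours or days pass; the engine persists state and releases compute ...
try:
    wf.send("yes")         # the human's input resumes the workflow
except StopIteration as done:
    result = done.value    # the workflow's final return value
```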
Start local. Deploy anywhere.
No workers. No queues. No BS.
Agents aren't microservices — they don't need microservice infrastructure. No Temporal servers, no worker fleets, no event sourcing. Just kitaru dev on your laptop, then kitaru deploy when you're ready.
What you don't need
- No Temporal server: no server cluster to manage, no worker processes to scale
- No message queues: no RabbitMQ, no SQS, no Kafka. Checkpoints, not events.
- No determinism constraints: write normal Python. No replay rules, no side-effect restrictions.
- No vendor lock-in: self-host anywhere. AWS, GCP, Azure, or your own Kubernetes.
Understand your agent
at a glance.
No 500-line event history. No distributed tracing PhD. A clean dashboard that shows exactly what your agent did, what it cost, and where it went wrong.
Agent wants to apply 3 fixes to src/auth.py. Approve?
Not just durability.
The full agent lifecycle.
Built-in tools to build, debug, and iterate on your agents. MCP servers for tool discovery. Skills for reusable capabilities. Replay loops for debugging. Observability integrations for production.
MCP servers
Built-in MCP servers for tool discovery and management. Your agents find and use tools through a standard protocol — no custom integrations.
Debug and replay
Your agent made a bad decision at step 3? Go back, change the input, replay from there. Compare both runs side-by-side. Iterate until it works.
Observability
Plays nicely with your existing observability stack. Export traces, connect to your preferred monitoring tools. We capture the data — you choose where it goes.
Skills and templates
Reusable agent capabilities you can compose. Pre-built skills for common patterns — code review, data analysis, research — customize and extend.
Not another framework.
The layer underneath.
| | Temporal | LangGraph | DBOS | Kitaru |
|---|---|---|---|---|
| Crash recovery | ✓ replay | ~ checkpoints | ✓ DBOS Cloud | ✓ checkpoints |
| Versioned step outputs | ✗ | ✗ | ✗ | ✓ built-in |
| Run diffing | ✗ | ✗ | ✗ | ✓ built-in |
| Cost tracking per run | ✗ | ✗ | ✗ | ✓ automatic |
| Cross-run lineage | ✗ | ✗ | ✗ | ✓ built-in |
| Python-native DX | ~ painful | ✓ | ✓ | ✓ decorators |
| Framework-agnostic | ✓ | ✗ LangChain | ~ DBOS only | ✓ any agent |
| No determinism tax | ✗ strict rules | ✓ | ✓ | ✓ + linter |
| Self-hosted / any cloud | ✓ | ✗ LangSmith | ✗ DBOS Cloud | ✓ any cloud |
Built on the foundation of ZenML.
Battle-tested at scale.
Kitaru is built by the team behind ZenML — the open-source MLOps framework trusted by hundreds of teams to orchestrate production ML pipelines. The same engine that runs thousands of pipelines now powers your agents.
Centralized their AI platform on Kubernetes with ZenML's orchestration engine.
Read case study →
Decreased time-to-market from 2 months to 2 weeks with ZenML pipelines.
Read case study →
Accelerated model development by 80% using ZenML's orchestration layer.
Read case study →
Kitaru uses the same checkpoint engine, metadata store, and cloud connectors that power ZenML — now purpose-built for AI agents that need to run for hours, survive crashes, and scale to thousands of concurrent executions.
Ship agents.
Not infrastructure.
Kitaru is launching soon. Join the waitlist for early access, and we'll tell you the moment it's ready.
Open source. Free to start. No credit card required.
Built by the team behind ZenML — production ML orchestration trusted by hundreds of teams.