You build your agents.
We make them durable.
Open-source durable execution for Python agents. Checkpoint, replay, and deploy any framework, any cloud.
pip install kitaru

Your agents work. But they're trapped.
They die when your laptop sleeps
One crash, one timeout, or one closed lid and hours of LLM calls vanish. You start over from scratch.
They can't coordinate with other agents
No pause, no resume, no human-in-the-loop. Every agent is a fire-and-forget black box.
Their intermediate results vanish when they crash
No checkpoints, no artifacts, no audit trail. When something goes wrong you have nothing to debug.
Kitaru moves your agents from your machine to your infrastructure.
Eight primitives. Full durability.
import kitaru
from kitaru import flow, checkpoint

kitaru.configure(stack="kubernetes")

@checkpoint
def research(topic: str) -> dict:
    results = search_web(topic)
    kitaru.save("sources", results)
    return summarize(results)

@checkpoint
def write_draft(context: str, prev_id: str) -> str:
    prior = kitaru.load(prev_id, "sources")
    return kitaru.llm(
        f"Draft a report on: {context}\nPrior sources: {prior}",
        model="gpt-4o",
    )

@flow
def report_agent(topic: str, prev_id: str) -> str:
    data = research(topic)
    draft = write_draft(str(data), prev_id)
    kitaru.log(topic=topic, words=len(draft.split()))
    approved = kitaru.wait(
        schema=bool, question="Publish?"
    )
    if approved:
        publish(draft)
    return draft

@flow Top-level orchestration boundary. Marks a function as a durable workflow.
@checkpoint Persists output. Crash at step 3? Steps 1-2 never re-run.
kitaru.wait() Suspends the process. Resume when a human responds, 30s or 3 days later.
kitaru.llm() Resolves model alias, injects API key, logs cost automatically.
kitaru.log() Structured metadata on every execution. Query it in the dashboard.
kitaru.save() Persist any artifact by name inside a checkpoint.
kitaru.load() Retrieve saved artifacts from any prior execution by ID.
kitaru.configure() Set stack, project, and runtime defaults. Zero config locally.
Four things. Not forty.
Everything your agents need to ship. Nothing extra to learn.
Pause. Get input. Continue.
Suspends at decision points, releases compute, and picks up when input arrives from a human, another agent, or a webhook. Hours or days later.
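One way to picture suspend-and-resume in plain Python (a sketch of the idea, not Kitaru internals): the workflow yields at the decision point, its position is held, and execution continues only when an answer is supplied, whether that takes seconds or days.

```python
def review_flow(draft: str):
    """A workflow that suspends at a decision point.
    Yielding hands control back until an answer arrives."""
    approved = yield f"Publish? ({draft})"  # execution pauses here
    if approved:
        return "published"
    return "discarded"

# Driver: start the flow and capture the question it is waiting on.
flow = review_flow("Q3 report")
question = next(flow)          # runs up to the decision point, then suspends

# ...hours or days pass; a human, another agent, or a webhook answers...
try:
    flow.send(True)            # resume exactly where it left off
except StopIteration as done:
    result = done.value
```

A durable system additionally serializes the suspended state so the process itself can exit and release compute while waiting; the generator only models the control flow.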
Crash at step 6? Resume from step 6.
Every step is checkpointed. Fix the issue and pick up where you left off; no re-burning tokens.
PydanticAI. CrewAI. Whatever.
Wrap it with Kitaru and it becomes durable. No lock-in. No opinions. Your framework, our infrastructure.
Fan out. Each branch checkpoints independently.
checkpoint.submit() dispatches branches concurrently. Each has its own checkpoint history. Replay just the failed branch.
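The fan-out behavior can be sketched with the standard library. This is illustrative only: `run_branch` and `_BRANCH_STORE` stand in for per-branch checkpointing, and `flaky` simulates one branch failing on its first attempt.

```python
from concurrent.futures import ThreadPoolExecutor

_BRANCH_STORE: dict = {}  # illustrative per-branch checkpoint store

def run_branch(name: str, work):
    """Each branch checkpoints independently: a finished branch
    is never re-run; only failed branches do work on retry."""
    if name not in _BRANCH_STORE:
        _BRANCH_STORE[name] = work()
    return _BRANCH_STORE[name]

def flaky(attempts={"n": 0}):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("branch failed")
    return "c"

branches = {"a": lambda: "a", "b": lambda: "b", "c": flaky}

def fan_out():
    results, failed = {}, []
    with ThreadPoolExecutor() as pool:
        futures = {n: pool.submit(run_branch, n, w) for n, w in branches.items()}
        for name, fut in futures.items():
            try:
                results[name] = fut.result()
            except RuntimeError:
                failed.append(name)
    return results, failed

first, failed = fan_out()   # "c" fails on its first attempt; "a" and "b" checkpoint
retry, _ = fan_out()        # only "c" re-runs; "a" and "b" replay from the store
```

Because each branch keys its own checkpoint, retrying the flow re-executes only the failed branch rather than the whole fan-out.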
Why teams choose Kitaru
Observability built-in
The dashboard ships free with the server. See every checkpoint, every LLM call, and what each step cost. Not a paid add-on.
Full execution control
Pause at decision points, get human input, and resume hours or days later. Replay from any checkpoint without re-running everything.
Deployment flexibility
Same Python code runs locally, on Kubernetes, or across AWS, GCP, and Azure. Switch stacks with one command.
Python-first, no lock-in
Wrap any Python agent framework (PydanticAI, CrewAI, or raw code) with Kitaru decorators. No DSL to learn and no vendor lock-in.
Orchestration layer.
Not a framework.
import kitaru
from kitaru import flow, checkpoint

@flow
def coding_agent(issue: str) -> str:
    plan = analyze_issue(issue)
    patch = write_code(plan)
    # Pauses. Resumes when input arrives.
    approved = kitaru.wait(
        bool, question="Merge this PR?"
    )
    if approved:
        merge(patch)
    return patch

Your agent crashed at step 5.
Stop re-running steps 1 through 4.
pip install kitaru

Open source (Apache 2.0). pip install and go.
Why teams trust Kitaru
Built on the same ZenML infrastructure running production ML at scale for 5+ years.