From the makers of ZenML · Open source (Apache 2.0)

You build your agents.
We make them durable.

Open-source durable execution for Python agents. Checkpoint, replay, and deploy any framework, any cloud.

pip install kitaru
Watch the demo
THE PROBLEM

Your agents work. But they're trapped.

They die when your laptop sleeps

One crash, one timeout, or one closed lid and hours of LLM calls vanish. You start over from scratch.

They can't coordinate with other agents

No pause, no resume, no human-in-the-loop. Every agent is a fire-and-forget black box.

Their intermediate results vanish when they crash

No checkpoints, no artifacts, no audit trail. When something goes wrong you have nothing to debug.

Kitaru moves your agents from your machine to your infrastructure.

SEE THE CODE

Eight primitives. Full durability.

agent.py
import kitaru
from kitaru import flow, checkpoint
 
kitaru.configure(stack="kubernetes")
 
@checkpoint
def research(topic: str) -> dict:
    results = search_web(topic)
    kitaru.save("sources", results)
    return summarize(results)
 
@checkpoint
def write_draft(context: str, prev_id: str) -> str:
    prior = kitaru.load(prev_id, "sources")
    return kitaru.llm(
        f"Draft a report on: {context}\nPrior sources: {prior}",
        model="gpt-4o",
    )
 
@flow
def report_agent(topic: str, prev_id: str) -> str:
    data = research(topic)
    draft = write_draft(str(data), prev_id)
    kitaru.log(topic=topic, words=len(draft.split()))
 
    approved = kitaru.wait(
        schema=bool, question="Publish?"
    )
    if approved:
        publish(draft)
    return draft

@flow

Top-level orchestration boundary. Marks a function as a durable workflow.

@checkpoint

Persists output. Crash at step 3? Steps 1-2 never re-run.
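Conceptually, a checkpoint is a memoized step whose cache outlives the process. A minimal sketch of the idea, not Kitaru's actual implementation; the disk-backed store and hashing scheme here are illustrative:

```python
import hashlib
import json
import tempfile
from pathlib import Path

CKPT_DIR = Path(tempfile.mkdtemp())  # stand-in for a durable artifact store

def checkpoint(fn):
    """Persist a step's JSON-serializable output on disk, keyed by function
    name and arguments; a re-run returns the stored result instead of
    executing the step again."""
    def wrapper(*args):
        key = hashlib.sha1(f"{fn.__name__}:{args!r}".encode()).hexdigest()
        path = CKPT_DIR / f"{key}.json"
        if path.exists():                       # crash recovery: replay
            return json.loads(path.read_text())
        result = fn(*args)                      # first run: execute, persist
        path.write_text(json.dumps(result))
        return result
    return wrapper

executions = []

@checkpoint
def square(n: int) -> int:
    executions.append(n)   # records only real executions, not replays
    return n * n

square(3)   # executes and checkpoints
square(3)   # replays from disk; the function body never runs
```

Because the cache lives on disk rather than in memory, a fresh process after a crash sees the same checkpoints and skips every step that already completed.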

kitaru.wait()

Suspends the process. Resume when a human responds, 30s or 3 days later.
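The mechanics can be sketched in plain Python: persist the pending question, exit, and consume the answer on the next run. Everything here (the inbox file, the `Suspended` exception) is an illustrative assumption, not Kitaru's API:

```python
import json
import tempfile
from pathlib import Path

RUN_DIR = Path(tempfile.mkdtemp())
INBOX = RUN_DIR / "inbox.json"      # where a human or webhook drops the answer
PENDING = RUN_DIR / "pending.json"  # the recorded question, surviving restarts

class Suspended(Exception):
    """Raised to release the process while input is pending."""

def wait(question: str):
    if INBOX.exists():                        # an answer arrived: consume it
        answer = json.loads(INBOX.read_text())["answer"]
        INBOX.unlink()
        return answer
    PENDING.write_text(json.dumps({"question": question}))
    raise Suspended(question)                 # no answer yet: suspend

# First run reaches the decision point and suspends.
try:
    wait("Publish?")
except Suspended:
    pass

# Later (seconds or days), an answer lands in the inbox and the flow re-runs.
INBOX.write_text(json.dumps({"answer": True}))
approved = wait("Publish?")
```

The key property is that nothing blocks in memory: between the two runs, no process needs to be alive at all.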

kitaru.llm()

Resolves model alias, injects API key, logs cost automatically.
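A toy version of such a wrapper, with made-up prices and a stub in place of a real provider client; nothing here reflects Kitaru's internals:

```python
cost_log = []

# Hypothetical figures for illustration: $ per 1M input tokens.
PRICES = {"gpt-4o": 5.00}
ALIASES = {"default": "gpt-4o"}

def llm(prompt: str, model: str = "default",
        call=lambda p: f"echo: {p}") -> str:
    """Resolve a model alias, run the call, and log an estimated cost.
    `call` stands in for a real provider client."""
    resolved = ALIASES.get(model, model)
    reply = call(prompt)
    tokens = len(prompt.split())              # crude token estimate
    cost_log.append({
        "model": resolved,
        "tokens": tokens,
        "usd": tokens / 1_000_000 * PRICES[resolved],
    })
    return reply

reply = llm("Draft a report on: solar microgrids")
```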

kitaru.log()

Structured metadata on every execution. Query it in the dashboard.

kitaru.save()

Persist any artifact by name inside a checkpoint.

kitaru.load()

Retrieve saved artifacts from any prior execution by ID.
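The save/load pair behaves like a key-value artifact store addressed by run ID and name. A file-backed sketch with hypothetical names, not Kitaru's storage layer:

```python
import json
import tempfile
from pathlib import Path

STORE = Path(tempfile.mkdtemp())   # stand-in for S3 / GCS / local disk

def save(run_id: str, name: str, value) -> None:
    """Persist a named, JSON-serializable artifact under a run ID."""
    (STORE / f"{run_id}__{name}.json").write_text(json.dumps(value))

def load(run_id: str, name: str):
    """Retrieve an artifact written by any prior execution."""
    return json.loads((STORE / f"{run_id}__{name}.json").read_text())

save("run-001", "sources", ["paper A", "paper B"])
prior = load("run-001", "sources")   # any later run can read this back
```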

kitaru.configure()

Set stack, project, and runtime defaults. Zero config locally.

How it works

Four things. Not forty.

Everything your agents need to ship. Nothing extra to learn.

01 — Pause & Resume

Pause. Get input. Continue.

Suspends at decision points, releases compute, and picks up when input arrives from a human, another agent, or a webhook. Hours or days later.

02 — Replay from Failure

Crash at step 6? Resume from step 6.

Every step is checkpointed. Fix the issue and pick up where you left off, without re-burning tokens.

03 — Bring Your Framework

PydanticAI. CrewAI. Whatever.

Wrap it with Kitaru and it becomes durable. No lock-in. No opinions. Your framework, our infrastructure.

04 — Parallel Branches

Fan out. Each branch checkpoints independently.

checkpoint.submit() dispatches branches concurrently. Each has its own checkpoint history. Replay just the failed branch.
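The pattern is plain fan-out with per-branch memoization. A sketch using a thread pool, where `branch` and its file-based checkpoint are illustrative stand-ins for Kitaru's `checkpoint.submit()`:

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CKPT_DIR = Path(tempfile.mkdtemp())

def branch(topic: str) -> str:
    """One branch of the fan-out, with its own checkpoint: if this branch
    already completed, replay its stored result instead of re-running."""
    ckpt = CKPT_DIR / f"{topic}.json"
    if ckpt.exists():
        return json.loads(ckpt.read_text())
    result = f"summary of {topic}"            # stands in for real agent work
    ckpt.write_text(json.dumps(result))
    return result

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(branch, t) for t in ("ai", "biotech", "energy")]
    results = [f.result() for f in futures]   # order matches submission
```

If one branch fails, its checkpoint file is simply absent; a re-run replays the finished branches from disk and executes only the missing one.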

WHY KITARU

Why teams choose Kitaru

Observability built-in

Free user interface ships with the server. See every checkpoint, every LLM call, and what each step cost. Not a paid add-on.

Full execution control

Pause at decision points, get human input, and resume hours or days later. Replay from any checkpoint without re-running everything.

Deployment flexibility

Same Python code runs locally, on Kubernetes, or across AWS, GCP, and Azure. Switch stacks with one command.

Python-first, no lock-in

Wrap any Python agent framework (PydanticAI, CrewAI, or raw code) with Kitaru decorators. No DSL to learn and no vendor lock-in.

Architecture

Orchestration layer.
Not a framework.

Your Agent Code: PydanticAI, OpenAI SDK, CrewAI, raw Python. Write agents your way.
Kitaru SDK: @flow, @checkpoint, wait(), llm(), log(), save(), load(), configure(). Eight primitives, full durability.
Kitaru Engine: Checkpointer, DAG Builder, Replay, Cost Engine. Powered by ZenML.
Infrastructure: Kubernetes, AWS / GCP / Azure, S3 / GCS, SQL database. Your cloud.
agent.py
import kitaru
from kitaru import flow, checkpoint

@flow
def coding_agent(issue: str) -> str:
    plan = analyze_issue(issue)
    patch = write_code(plan)

    # Pauses. Resumes when input arrives.
    approved = kitaru.wait(
        schema=bool, question="Merge this PR?"
    )
    if approved:
        merge(patch)
    return patch

Your agent crashed at step 5.
Stop re-running steps 1 through 4.

pip install kitaru

Open source (Apache 2.0). pip install and go.