From the makers of ZenML · Open source (Apache 2.0)

You build your agents.
We make them durable.

Open-source durable execution for Python agents. Checkpoint, replay, and deploy any framework, any cloud.

pip install kitaru
Watch the demo
THE PROBLEM

Your agents work. But they're trapped.

They die when your laptop sleeps

One crash, one timeout, or one closed lid and hours of LLM calls vanish. You start over from scratch.

They can't coordinate with other agents

No pause, no resume, no human-in-the-loop. Every agent is a fire-and-forget black box.

Their intermediate results vanish when they crash

No checkpoints, no artifacts, no audit trail. When something goes wrong you have nothing to debug.

Kitaru moves your agents from your machine to your infrastructure.

SEE THE CODE

Eight primitives. Full durability.

agent.py
import kitaru
from kitaru import flow, checkpoint
 
kitaru.configure(stack="kubernetes")
 
@checkpoint
def research(topic: str) -> dict:
    results = search_web(topic)
    kitaru.save("sources", results)
    return summarize(results)
 
@checkpoint
def write_draft(context: str, prev_id: str) -> str:
    prior = kitaru.load(prev_id, "sources")
    return kitaru.llm(
        f"Draft a report on: {context}\nPrior sources: {prior}",
        model="gpt-4o",
    )
 
@flow
def report_agent(topic: str, prev_id: str) -> str:
    data = research(topic)
    draft = write_draft(str(data), prev_id)
    kitaru.log(topic=topic, words=len(draft.split()))
 
    approved = kitaru.wait(
        schema=bool, question="Publish?"
    )
    if approved:
        publish(draft)
    return draft

@flow

Top-level orchestration boundary. Marks a function as a durable workflow.

@checkpoint

Persists output. Crash at step 3? Steps 1-2 never re-run.
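Conceptually, a checkpoint is a memoized step whose cache outlives the process. A minimal sketch of the idea, not Kitaru's actual implementation; the disk-backed store and hashing scheme here are illustrative:

```python
import hashlib
import json
import tempfile
from pathlib import Path

CKPT_DIR = Path(tempfile.mkdtemp())  # stand-in for a durable artifact store

def checkpoint(fn):
    """Persist a step's JSON-serializable output on disk, keyed by function
    name and arguments; a re-run returns the stored result instead of
    executing the step again."""
    def wrapper(*args):
        key = hashlib.sha1(f"{fn.__name__}:{args!r}".encode()).hexdigest()
        path = CKPT_DIR / f"{key}.json"
        if path.exists():                       # crash recovery: replay
            return json.loads(path.read_text())
        result = fn(*args)                      # first run: execute, persist
        path.write_text(json.dumps(result))
        return result
    return wrapper

executions = []

@checkpoint
def square(n: int) -> int:
    executions.append(n)   # records only real executions, not replays
    return n * n

square(3)   # executes and checkpoints
square(3)   # replays from disk; the function body never runs
```

Because the cache lives on disk rather than in memory, a fresh process after a crash sees the same checkpoints and skips every step that already completed.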

kitaru.wait()

Suspends the process. Resume when a human responds, 30s or 3 days later.
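The mechanics can be sketched in plain Python: persist the pending question, exit, and consume the answer on the next run. Everything here (the inbox file, the `Suspended` exception) is an illustrative assumption, not Kitaru's API:

```python
import json
import tempfile
from pathlib import Path

RUN_DIR = Path(tempfile.mkdtemp())
INBOX = RUN_DIR / "inbox.json"      # where a human or webhook drops the answer
PENDING = RUN_DIR / "pending.json"  # the recorded question, surviving restarts

class Suspended(Exception):
    """Raised to release the process while input is pending."""

def wait(question: str):
    if INBOX.exists():                        # an answer arrived: consume it
        answer = json.loads(INBOX.read_text())["answer"]
        INBOX.unlink()
        return answer
    PENDING.write_text(json.dumps({"question": question}))
    raise Suspended(question)                 # no answer yet: suspend

# First run reaches the decision point and suspends.
try:
    wait("Publish?")
except Suspended:
    pass

# Later (seconds or days), an answer lands in the inbox and the flow re-runs.
INBOX.write_text(json.dumps({"answer": True}))
approved = wait("Publish?")
```

The key property is that nothing blocks in memory: between the two runs, no process needs to be alive at all.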

kitaru.llm()

Resolves model alias, injects API key, logs cost automatically.
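A toy version of such a wrapper, with made-up prices and a stub in place of a real provider client; nothing here reflects Kitaru's internals:

```python
cost_log = []

# Hypothetical figures for illustration: $ per 1M input tokens.
PRICES = {"gpt-4o": 5.00}
ALIASES = {"default": "gpt-4o"}

def llm(prompt: str, model: str = "default",
        call=lambda p: f"echo: {p}") -> str:
    """Resolve a model alias, run the call, and log an estimated cost.
    `call` stands in for a real provider client."""
    resolved = ALIASES.get(model, model)
    reply = call(prompt)
    tokens = len(prompt.split())              # crude token estimate
    cost_log.append({
        "model": resolved,
        "tokens": tokens,
        "usd": tokens / 1_000_000 * PRICES[resolved],
    })
    return reply

reply = llm("Draft a report on: solar microgrids")
```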

kitaru.log()

Structured metadata on every execution. Query it in the dashboard.

kitaru.save()

Persist any artifact by name inside a checkpoint.

kitaru.load()

Retrieve saved artifacts from any prior execution by ID.
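The save/load pair behaves like a key-value artifact store addressed by run ID and name. A file-backed sketch with hypothetical names, not Kitaru's storage layer:

```python
import json
import tempfile
from pathlib import Path

STORE = Path(tempfile.mkdtemp())   # stand-in for S3 / GCS / local disk

def save(run_id: str, name: str, value) -> None:
    """Persist a named, JSON-serializable artifact under a run ID."""
    (STORE / f"{run_id}__{name}.json").write_text(json.dumps(value))

def load(run_id: str, name: str):
    """Retrieve an artifact written by any prior execution."""
    return json.loads((STORE / f"{run_id}__{name}.json").read_text())

save("run-001", "sources", ["paper A", "paper B"])
prior = load("run-001", "sources")   # any later run can read this back
```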

kitaru.configure()

Set stack, project, and runtime defaults. Zero config locally.

How it works

Four things. Not forty.

Everything your agents need to ship. Nothing extra to learn.

01 — Pause & Resume

Pause. Get input. Continue.

Suspends at decision points, releases compute, and picks up when input arrives from a human, another agent, or a webhook. Hours or days later.

02 — Replay from Failure

Crash at step 6? Resume from step 6.

Every step is checkpointed. Fix the issue and pick up where you left off, without re-burning tokens.

03 — Bring Your Framework

PydanticAI. CrewAI. Whatever.

Wrap it with Kitaru and it becomes durable. No lock-in. No opinions. Your framework, our infrastructure.

04 — Parallel Branches

Fan out. Each branch checkpoints independently.

checkpoint.submit() dispatches branches concurrently. Each has its own checkpoint history. Replay just the failed branch.
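The pattern is plain fan-out with per-branch memoization. A sketch using a thread pool, where `branch` and its file-based checkpoint are illustrative stand-ins for Kitaru's `checkpoint.submit()`:

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CKPT_DIR = Path(tempfile.mkdtemp())

def branch(topic: str) -> str:
    """One branch of the fan-out, with its own checkpoint: if this branch
    already completed, replay its stored result instead of re-running."""
    ckpt = CKPT_DIR / f"{topic}.json"
    if ckpt.exists():
        return json.loads(ckpt.read_text())
    result = f"summary of {topic}"            # stands in for real agent work
    ckpt.write_text(json.dumps(result))
    return result

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(branch, t) for t in ("ai", "biotech", "energy")]
    results = [f.result() for f in futures]   # order matches submission
```

If one branch fails, its checkpoint file is simply absent; a re-run replays the finished branches from disk and executes only the missing one.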

WHY KITARU

Why teams choose Kitaru

Observability built-in

Free user interface ships with the server. See every checkpoint, every LLM call, and what each step cost. Not a paid add-on.

Full execution control

Pause at decision points, get human input, and resume hours or days later. Replay from any checkpoint without re-running everything.

Deployment flexibility

Same Python code runs locally, on Kubernetes, or across AWS, GCP, and Azure. Switch stacks with one command.

Python-first, no lock-in

Wrap any Python agent framework (PydanticAI, CrewAI, or raw code) with Kitaru decorators. No DSL to learn and no vendor lock-in.

Architecture

Orchestration layer.
Not a framework.

Your Agent Code: PydanticAI, OpenAI SDK, CrewAI, raw Python. Write agents your way.
Kitaru SDK: @flow, @checkpoint, wait(), llm(), log(), save(), load(), configure(). Eight primitives, full durability.
Kitaru Engine: Checkpointer, DAG Builder, Replay, Cost Engine. Powered by ZenML.
Infrastructure: Kubernetes, AWS / GCP / Azure, S3 / GCS, SQL database. Your cloud.
agent.py
import kitaru
from kitaru import flow, checkpoint

@flow
def coding_agent(issue: str) -> str:
    plan = analyze_issue(issue)
    patch = write_code(plan)

    # Pauses. Resumes when input arrives.
    approved = kitaru.wait(
        schema=bool, question="Merge this PR?"
    )
    if approved:
        merge(patch)
    return patch

Your agent crashed at step 5.
Stop re-running steps 1 through 4.

pip install kitaru

Open source (Apache 2.0). pip install and go.