
Why Your AI Agents Need Durable Execution

Your agent just spent 45 minutes and $30 in API calls researching a topic. It made it through 23 tool calls, synthesized data from 8 sources, and was about to write the final summary.

Then the OpenAI API returned a 429. Your agent crashed. All that work, gone.

The problem with stateless agents

Most agent frameworks treat each run as stateless. When something fails, you start over. There’s no concept of “pick up where I left off.”

This is fine for short tasks: a single LLM call fails? Retry it. But agents are getting longer and more complex:

  • Research agents that run for hours, making dozens of API calls
  • Coding agents that iterate through multiple rounds of write-test-fix
  • Multi-step workflows with human approval gates in the middle

For these, starting over is not just annoying; it’s wasteful. You’re re-burning tokens, re-calling APIs, re-doing work that already succeeded.

What durable execution gives you

Durable execution means every meaningful step in your agent is checkpointed. If it crashes at step 15, you restart from step 15, not step 1.

from kitaru import checkpoint, flow

@checkpoint
def step_one(input: str) -> str:
    # This result is persisted
    return expensive_api_call(input)

@checkpoint
def step_two(data: str) -> str:
    # If this crashes, step_one's result is still cached
    return another_expensive_call(data)

@flow
def my_agent(input: str) -> str:
    a = step_one(input)
    b = step_two(a)
    return b

When you replay after a failure, step_one returns its cached output instantly. Only step_two re-executes.
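
To make those semantics concrete, here's a toy, in-memory version of the skip-completed-steps logic. This is a sketch of the behavior, not Kitaru's implementation, which persists checkpoints durably:

```python
# Toy checkpoint-and-replay semantics: a crash in step_two does not
# lose step_one's result, so replay only re-runs the failed step.
import functools

_store: dict = {}   # (step name, args) -> cached result
calls: list = []    # records which steps actually executed

def checkpoint(fn):
    @functools.wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, args)
        if key in _store:           # step already completed: skip it
            return _store[key]
        calls.append(fn.__name__)   # executing for real
        result = fn(*args)
        _store[key] = result        # persist the result before moving on
        return result
    return wrapper

attempts = {"step_two": 0}

@checkpoint
def step_one(x: str) -> str:
    return x.upper()

@checkpoint
def step_two(x: str) -> str:
    attempts["step_two"] += 1
    if attempts["step_two"] == 1:
        raise RuntimeError("simulated 429")  # first attempt crashes
    return x + "!"

try:
    step_two(step_one("hi"))        # crashes in step_two; step_one is saved
except RuntimeError:
    pass

result = step_two(step_one("hi"))   # replay: step_one cached, only step_two re-runs
```

After the replay, `calls` shows `step_one` ran exactly once even though the pipeline was invoked twice.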

Beyond crash recovery

Once you have checkpointed execution, other things become possible:

Human-in-the-loop: Your agent can pause mid-execution, wait for human input, and resume later. Not with a polling loop burning compute, but with actual suspension. The process stops. When input arrives, it restarts from the checkpoint.
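
A toy sketch of that suspend-and-resume shape (not Kitaru's actual API): a step raises an exception when human input is missing, the run ends, and a later run picks up from the surviving checkpoints once the input has arrived:

```python
# Suspension via checkpoints: no polling loop, the process just exits
# and a fresh run resumes from whatever was already checkpointed.
store: dict = {}      # stands in for durable checkpoint storage
inbox: dict = {}      # stands in for wherever human input lands

class Suspended(Exception):
    """Raised to stop the run until a human responds."""

def checkpoint(fn):
    def wrapper(*args):
        key = (fn.__name__, args)
        if key not in store:
            store[key] = fn(*args)
        return store[key]
    return wrapper

@checkpoint
def draft(topic: str) -> str:
    return f"draft about {topic}"

@checkpoint
def approve(text: str) -> str:
    if "approval" not in inbox:
        raise Suspended("waiting for human review")
    return f"{text} [{inbox['approval']}]"

def run(topic: str):
    try:
        return approve(draft(topic))
    except Suspended:
        return None  # process stops; checkpoints survive in `store`

first = run("ducks")          # suspends: draft is checkpointed, approve is not
inbox["approval"] = "LGTM"    # human responds later
second = run("ducks")         # resumes: draft cached, approve now completes
```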

Replay with overrides: Your agent made a bad decision at step 5? Go back, swap in a different input, and replay from there. All subsequent steps re-execute with the new data.
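
In checkpoint terms, an override is just replacing one step's stored output and letting downstream steps recompute. A minimal illustration of that idea (hypothetical semantics, not Kitaru's documented API):

```python
# Replay with overrides: swap the checkpointed output of the bad step,
# and every step that consumes it re-executes with the new data.
store: dict = {}

def checkpoint(fn):
    def wrapper(*args):
        key = (fn.__name__, args)
        if key not in store:
            store[key] = fn(*args)
        return store[key]
    return wrapper

@checkpoint
def pick_source(query: str) -> str:
    return "low-quality-blog"        # the "bad decision"

@checkpoint
def summarize(source: str) -> str:
    return f"summary from {source}"

bad = summarize(pick_source("llms"))

# Override: replace pick_source's checkpointed output with a better value.
store[("pick_source", ("llms",))] = "peer-reviewed-paper"

good = summarize(pick_source("llms"))  # downstream step re-runs with new input
```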

Artifact inspection: Every checkpoint output is a versioned artifact you can inspect, compare across runs, and query programmatically.

Why not just use try/except?

You can add retry logic and state persistence yourself. But you’ll end up rebuilding:

  • Serialization for every intermediate result
  • A storage layer for checkpoints
  • Replay logic that skips completed steps
  • A way to inspect and compare artifacts across runs
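
Here's roughly what the first three of those look like hand-rolled: JSON serialization, a file-backed store, and replay logic that skips completed steps. A minimal sketch, and note everything it still lacks: error handling, versioning, artifact comparison:

```python
# The bare minimum you'd rebuild yourself without a framework.
import functools
import hashlib
import json
import tempfile
from pathlib import Path

CKPT_DIR = Path(tempfile.mkdtemp())   # stands in for a real storage layer

def checkpoint(fn):
    @functools.wraps(fn)
    def wrapper(*args):
        # Serialization: every intermediate result must be JSON-safe,
        # or this breaks on the first non-trivial return type.
        key = hashlib.sha256(
            json.dumps([fn.__name__, list(args)]).encode()
        ).hexdigest()
        path = CKPT_DIR / f"{key}.json"
        if path.exists():                    # replay: skip completed steps
            return json.loads(path.read_text())
        result = fn(*args)
        path.write_text(json.dumps(result))  # storage layer
        return result
    return wrapper

@checkpoint
def tokenize(text: str) -> list:
    return text.split()

words = tokenize("durable execution wins")   # executes, writes the checkpoint
again = tokenize("durable execution wins")   # replay: read from disk, no re-run
```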

That’s infrastructure work, not agent work. Kitaru handles it with decorators so you can focus on what your agent actually does.


Start building durable agents

Install Kitaru in seconds and make your AI workflows reliable, replayable, and observable.