
The Anatomy of a Production Coding Agent


The gap between a demo coding agent and a production one isn’t features. It’s failure handling, cost awareness, and the ability to stop, inspect, and resume at any point.

Here’s the full pattern: the eight stages a coding agent goes through from issue to pull request, and what each stage actually requires from your infrastructure.
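The snippets below lean on a `@checkpoint` decorator. As a rough sketch of the idea, here is a minimal memoizer keyed by function name and arguments; this is an illustration only (it caches in memory, while a real implementation like Kitaru's would persist checkpoints durably so they survive crashes):

```python
import functools
import hashlib
import json

# In-memory cache for illustration; a production system persists checkpoints
# to durable storage so replays survive process crashes.
_CACHE: dict = {}

def checkpoint(fn):
    """Memoize a step's result, keyed by function name and arguments."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        raw = json.dumps([fn.__name__, args, kwargs], default=str, sort_keys=True)
        key = hashlib.sha256(raw.encode()).hexdigest()
        if key not in _CACHE:
            _CACHE[key] = fn(*args, **kwargs)
        return _CACHE[key]
    return wrapper
```

The key insight: a replay re-executes the same Python code, but every checkpointed call returns instantly from cache instead of re-running.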

The eight stages

1. Issue analysis

The agent reads the issue, understands the codebase context, and formulates a plan.

@checkpoint
def analyze_issue(issue: str, codebase_context: str) -> AnalysisPlan:
    return call_llm(
        f"Analyze this issue and create a plan:\n{issue}\n\nContext:\n{codebase_context}"
    )

Why this is a checkpoint: This LLM call costs $0.50-$2.00 depending on context size. If the agent crashes later, you don’t want to re-run it.

Failure mode: The LLM misunderstands the issue. Without a checkpoint, you’d never know and the agent could barrel ahead with a wrong plan.

2. Codebase exploration

The agent navigates the repo: finding relevant files, understanding the architecture, and tracing dependencies.

@checkpoint
def explore_codebase(plan: AnalysisPlan) -> ExplorationResult:
    files = search_codebase(plan.relevant_patterns)
    dependencies = trace_imports(files)
    return ExplorationResult(files=files, dependencies=dependencies)

Why this is a checkpoint: Exploration can involve dozens of file reads and searches. Caching this means replays skip straight to the result.

Failure mode: The agent misses a critical file. With a cached exploration result, you can inspect what it found and what it missed.

3. Plan formulation

The agent creates a concrete implementation plan: which files to change, in what order, and with what approach.

@checkpoint
def create_plan(analysis: AnalysisPlan, exploration: ExplorationResult) -> ImplementationPlan:
    return call_llm(
        f"Create a detailed implementation plan:\n{analysis}\n\nCodebase:\n{exploration}"
    )

Why this is a checkpoint: The plan is the highest-leverage artifact. If the implementation goes wrong, you want to inspect the plan, not re-derive it.

4. Human approval

This is where most demo agents skip straight to writing code. Production agents don’t.

approved = kitaru.wait(
    schema=bool,
    question=f"Agent proposes: {plan.summary}. Approve?",
    metadata={"estimated_cost": plan.estimated_tokens},
)
if not approved:
    return "Plan rejected by reviewer."

Why this matters: The agent is about to spend $5-$20 writing and testing code. A 30-second human review of the plan saves hours of wasted compute. This isn’t a nice-to-have; it’s the difference between a useful tool and an expensive liability.

Infrastructure requirement: Real suspension. The process should stop, release compute, and resume when the human responds, whether that’s 30 seconds or 3 days later.
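One way to picture the shape of real suspension (a hypothetical sketch, not Kitaru's actual mechanism; all names here are made up): the run persists its pending question and exits, and a later process checks whether an answer has arrived before resuming.

```python
import json
import pathlib

def suspend(run_id: str, question: str, state_dir: pathlib.Path) -> None:
    """Persist the pending question; the process can then exit and free compute."""
    path = state_dir / f"{run_id}.question.json"
    path.write_text(json.dumps({"question": question}))

def try_resume(run_id: str, state_dir: pathlib.Path):
    """Return the reviewer's answer if one has arrived, else None (still suspended)."""
    answer_file = state_dir / f"{run_id}.answer.json"
    if not answer_file.exists():
        return None
    return json.loads(answer_file.read_text())["approved"]
```

The point is that between `suspend` and a successful `try_resume`, no process is running and no compute is billed, whether the wait lasts 30 seconds or 3 days.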

5. Code generation

The agent writes code according to the approved plan.

@checkpoint
def write_code(plan: ImplementationPlan) -> CodeChanges:
    changes = []
    for file_change in plan.changes:
        code = call_llm(f"Write this change:\n{file_change}")
        changes.append(code)
    return CodeChanges(changes=changes)

Why this is a checkpoint: This is usually the most expensive step because multiple LLM calls are involved, each generating code. Caching this is critical for cost control.

Failure mode: The generated code has a bug. With a checkpoint, you can replay from this point with a different prompt or model, keeping all the previous work.
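A hedged sketch of what "replay with a different model" can look like: only this stage re-runs with the new model, while earlier checkpoints stay warm. `call_llm` is passed in as a stand-in; the signature is hypothetical.

```python
def write_code_with(model: str, plan_changes: list, call_llm) -> list:
    """Regenerate the code changes with a chosen model, leaving earlier stages cached."""
    return [
        call_llm(f"Write this change:\n{change}", model=model)
        for change in plan_changes
    ]
```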

6. Testing

The agent runs the test suite and interprets the results.

@checkpoint
def run_tests(changes: CodeChanges) -> TestResult:
    applied = apply_changes(changes)
    result = run_test_suite(applied)
    return result

Failure mode: Tests fail. The agent now has a decision to make: fix the code or escalate. This is where most demo agents fall apart. A production agent with checkpointed state can loop back to step 5 with the test failure context, or escalate to a human with the full execution history.
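The fix-or-escalate decision can be made explicit. A minimal sketch (the action names are illustrative):

```python
def next_action(tests_passed: bool, attempt: int, max_attempts: int = 3) -> str:
    """Loop back to code generation, escalate to a human, or move on to the PR."""
    if tests_passed:
        return "create_pr"
    if attempt < max_attempts:
        return "fix"
    return "escalate"
```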

7. Fix loop

If tests fail, the agent iterates. This is the part you absolutely cannot draw as a DAG.

for attempt in range(max_attempts):
    if test_result.passed:
        break
    code = fix_code(code, test_result)
    test_result = run_tests(code)

Each iteration through this loop is checkpointed. If the agent crashes on attempt 3, it resumes from attempt 3, not from scratch.
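One way to sketch attempt-level resumption (illustrative only; `fix_code` and `run_tests` are stand-ins passed in for clarity): key each iteration's checkpoint by its attempt number, so a replay skips attempts that already ran.

```python
def fix_loop(code, test_result, fix_code, run_tests, cache: dict, max_attempts: int = 3):
    """Iterate on failing tests; each attempt's fix is cached under ('fix', attempt)."""
    for attempt in range(max_attempts):
        if test_result.passed:
            break
        key = ("fix", attempt)
        if key not in cache:  # a replay skips attempts that already completed
            cache[key] = fix_code(code, test_result)
        code = cache[key]
        test_result = run_tests(code)
    return code, test_result
```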

8. PR creation and review

The agent packages the changes, writes a description, and creates a pull request.

@checkpoint
def create_pr(changes: CodeChanges, plan: ImplementationPlan) -> str:
    description = call_llm(f"Write a PR description for:\n{plan.summary}")
    pr_url = create_pull_request(changes, description)
    return pr_url

What the full flow looks like

Putting it all together:

@flow
def coding_agent(issue: str) -> str:
    context = load_codebase_context()
    analysis = analyze_issue(issue, context)
    exploration = explore_codebase(analysis)
    plan = create_plan(analysis, exploration)

    approved = kitaru.wait(
        schema=bool,
        question=f"Approve plan: {plan.summary}?",
    )
    if not approved:
        return "Rejected."

    code = write_code(plan)
    test_result = run_tests(code)

    for attempt in range(3):
        if test_result.passed:
            break
        code = fix_code(code, test_result)
        test_result = run_tests(code)

    if not test_result.passed:
        return "Tests still failing after 3 attempts."

    return create_pr(code, plan)

That’s normal Python. There’s no graph, no YAML, and no special control flow. But every meaningful step is checkpointed, there’s a human approval gate, and the whole thing survives crashes.

What production requires that demos don’t

Looking at this pattern, the requirements become clear:

Checkpoint caching: Each stage can cost $0.50-$5 in LLM tokens. Re-running from scratch on every failure is unacceptable.

Real suspension: The human approval step needs actual process suspension and not a polling loop burning compute while someone reviews a plan over lunch.

Artifact inspection: When something goes wrong at step 6, you need to look at what step 3 produced. Not a log line; the actual plan object.

Replay from any point: The fix loop failed after 3 attempts? Go back to step 5, swap in a different model, and try again. Don’t re-run the whole agent.

Cost tracking: How much did this run cost? Which step was most expensive? Is attempt 3 of the fix loop worth it, or should the agent escalate?
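A per-step cost ledger is enough to answer those questions. A minimal sketch, with made-up step names and rates:

```python
from collections import defaultdict

class CostTracker:
    """Illustrative per-step token cost accounting."""

    def __init__(self):
        self.usd = defaultdict(float)

    def record(self, step: str, tokens: int, usd_per_1k_tokens: float) -> None:
        self.usd[step] += tokens / 1000 * usd_per_1k_tokens

    def total(self) -> float:
        return sum(self.usd.values())

    def most_expensive(self) -> str:
        return max(self.usd, key=self.usd.get)
```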

These aren’t nice-to-haves. They’re the difference between a tool your team trusts and one they abandon after the first $200 surprise.

Continue in the docs

Start building durable agents

Install Kitaru in seconds and make your AI workflows reliable, replayable, and observable.