The Agent Is a Workflow That Writes Itself

Jump to:

Section

Example code available at AuctorAI/durable_agents.

Durable execution runtimes have become standard infrastructure for frontier agents. They allow long-running programs to survive arbitrary failures — crashes, rolling deploys, network partitions — by recording workflow progress and resuming only the work that remains.

Frontier agent harnesses are converging on two primitives: agents that invoke other agents recursively, and agents that write programs at runtime to drive their own tools (programmatic tool calling, or PTC). Both primitives expose a gap in the underlying runtime: durability guarantees that hold for a flat tool call loop do not automatically compose through recursion and agent-authored control flow.

We extended those guarantees by lowering each primitive into the durable runtime: subagents lower to child workflows, and PTC programs run through a deterministic interpreter that dispatches granular tool call activities. The result is a "recursive language model" that is also just a workflow: a single dispatch loop that runs tools, deep subagent trees, or an agent swarm, all inheriting granular replay, retry, resume, and cancellation from the runtime underneath. We call these durable agents.

Figure 1. The closure property for durable agents. Direct tool calls, PTC-authored tool programs, and subagent invocations all cross the same durable dispatch boundary. Tools lower to activities, subagents lower to child workflows, and each child workflow runs another durable agent with the same recursive structure.

Generalizing the Agent Loop

Two patterns show up in every modern agent harness:

Recursive subagents. An agent invokes another agent, which has its own model, its own tools, and potentially its own subagents. Recursion is general — fan-out can be massive, trees can run deep. The design space covers whether a parent waits for a child or starts it in the background, how parent and child synchronize, and how context flows between them. The practical goal is task decomposition and context specialization.
Programmatic tool calling (PTC). An agent writes a program against an in-distribution frontend[1] that drives its tools. The program loops, branches, and aggregates, externalizing compute the model would otherwise need to do in-context. PTC moves the model from picking values inside a fixed orchestration to writing the orchestration itself. This idea appears under several names in prior work.[2]

The two primitives connect through the tool surface. A subagent is surfaced to the model as a tool, so once the agent has PTC, every subagent primitive — sync delegation, async spawn, gather, cancel — becomes composable inside its programs: in loops, in parallel, against constructed contexts. Without either primitive, you get a flat tool call loop. With subagents, you get recursive task decomposition. With PTC, you get agent-authored control flow. With both, the two compose freely. Generalization is the point.[3]

Making this durable is the hard part. The runtime must push durable boundaries down through subagents and PTC programs, recursively. Run a PTC program as one big sandboxed step and a crash mid-program throws away every tool call inside. Nest subagents inside the parent's execution and you can't resume one without rerunning the whole tree. If a PTC program fans out to five subagents and three finish before a crash, the restart should pick up the remaining two without rerunning the first three.

A Primer on Durable Execution

Durable execution runtimes record the results of a workflow's side effects, then replay against that record to recover progress after failure.[4] A workflow that crashes on step 73 of 100 should resume on a fresh process without losing or reproducing any previously completed work. Several solutions provide this guarantee. We've adopted Temporal[5] at Auctor and use its terminology throughout.

A workflow definition is code that defines a durable process — in our case, an agent loop. A rollout from the agent is a workflow execution. During execution, Temporal stores each workflow's event history: an append-only log of events that represents the workflow's recorded progress.

An activity is a single unit of side-effectful work: a model API request, a sandbox command, a database write. A command is a requested action emitted by workflow code — schedule an activity, start a child workflow execution, create a timer, and so on. Worker processes poll a task queue, run workflow code and activities, and report results back to Temporal.

Figure 2. A high-level map of an agent loop as a workflow execution. Each model call or tool call is scheduled as an activity, with results stored in the event history. Worker processes poll the Temporal service for workflow and activity tasks. We show those handlers separately for visual clarity, though one worker process can handle both. When a worker dies, another worker replays the workflow definition against the recorded history. Completed activities return recorded results, and only unfinished work remains.

The key contract is that each workflow definition must be deterministic: given the same input and event history, the workflow execution must emit the same sequence of commands. During replay, a worker re-runs the workflow definition from the start and matches each emitted command against the recorded event history. If an activity previously completed, its recorded result is returned from history and the activity is not run again. If replay reaches a command that did not complete before the worker died, Temporal schedules the remaining work and the workflow continues as if the crash never happened.

Expressing agent work this way gives long-horizon runs the reliability patterns they need. A transient model-provider outage no longer poisons an hour-long rollout, because the runtime retries that failed activity alone under its retry policy. A workflow can sleep for days on a timer or block on an external signal without using resources while it waits. Different worker pools (model workers, tool workers, sandbox workers) scale independently and can be rolled or auto-scaled against a backlog without dropping in-flight work. Cancellation can cascade through child workflows by relationship. Parent close policies make that relationship explicit: when the parent closes, a child can be terminated, asked to cancel, or abandoned. Each of these properties can be hand-rolled on an in-memory async loop. Composing them — with replay-correct state recovery across worker processes — is what durable execution provides.

Building the Durable Agent

We can now build the durable agent as one dispatch layer with three lowerings. The construction is closed; every operation the model can ask for (direct tool call, subagent spawn, tool call via PTC) passes through a single, recursive switch:

# initialize prompts, tools, etc.
agent = Agent(...)

async def dispatch(call):
    match agent.tools[call.name]:
        case DirectTool(fn=fn):
            return await execute_activity(fn, call.args)
        case SubagentTool() as tool:
            # this subagent may itself dispatch subagents or invoke PTC
            return await run_subagent(tool, call)
        case PtcTool():
            # PTC may dispatch subagents, or even other PTC programs
            return await run_ptc(call.args["program"])

Tool calls run as activities. A single-loop agent maps onto durable execution trivially — the base case. The workflow definition is a while-loop that calls the model, dispatches tool calls, and continues until the model terminates. Each model call and tool call lowers to an activity; parallel tool calls fan out as concurrent activities. Retry and timeout policies attach natively. The model simply decides what happens next, and the runtime makes it happen durably.

@workflow
async def rollout(agent, input):
    history = [input]
    while True:
        # call model provider durably
        response = await execute_activity(call_model, agent.model, history)
        history.append(response)

        # terminate loop
        if response.is_final:
            return response.output

        # execute tool batch
        results = await gather(*(dispatch(call) for call in response.tool_calls))
        history.extend(results)

# run the root-level agent
result = await workflow.run(rollout, agent, input)

Subagents as child workflow executions. A subagent must itself be a durable agent, so spawn needs to start a child workflow execution running rollout — the same agent loop, one level down. Temporal schedules the child, possibly on a different worker process, and the child's lifecycle and result are recorded in the parent's event history. Recursion and fan-out follow from there. Awaiting a child and closing a child are separate concerns: the parent may wait for the result, while the parent close policy defines what happens to the child if the parent closes (terminate, request_cancel, or abandon). Multi-agent orchestration primitives like shared state and message passing sit on top of the same execution shape.

async def run_subagent(tool, call):
    # optionally limit tree depth
    assert agent.depth < MAX_DEPTH, "subagent recursion limit"

    # start child workflow
    return await child_workflow(
        rollout,
        replace(tool.subagent, depth=agent.depth + 1),
        call.args,
        id=deterministic_child_id(agent.run_id, call.id),
        parent_close_policy=tool.parent_close_policy,
    )

PTC as a workflow-native interpreter. Rather than executing run_ptc as a single opaque activity, we want to dispatch each tool call within as its own activity or child workflow — the same one a direct call would have produced. Fan-out and subagents should fold in for free if the runtime already supports them.

To accomplish this, we need a PTC interpreter that satisfies a few properties:

Workflow-space: run the interpreter as workflow code, not inside an activity — each call the program makes becomes a workflow command, visible to the runtime and recorded in the event history.
Deterministic: given the same program, inputs, and event history, the interpreter must reach the same call sites in the same order.
Sandboxed: block ambient I/O and syscalls, so interpreter execution remains deterministic and safe to host in the worker process.
In-distribution: present a language interface to the model that it already knows well.
Suspendible: pause at any call site and resume after the result is recorded.

When the agent calls run_ptc, the interpreter steps through the program. Each call is intercepted: the interpreter suspends, hands the call to dispatch, and resumes with the result. Parallel calls inside the program lower to the workflow execution's own concurrency primitives. N gathered tool calls become N concurrent activities. N gathered spawns become N concurrent child workflows. The interpreter never runs the side effect itself; it only dispatches.

We adopted Monty[6] at Auctor to serve this purpose, but also considered QuickJS as an embeddable JavaScript-compatible surface, as well as a custom DSL[7] approach.

async def run_ptc(program):
    # holds in-memory interpreter state
    interp = Interpreter(program)
    while not interp.done:
        # each step executes the interpreter and
        # suspends at the next foreign function call
        op = interp.step()
        match op:
            case Gather(calls):
                result = await gather(*(dispatch(c) for c in calls))
            case Call(c):
                result = await dispatch(c)
        interp.resume(result)
    return interp.result

PTC completes the closure. run_ptc recurses into dispatch, so every tool call inside walks back through to the same switch. A spawn inside PTC becomes a child workflow, which itself may invoke PTC recursively one level down.

Why not a Python sandbox?

The obvious alternative is to run a Python sandbox with a tool-derived client library inside it, allowing the agent's exec scripts to call back into the harness's tools.[8] The upsides are real: native filesystem, network, and package ecosystem. The costs are also real, because it forces the entire script to collapse into a single activity. Per-call retry, replayable partial progress — gone, exactly where the agent gets interesting.[9]

Figure 3. Same agent program, two lowerings. On the left, the whole script lives behind one exec activity. When bash fails, Temporal retries the only activity it knows about — the script itself — so list_files and write_file can run again after their writes already landed. The event history has no record of those inner successes, so replay can't reconcile or skip them; the writes may be repeated. On the right, each call is its own activity. list_files and write_file are recorded as completed activities, so replay resolves their results from the event history and only bash retries. The hazard is localized, but not eliminated. The failing activity itself still needs to be idempotent on retry. (See Failure Modes.)

Our workflow-native approach doesn't give up the sandbox because it remains available through the tool surface. Sandbox calls in PTC programs keep their own activity boundary:

# example (unsafe) PTC program
now = await sandbox.exec("python -c 'import time; print(time.time())'")
metrics = await db.query(f"SELECT * FROM requests WHERE ts > {now} - 86400")

# PTC stdout is appended to the tool result
print(await sandbox.exec("python /scripts/p99.py", stdin=json.dumps(metrics)))

Three awaits, three activities. The system-time call is non-deterministic, but its result is captured once in the event history. On replay, the recorded timestamp is read from history rather than the wall clock. Same goes for the query and the analysis: each has its own durable boundary. We pay the added cost of moving bytes between the worker process and the sandbox, but expressiveness is preserved.[10]

Failure Modes

Durable execution covers more than crash and replay. It provides a vocabulary for failure: retry policies, timeouts (start-to-close, heartbeat), cancellation, idempotency contracts.

Error routing. We categorize tool errors in three classes:

ToolValidationError: the agent violated the tool's contract (e.g., incorrect arguments, missing artifact reads, exhausted recursion depth). The activity fails without retry. The error surfaces as agent feedback, and the agent decides what to do next.
ToolExecutionError: the tool hit a failure the runtime cannot safely retry on its own (e.g., an upstream provider outage, an unexpected exception). This is the catch-all. The error surfaces in the agent's context, and we route it to a separate monitoring tier.
ToolRetryableError: the operation is idempotent and safe to repeat (e.g., a read hitting a 429, a transient network blip). Raising this opts the activity into its retry policy. Exhausted retries become a ToolExecutionError.

Idempotency and retries. Raising ToolRetryableError is an opt-in idempotency claim — the tool's author asserts the activity is safe to run more than once. Read-only tools satisfy this trivially. Idempotent writes need stable keys derived from identifiers that survive replay: the workflow ID, run ID, and specific tool call. Making every tool perfectly idempotent is hard, especially when coordinating writes across multiple services that can't be wrapped in a single transaction. In our system, most writes don't need any of this. The agent loop is itself a retry mechanism: a ToolExecutionError becomes feedback, the model decides what to do next, and models are unreasonably robust to flaky tools. At-most-once is fine for most things, but retry can avoid paying for another model call.[11]

Error composition. Tool-level errors need to compose cleanly. When a subagent calls run_ptc, which calls a tool, which retries and fails, the terminal failure must surface cleanly back through the chain. Subagents handle their own tool failures internally. If a child workflow fails terminally, the parent sees a single ToolExecutionError. Tools that fail in a PTC program are re-raised from run_ptc. When a parallel tool call fails, in-flight siblings are canceled before the error propagates, to avoid orphan side effects.

Cancellation is separate from error handling. How parent cancellation propagates depends on the child relationship. Linked subagents terminate with the parent; detached subagents carry an explicit parent close policy (terminate, request_cancel, abandon).

Testing closes the loop. The matrix of errors and cancellation across every tool implementation, subagents, and PTC is vast. We parametrize this matrix as test cases and run them against a local WorkflowEnvironment, verifying that the number of activities, workflows, and retries matches expectations. Time skipping makes long sleeps take milliseconds; recorded provider responses[12] let us iterate the matrix without burning model calls.

A Durable Example

The agent is pointed at a repository and asked to analyze source files independently, build a cross-file index, verify the aggregate result, and write the final report. It uses PTC to drive the rollout: sandbox calls for shell work, linked subagents for bounded analysis, and a detached subagent for background indexing.

# enumerate targets
files = sandbox.run("ls src/*.py").splitlines()

# per-file analysis, awaited together
analyses = await gather(*[spawn("analyzer", path=f) for f in files])

# indexer runs in the background while we verify
index = spawn_detached("indexer", root=".")
verdict = await spawn("verifier", findings=analyses)

# join the background work and finalize
index_report = await wait(index)
write_report(verdict, index_report)

Figure 4. A code-analysis rollout. One small program expands at runtime into N activities and M child workflow executions under a single parent workflow execution. Every activity and child workflow is recorded — and since each child is itself a workflow, this extends recursively. If a worker dies mid-rollout, another worker replays the parent workflow against its history, reads completed activity and child workflow results, and continues whatever is still in flight.

Would the agent actually write this code unprompted? Maybe not in this exact shape today — but it can, and when it does, the program runs durably. RLM-style harnesses are advancing quickly,[13] and we believe orchestration patterns that feel ambitious now will be table stakes soon.

In Practice

Our implementation is a thin layer over Pydantic AI[14] and Temporal: no fork, no patching. A developer writes an ordinary Pydantic agent without hand-written workflow definitions or activity wrappers. The library integration discovers model and tool surfaces from the registered agent and lowers them to generated activities. Workflow-native tools (the PTC interpreter, spawn, etc.) are not lowered; they run in workflow-space.

Beyond this, we package the machinery as a generic TaskWorkflow behind a Temporal Nexus endpoint: callers simply provide a small runtime config — model fallback chain, tools, subagents, sandbox reference, gating policies, input messages — and have their tasks run durably. The payoff is that we can ship agents with confidence. Every tool call, every subagent turn, every PTC step is recorded in the same event history — and inherits the same observability, retry policies, cancellation semantics, and worker pools. With durability, recursion, and orchestration handled by the substrate, the interesting work moves up the stack, to what the agents actually do.

1In practice, often Python or JavaScript, because of requisite runtime support (described in Building the Durable Agent) and in-distribution-ness.

2CodeAct (Wang et al., 2024), Cloudflare's Code Mode, and Anthropic's programmatic tool calling. An older line of work looks at treating code as the action surface for a language model: PAL: Program-Aided Language Models (Gao et al., 2022) and Program of Thoughts Prompting (Chen et al., 2022).

3This composition is essentially the recursive language model (Zhang & Khattab, 2025). RLM enforces a few design decisions — a REPL-centric Python frontend, subagents awaited synchronously, and context passed as strings — but the core idea generalizes beyond them.

4Workers often cache workflow state in memory and preferentially route subsequent tasks back to the same worker. In Temporal terms, this optimization is called Sticky Execution.

5For deeper architectural treatment see Temporal's documentation. Other durable runtimes exist (Inngest, Restate, DBOS, Hatchet). They share the core shape.

6Monty is a minimal Python interpreter from Pydantic that does a lot more than what we use it for here. We mostly cared about Python's in-distribution properties and Monty's exceptionally low host overhead.

7Making this work is a weight/prompt-space bet. Because we don't control the model's training distribution and weights, we look to control the language interface.

8Whether the agent itself runs inside the sandbox or outside it does not change the picture. Either way the entire script lives behind one activity as far as the runtime is concerned.

9While it's still possible to hand-roll retry/replay properties, why not just do it with Temporal?

10Monty also supports filesystem mount, so the agent could instead write pathlib-style code against a virtual filesystem without explicit tool calling.

11Generic sandbox calls are difficult to make strictly idempotent: a shell command can succeed and lose its output, fail and leave half a write, or run again with different effects under retry. We have ideas on how to push this further, which deserve their own post. In practice we can accept approximate idempotency.

12HTTP cassettes (vcrpy and similar) capture provider request/response pairs once, then replay them deterministically against the same workflow code in subsequent test runs.

13See Recursive Agent Optimization (Gandhi et al., 2025) and The Mismanaged Geniuses Hypothesis.

14We recently chose Pydantic AI because of its mature Temporal integration.