Skip to content

Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision#228

Open
Copilot wants to merge 1 commit into
mainfrom
copilot/fix-loop-restart-issue
Open

Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision#228
Copilot wants to merge 1 commit into
mainfrom
copilot/fix-loop-restart-issue

Conversation

Copilot AI commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

Two related defects in how df.loop drives continue_as_new:

  1. Non-root loops restarted from the graph root. df.loop called continue_as_new inline in the main orchestration, so every new generation restarted from graph.root_node_id — re-executing prefix nodes on every iteration. For prefix ~> df.loop(body), the prefix ran once per iteration instead of once per instance.

  2. Loops that spawned sub-orchestrations hung forever across generations. Once loops run as child orchestrations (see below), a loop body that itself spawns sub-orchestrations — a nested df.loop, or a JOIN/RACE branch — would stall. Duroxide's auto-generated child instance IDs ({parent}::sub::{event_id}) reset their event counter on each continue_as_new generation, so every loop generation re-derived the same child ID and collided with the previous (now terminal) child. The collided child was acked without processing and the loop never made progress.

Approach

Loops run as child sub-orchestrations

Each df.loop() node spawns a dedicated child sub-orchestration (execute_loop, registered as pg_durable::orchestration::execute-loop). The child handles all iterations via continue_as_new; when the loop exits it returns a SubtreeEnvelope to the parent. The parent orchestration awaits the child and continues with any suffix nodes — so prefix nodes run exactly once.

This is made possible by duroxide PR #31, which preserves the parent link across continue_as_new generations in a child orchestration. The duroxide dependency is switched to a git dependency on the pinodeca/continue-parent-link branch until that PR is merged and released.

// execute_loop_node — spawns child sub-orchestration with an explicit, generation-qualified id
let child_id = child_instance_id(ctx, "loop", node_id);
let raw = ctx.schedule_sub_orchestration_with_id(LOOP_NAME, child_id, loop_input).await?;
parse_subtree_envelope(&raw, "LOOP", results)

Security is preserved: execute_loop calls load_function_graph at the start of every generation (including after continue_as_new), re-validating submitted_by from the database and catching cross-iteration tampering exactly as the parent execute() does.

Generation-qualified child instance IDs

To fix the cross-generation collision, all sub-orchestration spawn sites (loop, join, join3 extras, race) now use schedule_sub_orchestration_with_id with a deterministic, generation-qualified child ID built by a new child_instance_id() helper:

{instance_id}::e{execution_id}::{tag}::{node_id}

execution_id increments on each continue_as_new, so a child spawned in generation N never collides with the terminal child from generation N-1, while the ID stays deterministic across replays of the same generation.

Changes

  • Cargo.toml / Cargo.lock: switches duroxide to a git dependency on microsoft/duroxide branch pinodeca/continue-parent-link; adds [patch.crates-io] so duroxide-pg's transitive dependency resolves to the same version. Revert to crates.io pins once PR [DO NOT MERGE] Add native Duroxide provider, stop using duroxide-pg-opt #31 is merged and released.

  • src/orchestrations/execute_function_graph.rs

    • execute_loop (new public sub-orchestration): runs iterations via continue_as_new; re-calls load_function_graph per generation for security; returns a SubtreeEnvelope on exit (a break inside the body is the loop's own terminator, so the loop always exits with Normal control)
    • execute_loop_node: spawns execute_loop and merges results back via parse_subtree_envelope
    • child_instance_id (new helper) + schedule_sub_orchestration_with_id at the loop, join, and race spawn sites — fixes the cross-generation child ID collision
  • src/registry.rs: registers LOOP_NAME alongside NAME and SUBTREE_NAME

  • USER_GUIDE.md: notes that each df.loop() now runs as its own child orchestration (prefix runs once, suffix runs after the loop exits, and df.instances shows the parent plus the loop's child instance)

  • tests/e2e/sql/24_nonroot_loop.sql: regression tests covering:

    1. prefix ~> loop — prefix runs once, body runs N times
    2. prefix ~> loop ~> suffix — prefix and suffix each run once
    3. Named result from a prefix node accessible inside the loop body across generations
    4. Nested loop (df.loop(... ~> df.loop(...) ...)) — exercises the cross-generation child ID fix
    5. Loop inside a JOIN branch — exercises the same fix for parallel branches
    6. Non-root while-loopprefix ~> loop(body, while-condition) ~> suffix exits via the while condition

Upgrade & Migration

No extension schema changes and no upgrade script changes. The new .so reads the same df schema as before; loop execution is entirely a runtime/orchestration change. The only dependency change is the temporary duroxide git pin noted above.

Copilot AI changed the title [WIP] Fix df.loop continue_as_new restart behavior Fix df.loop continue_as_new restarting from graph root when loop is not root Jun 11, 2026
Copilot AI requested a review from pinodeca June 11, 2026 23:29
@pinodeca pinodeca marked this pull request as ready for review June 11, 2026 23:40

@pinodeca pinodeca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding your statement:

A sub-orchestration approach was evaluated but is blocked by a duroxide 0.1.29
limitation: ContinueAsNew drops the parent link, so the parent orchestration never
receives the child's completion.

I created this PR in duroxide to unblock that approach:
microsoft/duroxide#31

Therefore, re-evaluate and:

  • if the approach is still blocked, explain why and what would have to change to unblock it.
  • if the approach is unblocked, re-implement using the sub-orchestration approach and when you're done update the PR description accordingly.

Copilot AI requested a review from pinodeca June 12, 2026 19:50
@pinodeca pinodeca force-pushed the copilot/fix-loop-restart-issue branch from a593965 to 754f1f2 Compare June 12, 2026 21:06
@pinodeca pinodeca changed the title Fix df.loop continue_as_new restarting from graph root when loop is not root Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision Jun 12, 2026
@pinodeca

Copy link
Copy Markdown
Contributor

If microsoft/duroxide#33 merges and is part of the next release, this PR can revert back to using schedule_sub_orchestration rather than schedule_sub_orchestration_with_id thus avoiding managing an orchestration ID scheme.

Also, note that this PR should solve #230 and #233 (along with #227, which it initially targeted). Let's validate that by adding regression tests for the scenarios described in 230 and 233.

@pinodeca pinodeca mentioned this pull request Jun 17, 2026
8 tasks
pinodeca added a commit that referenced this pull request Jun 27, 2026
Squash of the node-state-model proposal, the execution-id proposal, and the
temporary exec-id implementation plan. Rebased onto the df.loop
sub-orchestration fix (PR #228, copilot/fix-loop-restart-issue).
@pinodeca pinodeca force-pushed the copilot/fix-loop-restart-issue branch from 3a4255f to b26327a Compare June 27, 2026 13:51
pinodeca added a commit that referenced this pull request Jun 28, 2026
Squash of the node-state-model proposal, the execution-id proposal, and the
temporary exec-id implementation plan. Rebased onto the df.loop
sub-orchestration fix (PR #228, copilot/fix-loop-restart-issue).
pinodeca added a commit that referenced this pull request Jun 28, 2026
Squash of the node-state-model proposal, the execution-id proposal, and the
temporary exec-id implementation plan. Rebased onto the df.loop
sub-orchestration fix (PR #228, copilot/fix-loop-restart-issue).
@pinodeca pinodeca force-pushed the copilot/fix-loop-restart-issue branch from b26327a to 3aed223 Compare June 29, 2026 23:36
@microsoft microsoft deleted a comment from Copilot AI Jun 30, 2026
@microsoft microsoft deleted a comment from Copilot AI Jun 30, 2026
…s_new

df.loop called continue_as_new inline in the main orchestration, so every
new generation restarted from graph.root_node_id, re-executing prefix nodes
on every iteration. Each df.loop() node now spawns a dedicated child
sub-orchestration (execute_loop) that owns continue_as_new; the parent awaits
it and runs any suffix nodes exactly once.

Relies on duroxide PR #31 (parent link preserved across continue_as_new),
pulled in as a git dependency until merged and released.

Co-authored-by: copilot-swe-agent <copilot@github.com>
Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
@pinodeca pinodeca force-pushed the copilot/fix-loop-restart-issue branch from 3aed223 to a348916 Compare June 30, 2026 04:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants