
Enable parallel, resumable experiment runs #59

Merged
alpaylan merged 1 commit into main from parallel-experiment-runs on May 27, 2026

@alpaylan (Owner)

Follow-up to #57 and #58. This change makes it possible to run many experiment dispatches (e.g. the 21 ifc-proplang-rocq-* variants) in parallel and have every result persist, even when a job is killed at the timeout.

Changes

  1. Per-test concurrency group. The previous `group: run-experiment-${{ github.ref }}` combined with `cancel-in-progress` meant that dispatching any test cancelled an in-flight run of a different test on the same branch, so only the latest dispatch survived. Folding `inputs.tests` into the group key lets distinct tests run concurrently, while re-dispatching the same test still supersedes its own prior run.

  2. Rebase-retry on the result push. Parallel jobs commit disjoint store files from the same base commit, so a bare `git push` is rejected as non-fast-forward for every job but the first. This is a ref race, not a content conflict: because the files are disjoint, rebasing onto the updated branch never conflicts, so a fetch + rebase + retry loop lets every job land its results.

  3. Commit partial results on kill. The commit step previously ran only on `success()`, so a job killed at the timeout never persisted its progress to the branch (only the always-uploaded artifact survived). Now:

    • the step uses `if: ${{ inputs.commit-results && always() }}`, so it also runs on failure, cancellation, and timeout;
    • it guards on the store file existing and tags interrupted commits `[partial: <status>]`;
    • the job's `timeout-minutes` drops from 360 to 350, so the job self-cancels below GitHub's hard 6 h kill, leaving headroom for the commit to push.

     `store.push` appends and flushes per trial, so the committed partial store lets a re-run resume via the incremental per-trial dedup rather than restarting.
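
The rebase-retry push from change 2 can be sketched as a small shell function. This is a minimal illustration and not the workflow's actual step: the function name `push_with_rebase_retry`, its arguments, and the default of five attempts are assumptions, and it expects a git identity to be configured already so the rebase can rewrite the commit.

```shell
# Hypothetical helper: push HEAD to <remote>/<branch>, rebasing onto the
# remote tip and retrying whenever the push is rejected.
# Usage: push_with_rebase_retry <remote> <branch> [max_attempts]
push_with_rebase_retry() {
  local remote branch max_attempts attempt
  remote="$1"
  branch="$2"
  max_attempts="${3:-5}"   # default attempt count is an assumption
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Fast path: succeeds when no other job pushed since our base commit.
    if git push "$remote" "HEAD:$branch"; then
      return 0
    fi
    # Another job landed first: fetch the new tip and replay our commit on it.
    git fetch "$remote" "$branch" || return 1
    if ! git rebase FETCH_HEAD; then
      # Not expected while jobs write disjoint files; leave the tree clean.
      git rebase --abort
      return 1
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A commit step would call it as `push_with_rebase_retry origin "$BRANCH"` after committing the store file, where `BRANCH` stands in for however the workflow names its target branch. Bounding the attempts keeps a persistently failing push from eating into the timeout headroom.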

Validation

Local two-job, disjoint-file push race (same base commit):

1) bare push:   jobA OK ; jobB REJECTED (non-fast-forward)   <- the bug
2) rebase-retry: jobB push after rebase: OK
final remote:   store-A.jsonl=result-from-jobA, store-B.jsonl=result-from-jobB  (clean history, no conflict)
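
A race like the one above can be reproduced in throwaway repositories. The script below is a sketch of one way to do it, not the exact commands used for the validation: the function name `reproduce_push_race`, the temporary directory layout, and the commit messages are all illustrative.

```shell
# Sketch: reproduce the two-job push race in throwaway repositories.
# The body runs in a subshell so the exported identity does not leak out.
reproduce_push_race() (
  tmp="$(mktemp -d)" || exit 1
  # Fixed identity so commits and rebases work without global git config.
  export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
  export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

  # Shared remote holding a single base commit on main.
  git init -q --bare "$tmp/remote.git"
  git init -q "$tmp/seed"
  git -C "$tmp/seed" symbolic-ref HEAD refs/heads/main
  git -C "$tmp/seed" commit -q --allow-empty -m "base"
  git -C "$tmp/seed" push -q "$tmp/remote.git" main

  # Two jobs clone the same base and each commit their own store file.
  for job in A B; do
    git clone -q -b main "$tmp/remote.git" "$tmp/job$job"
    echo "result-from-job$job" > "$tmp/job$job/store-$job.jsonl"
    git -C "$tmp/job$job" add "store-$job.jsonl"
    git -C "$tmp/job$job" commit -q -m "results: job$job"
  done

  # 1) Bare pushes: the first wins, the second is a non-fast-forward.
  git -C "$tmp/jobA" push -q origin HEAD:main && echo "jobA bare push: OK"
  git -C "$tmp/jobB" push -q origin HEAD:main 2>/dev/null \
    || echo "jobB bare push: REJECTED"

  # 2) Fetch, rebase onto the new tip, and push again.
  git -C "$tmp/jobB" fetch -q origin main
  git -C "$tmp/jobB" rebase -q FETCH_HEAD
  git -C "$tmp/jobB" push -q origin HEAD:main \
    && echo "jobB push after rebase: OK"

  # List the files now present on the remote branch.
  git -C "$tmp/remote.git" ls-tree --name-only main
  rm -rf "$tmp"
)
```

Calling `reproduce_push_race` prints the outcome of each push followed by the two store file names on the remote branch; the rejection message from git itself is discarded so only the summary lines remain.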

Confirmed in the source that `store.push` (`store.rs`) opens the store file in append mode and calls `flush()` after each metric, and that the driver calls it once per trial, so a killed job's committed store holds every completed trial. The `always()` plus timeout-headroom behavior relies on GitHub running `always()` steps within the cancellation grace period, below the hard 6 h limit.
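
The guard and tagging in the commit step could look roughly like the helper below. It is a sketch under stated assumptions: the function name `partial_commit_message`, its arguments, and the `results:` message prefix are hypothetical, and the job status is assumed to be passed in by the workflow (for example from the `job.status` context). Only the `[partial: <status>]` tag comes from the change itself.

```shell
# Hypothetical helper for the commit step. Prints the commit message to use,
# or returns 1 when there is no store file to commit.
# Usage: partial_commit_message <store_file> <job_status> <test_name>
partial_commit_message() {
  local store job_status test_name
  store="$1"
  job_status="$2"
  test_name="$3"
  # Guard: a job killed before writing anything has no store to persist.
  [ -f "$store" ] || return 1
  if [ "$job_status" = "success" ]; then
    printf 'results: %s\n' "$test_name"
  else
    # Tag interrupted runs so partial stores are easy to spot in history.
    printf 'results: %s [partial: %s]\n' "$test_name" "$job_status"
  fi
}
```

A step would use it as `msg="$(partial_commit_message "$STORE" "$STATUS" "$TEST")" && git commit -m "$msg"`, skipping the commit entirely when the guard fails; `STORE`, `STATUS`, and `TEST` are placeholder variable names.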

🤖 Generated with Claude Code

Three changes so the ifc-proplang-rocq* variants (and any test set) can run as
many parallel dispatches whose results all persist, even across job kills:

1. Per-test concurrency group. `group: run-experiment-${{ github.ref }}` with
   cancel-in-progress meant dispatching any test cancelled an in-flight run for
   a *different* test on the same branch. Fold inputs.tests into the group so
   distinct tests don't cancel each other; re-dispatching the same test still
   supersedes its own prior run.

2. Rebase-retry on the result push. Parallel jobs commit disjoint store files
   from the same base commit, so a bare `git push` is rejected non-fast-forward
   for all but the first. Since the files are disjoint, rebasing onto the
   updated branch never conflicts — fetch + rebase + retry lets every job land
   its own results.

3. Commit partial results on kill. The commit step was `success()`-only, so a
   job hit by the timeout never persisted progress (only the artifact). Make it
   always(), guard on the store existing, tag interrupted stores
   `[partial: <status>]`, and drop the job timeout 360 -> 350 so it self-cancels
   below GitHub's hard 6h kill with headroom to push. store.push appends+flushes
   per trial, so the committed partial store lets a re-run resume via the
   incremental per-trial dedup instead of starting over.

Validated locally: a two-job disjoint-file push race rejects the second bare
push and both land cleanly via the rebase-retry loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alpaylan merged commit 4ade5b7 into main on May 27, 2026
3 checks passed