Enable parallel, resumable experiment runs#59
Merged
Conversation
Three changes so the ifc-proplang-rocq* variants (and any test set) can run as
many parallel dispatches whose results all persist, even across job kills:
1. Per-test concurrency group. `group: run-experiment-${{ github.ref }}` with
cancel-in-progress meant dispatching any test cancelled an in-flight run for
a *different* test on the same branch. Fold inputs.tests into the group so
distinct tests don't cancel each other; re-dispatching the same test still
supersedes its own prior run.
2. Rebase-retry on the result push. Parallel jobs commit disjoint store files
from the same base commit, so a bare `git push` is rejected non-fast-forward
for all but the first. Since the files are disjoint, rebasing onto the
updated branch never conflicts — fetch + rebase + retry lets every job land
its own results.
3. Commit partial results on kill. The commit step was `success()`-only, so a
job hit by the timeout never persisted progress (only the artifact). Make it
always(), guard on the store existing, tag interrupted stores
`[partial: <status>]`, and drop the job timeout 360 -> 350 so it self-cancels
below GitHub's hard 6h kill with headroom to push. store.push appends+flushes
per trial, so the committed partial store lets a re-run resume via the
incremental per-trial dedup instead of starting over.
Validated locally: a two-job disjoint-file push race rejects the second bare
push and both land cleanly via the rebase-retry loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #57/#58. Makes it possible to run many experiment dispatches (e.g. the 21
ifc-proplang-rocq-*variants) in parallel and have every result persist — even when a job is killed at the timeout.Changes
Per-test concurrency group.
group: run-experiment-${{ github.ref }}+cancel-in-progressmeant dispatching any test cancelled an in-flight run for a different test on the same branch (only the latest survived). Foldinputs.testsinto the group key so distinct tests run concurrently; re-dispatching the same test still supersedes its own prior run.Rebase-retry on the result push. Parallel jobs commit disjoint store files from the same base commit, so a bare
git pushis rejected non-fast-forward for all but the first (a ref race, not a content conflict). Because the files are disjoint, rebasing onto the updated branch never conflicts —fetch + rebase + retrylets every job land its results.Commit partial results on kill. The commit step was
success()-only, so a job killed at the timeout never persisted progress (only the always-uploaded artifact). Now:if: ${{ inputs.commit-results && always() }}so it also runs on failure/cancel/timeout,[partial: <status>],timeout-minutes360 → 350 so the job self-cancels below GitHub's hard 6 h kill, leaving headroom for the commit to push.store.pushappends + flushes per trial, so the committed partial store lets a re-run resume via the incremental per-trial dedup rather than restarting.Validation
Local two-job, disjoint-file push race (same base commit):
Confirmed in source that
store.push(store.rs) opens append +flush()es each metric and the driver calls it per trial, so a killed job's committed store holds every completed trial. Thealways()+ timeout-headroom behavior relies on GitHub runningalways()steps within the cancellation grace below the hard 6 h limit.🤖 Generated with Claude Code