Skip to content

CI: shard the AMD case-optimization pre-build#1582

Open
sbryngelson wants to merge 4 commits into
MFlowCode:masterfrom
sbryngelson:ci-shard-caseopt-prebuild
Open

CI: shard the AMD case-optimization pre-build#1582
sbryngelson wants to merge 4 commits into
MFlowCode:masterfrom
sbryngelson:ci-shard-caseopt-prebuild

Conversation

@sbryngelson

Copy link
Copy Markdown
Member

Problem

The "Case Opt | Frontier (AMD) (gpu-omp)" job times out. Evidence: job run 80982103009 ran 2h05m and died in the Pre-Build (SLURM) step — prebuild-case-optimization.sh builds every case-optimized benchmark variant serially in a single SLURM job under AMD flang (the slowest compiler of the set), and the serial build alone now exceeds the job's walltime.

Fix

Shard the AMD pre-build across two concurrent SLURM jobs:

  • prebuild-case-optimization.sh now honors an optional shard spec ($job_shard, format i/N, already plumbed through submit-slurm-job.sh's 5th argument): shard i builds every Nth case of the sorted benchmark list (index mod N == i-1). The two shards are deterministic, disjoint, and together cover all cases.
  • The frontier_amd Pre-Build step in test.yml submits shards 1/2 and 2/2 as background processes and waits on both, failing if either fails. Output/job-id files get unique -1-of-2/-2-of-2 suffixes (existing submit-slurm-job.sh behavior); the Print Logs and Archive Logs steps pick them up.

Two concurrency details, since both shards share one workspace:

  • The in-job staging clean is skipped when sharded (it would wipe the sibling shard's in-progress build); the workflow runs the clean once on the login node before submitting.
  • The case-optimized simulation builds land in per-case hash-named staging dirs (no collision), but syscheck/pre_process/post_process hash identically across these benchmarks. Shard 1 builds those shared targets first and drops a marker file; other shards wait for it, after which their builds no-op in the shared dirs. This avoids two cmake/ninja invocations ever running concurrently in the same staging directory.

Unchanged

  • No shard set = exactly the old behavior (all cases, in-job clean), so the Phoenix pre-build path and any direct invocation are unaffected.
  • The Frontier (non-AMD) path doesn't use the pre-build mechanism at all — it builds inside the GPU run job via run_case_optimization.sh — so it is left alone.

Validation

  • bash -n + shellcheck clean on both scripts; yaml.safe_load and ./mfc.sh precheck pass.
  • Shard partition logic reproduced in a standalone shell test against the real benchmark list: 1/2 -> {5eq_rk3_weno3_hllc, ibm, viscous_weno5_sgb_acoustic}, 2/2 -> {hypo_hll, igr}; union covers all 5 cases with zero overlap; malformed shard specs (0/2, 3/2, a/b, 12) are rejected.

Copilot AI review requested due to automatic review settings June 12, 2026 12:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses CI timeouts in the Frontier AMD case-optimization pipeline by sharding the slow AMD flang pre-build step into two concurrent SLURM submissions.

Changes:

  • Update the Frontier AMD pre-build workflow step to submit two shards concurrently and wait for both to complete.
  • Add shard-aware case selection to the prebuild script, plus coordination to avoid concurrent builds in shared staging directories.
  • Adjust log printing and artifact collection to include per-shard output files.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
.github/workflows/test.yml Runs a one-time clean, submits two concurrent pre-build shards for Frontier AMD, and archives shard-specific logs.
.github/scripts/prebuild-case-optimization.sh Adds optional i/N sharding and a marker-based synchronization scheme for shared build targets.

Comment on lines +28 to +38
shard="${job_shard:-}"
if [ -n "$shard" ]; then
shard_idx="${shard%%/*}"
shard_count="${shard##*/}"
case "${shard_idx}${shard_count}" in
''|*[!0-9]*) echo "ERROR: bad shard '$shard' (expected i/N)"; exit 1 ;;
esac
if [ "$shard" != "$shard_idx/$shard_count" ] || [ "$shard_idx" -lt 1 ] || [ "$shard_idx" -gt "$shard_count" ]; then
echo "ERROR: bad shard '$shard' (expected i/N with 1 <= i <= N)"; exit 1
fi
fi
Comment on lines +66 to +85
if [ -n "$shard" ] && [ "$shard_count" -gt 1 ]; then
shared_marker="build/.prebuild-shared-targets-done"
set -- benchmarks/*/case.py
first_case="$1"
if [ "$shard_idx" -eq 1 ]; then
echo "=== Shard 1/$shard_count: building shared targets ==="
./mfc.sh build -i "$first_case" -t syscheck pre_process post_process --case-optimization $gpu_opts -j 8
touch "$shared_marker"
else
echo "=== Shard $shard_idx/$shard_count: waiting for shard 1 to build shared targets ==="
waited=0
until [ -f "$shared_marker" ]; do
if [ "$waited" -ge 5400 ]; then
echo "ERROR: timed out waiting for $shared_marker"; exit 1
fi
sleep 30
waited=$((waited + 30))
done
fi
fi
Comment on lines +71 to +83
echo "=== Shard 1/$shard_count: building shared targets ==="
./mfc.sh build -i "$first_case" -t syscheck pre_process post_process --case-optimization $gpu_opts -j 8
touch "$shared_marker"
else
echo "=== Shard $shard_idx/$shard_count: waiting for shard 1 to build shared targets ==="
waited=0
until [ -f "$shared_marker" ]; do
if [ "$waited" -ge 5400 ]; then
echo "ERROR: timed out waiting for $shared_marker"; exit 1
fi
sleep 30
waited=$((waited + 30))
done
Four non-AMD Frontier jobs (CCE gpu-omp x2, gpu-acc, cpu) died uniformly at ~51-55 min in yesterday's run; login-node build contention makes the 60-minute budget too tight on bad days.
@sbryngelson

Copy link
Copy Markdown
Member Author

Added: the Frontier dependency-build step timeout is doubled to 120 minutes — four non-AMD Frontier jobs (including cpu) died uniformly at ~51-55 minutes in run 27371326575, consistent with login-node contention rather than code. With the prebuild sharding, this makes #1582 the complete Frontier-timeout fix.

It ran the full suite in one 1:59-walltime SLURM job (job 80982103050 died at the limit); the AMD gpu jobs were already sharded 2-way - the cpu job now matches. The job name gains the [i/2] suffix; if any branch-protection required-check pins the old unsharded name, update it.
@sbryngelson

Copy link
Copy Markdown
Member Author

Third fix added: the Frontier AMD cpu test job ran the full suite in a single 1:59-walltime SLURM job (run 27398258302 died at the limit in Test) while the AMD gpu jobs were already sharded — it now gets the same 2-way sharding. Note the check name changes to include [1/2]/[2/2]; adjust branch protection if it pinned the old name.

Copilot review fixes: shard parts validated independently and the full i/N shape enforced (1/, /2, and bare 12 now rejected); shard 1 clears both markers at start so stale state from reruns cannot skip the wait; a failure marker written via ERR trap lets waiting shards fail fast instead of burning the 90-minute timeout.
@sbryngelson

Copy link
Copy Markdown
Member Author

All three review findings addressed: shard validation now checks each part independently and enforces the full i/N shape (1/, /2, and slashless forms rejected); shard 1 clears both coordination markers at its own start, so stale state from reruns or manual invocations cannot skip the wait; and a failure marker written via ERR trap makes waiting shards fail fast with a pointer to shard 1's log instead of waiting out the 90-minute timeout. 22 standalone test cases over the validation and marker protocol, bash -n and shellcheck clean.

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.94%. Comparing base (ac5774e) to head (32b22ad).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1582   +/-   ##
=======================================
  Coverage   60.94%   60.94%           
=======================================
  Files          82       82           
  Lines       19922    19922           
  Branches     2924     2924           
=======================================
  Hits        12141    12141           
  Misses       5805     5805           
  Partials     1976     1976           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants