CI: shard the AMD case-optimization pre-build #1582
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR addresses CI timeouts in the Frontier AMD case-optimization pipeline by sharding the slow AMD flang pre-build step into two concurrent SLURM submissions.
Changes:
- Update the Frontier AMD pre-build workflow step to submit two shards concurrently and wait for both to complete.
- Add shard-aware case selection to the prebuild script, plus coordination to avoid concurrent builds in shared staging directories.
- Adjust log printing and artifact collection to include per-shard output files.
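The "submit two shards concurrently and wait for both" pattern can be sketched in bash roughly as follows. This is a minimal illustration, not the workflow's actual step: `submit_shard` and the `FAIL_SHARD` variable are hypothetical stand-ins for the real SLURM submission command, used only so the failure path can be exercised.

```shell
#!/usr/bin/env bash
# Sketch: run two shard submissions concurrently and fail if either fails.
# submit_shard is a hypothetical placeholder for the real submission command.
submit_shard() {
    local shard="$1"
    echo "submitting shard $shard"
    # Simulate a failure when FAIL_SHARD (hypothetical) names this shard.
    [ "$shard" != "${FAIL_SHARD:-}" ]
}

run_shards() {
    local pid1 pid2 rc=0
    submit_shard "1/2" & pid1=$!
    submit_shard "2/2" & pid2=$!
    # "wait <pid>" returns that background job's exit status, so each
    # shard is checked individually rather than only the last one.
    wait "$pid1" || rc=1
    wait "$pid2" || rc=1
    return "$rc"
}
```

Waiting on each PID separately matters: a bare `wait` with no arguments returns 0 regardless of how the background jobs exited, which would hide a failed shard.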
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/workflows/test.yml | Runs a one-time clean, submits two concurrent pre-build shards for Frontier AMD, and archives shard-specific logs. |
| .github/scripts/prebuild-case-optimization.sh | Adds optional i/N sharding and a marker-based synchronization scheme for shared build targets. |

    shard="${job_shard:-}"
    if [ -n "$shard" ]; then
        shard_idx="${shard%%/*}"
        shard_count="${shard##*/}"
        case "${shard_idx}${shard_count}" in
            ''|*[!0-9]*) echo "ERROR: bad shard '$shard' (expected i/N)"; exit 1 ;;
        esac
        if [ "$shard" != "$shard_idx/$shard_count" ] || [ "$shard_idx" -lt 1 ] || [ "$shard_idx" -gt "$shard_count" ]; then
            echo "ERROR: bad shard '$shard' (expected i/N with 1 <= i <= N)"; exit 1
        fi
    fi

    if [ -n "$shard" ] && [ "$shard_count" -gt 1 ]; then
        shared_marker="build/.prebuild-shared-targets-done"
        set -- benchmarks/*/case.py
        first_case="$1"
        if [ "$shard_idx" -eq 1 ]; then
            echo "=== Shard 1/$shard_count: building shared targets ==="
            ./mfc.sh build -i "$first_case" -t syscheck pre_process post_process --case-optimization $gpu_opts -j 8
            touch "$shared_marker"
        else
            echo "=== Shard $shard_idx/$shard_count: waiting for shard 1 to build shared targets ==="
            waited=0
            until [ -f "$shared_marker" ]; do
                if [ "$waited" -ge 5400 ]; then
                    echo "ERROR: timed out waiting for $shared_marker"; exit 1
                fi
                sleep 30
                waited=$((waited + 30))
            done
        fi
    fi

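One way the wait loop above could be hardened against stale markers and a failed producer is sketched below. This is an illustration under assumptions, not the script's code: `wait_for_marker` and `build_shared` are hypothetical helper names, the failure-marker path is whatever the caller passes in, and the producer records failure with a plain if/else rather than the ERR trap mentioned later in this thread.

```shell
#!/usr/bin/env bash
# Sketch: poll for a "done" marker, but bail out early if a "failed" marker
# appears. All paths, the timeout, and the poll interval are parameters.
wait_for_marker() {
    local done_marker="$1" fail_marker="$2" timeout="$3" interval="$4"
    local waited=0
    until [ -f "$done_marker" ]; do
        if [ -f "$fail_marker" ]; then
            echo "ERROR: producer failed (found $fail_marker)" >&2
            return 1
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "ERROR: timed out waiting for $done_marker" >&2
            return 2
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Producer side: clear both markers first so state left over from an
# earlier run cannot satisfy the wait, then record success or failure.
build_shared() {
    local done_marker="$1" fail_marker="$2"; shift 2
    rm -f "$done_marker" "$fail_marker"
    if "$@"; then
        touch "$done_marker"
    else
        touch "$fail_marker"
        return 1
    fi
}
```

With the values used in the snippet, a waiting shard would call something like `wait_for_marker build/.prebuild-shared-targets-done <fail-marker> 5400 30`, where the failure-marker path is an assumption.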
Four non-AMD Frontier jobs (CCE gpu-omp x2, gpu-acc, cpu) died uniformly at ~51-55 min in yesterday's run; login-node build contention makes the 60-minute budget too tight on bad days.
Added: the Frontier dependency-build step timeout is doubled to 120 minutes. Four non-AMD Frontier jobs (including cpu) died uniformly at ~51-55 minutes in run 27371326575, consistent with login-node contention rather than code. With the prebuild sharding, this makes #1582 the complete Frontier-timeout fix.
The Frontier AMD cpu test job ran the full suite in one 1:59-walltime SLURM job (job 80982103050 died at the limit); the AMD gpu jobs were already sharded 2-way - the cpu job now matches. The job name gains the [i/2] suffix; if any branch-protection required check pins the old unsharded name, update it.
Third fix added: the Frontier AMD cpu test job ran the full suite in a single 1:59-walltime SLURM job (run 27398258302 died at the limit in Test) while the AMD gpu jobs were already sharded, so it now gets the same 2-way sharding. Note the check name changes to include [1/2]/[2/2]; adjust branch protection if it pinned the old name.
Copilot review fixes: shard parts validated independently and the full i/N shape enforced (1/, /2, and bare 12 now rejected); shard 1 clears both markers at start so stale state from reruns cannot skip the wait; a failure marker written via ERR trap lets waiting shards fail fast instead of burning the 90-minute timeout.
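A minimal sketch of what "validated independently, full i/N shape enforced" could look like is given below. It is not the script's actual code: `parse_shard` is a hypothetical helper that prints `i N` on success and returns non-zero otherwise.

```shell
#!/usr/bin/env bash
# Sketch: validate a shard spec of the form i/N with 1 <= i <= N.
# parse_shard is a hypothetical helper name, not taken from the script.
parse_shard() {
    local spec="$1" idx count
    # Require exactly one slash with text on both sides.
    case "$spec" in
        */*/*|/*|*/|'') return 1 ;;
        */*) ;;
        *) return 1 ;;
    esac
    idx="${spec%%/*}"
    count="${spec##*/}"
    # Check each part on its own: non-empty and digits only.
    case "$idx" in ''|*[!0-9]*) return 1 ;; esac
    case "$count" in ''|*[!0-9]*) return 1 ;; esac
    # Range check: 1 <= i <= N.
    [ "$idx" -ge 1 ] && [ "$idx" -le "$count" ] || return 1
    echo "$idx $count"
}
```

Checking the two parts separately is the point: when the parts are concatenated before the digits-only test, an input such as `1/` yields the string `1`, which passes even though the count is empty.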
All three review findings addressed: shard validation now checks each part independently and enforces the full i/N shape (
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | master | #1582 | +/- |
|---|---|---|---|
| Coverage | 60.94% | 60.94% | |
| Files | 82 | 82 | |
| Lines | 19922 | 19922 | |
| Branches | 2924 | 2924 | |
| Hits | 12141 | 12141 | |
| Misses | 5805 | 5805 | |
| Partials | 1976 | 1976 | |

Problem
The "Case Opt | Frontier (AMD) (gpu-omp)" job times out. Evidence: job run 80982103009 ran 2h05m and died in the Pre-Build (SLURM) step: prebuild-case-optimization.sh builds every case-optimized benchmark variant serially in a single SLURM job under AMD flang (the slowest compiler of the set), and the serial build alone now exceeds the job's walltime.
Fix
Shard the AMD pre-build across two concurrent SLURM jobs:
- prebuild-case-optimization.sh now honors an optional shard spec ($job_shard, format i/N, already plumbed through submit-slurm-job.sh's 5th argument): shard i builds every Nth case of the sorted benchmark list (index mod N == i-1). The two shards are deterministic, disjoint, and together cover all cases.
- test.yml submits shards 1/2 and 2/2 as background processes and waits on both, failing if either fails. Output/job-id files get unique -1-of-2/-2-of-2 suffixes (existing submit-slurm-job.sh behavior); the Print Logs and Archive Logs steps pick them up.
Two concurrency details, since both shards share one workspace:
- syscheck/pre_process/post_process hash identically across these benchmarks. Shard 1 builds those shared targets first and drops a marker file; other shards wait for it, after which their builds no-op in the shared dirs. This avoids two cmake/ninja invocations ever running concurrently in the same staging directory.
Unchanged
run_case_optimization.sh - so it is left alone.
Validation
- bash -n + shellcheck clean on both scripts; yaml.safe_load and ./mfc.sh precheck pass.
- 1/2 -> {5eq_rk3_weno3_hllc, ibm, viscous_weno5_sgb_acoustic}, 2/2 -> {hypo_hll, igr}; union covers all 5 cases with zero overlap; malformed shard specs (0/2, 3/2, a/b, 12) are rejected.
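
The selection rule from the Fix section (index mod N == i-1 over the sorted case list) can be sketched as below. `select_shard` is a hypothetical helper for illustration, not the script's code; it takes i and N followed by the already-sorted case names.

```shell
#!/usr/bin/env bash
# Sketch: print the items belonging to shard i of N, where an item at
# zero-based position n is selected when n mod N == i-1.
# select_shard is a hypothetical helper name.
select_shard() {
    local idx="$1" count="$2"; shift 2
    local n=0 item
    for item in "$@"; do
        if [ $((n % count)) -eq $((idx - 1)) ]; then
            echo "$item"
        fi
        n=$((n + 1))
    done
}
```

Called with the five case names in sorted order (5eq_rk3_weno3_hllc, hypo_hll, ibm, igr, viscous_weno5_sgb_acoustic), `select_shard 1 2` picks positions 0, 2, and 4, and `select_shard 2 2` picks positions 1 and 3, which reproduces the split listed under Validation.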