CI: shard the AMD case-optimization pre-build by sbryngelson · Pull Request #1582 · MFlowCode/MFC

sbryngelson · 2026-06-12T12:13:51Z

Problem

The "Case Opt | Frontier (AMD) (gpu-omp)" job times out. Evidence: job run 80982103009 ran 2h05m and died in the Pre-Build (SLURM) step — prebuild-case-optimization.sh builds every case-optimized benchmark variant serially in a single SLURM job under AMD flang (the slowest compiler of the set), and the serial build alone now exceeds the job's walltime.

Fix

Shard the AMD pre-build across two concurrent SLURM jobs:

prebuild-case-optimization.sh now honors an optional shard spec ($job_shard, format i/N, already plumbed through submit-slurm-job.sh's 5th argument): shard i builds every Nth case of the sorted benchmark list (index mod N == i-1). The two shards are deterministic, disjoint, and together cover all cases.
The frontier_amd Pre-Build step in test.yml submits shards 1/2 and 2/2 as background processes and waits on both, failing if either fails. Output/job-id files get unique -1-of-2/-2-of-2 suffixes (existing submit-slurm-job.sh behavior); the Print Logs and Archive Logs steps pick them up.

Two concurrency details, since both shards share one workspace:

The in-job staging clean is skipped when sharded (it would wipe the sibling shard's in-progress build); the workflow runs the clean once on the login node before submitting.
The case-optimized simulation builds land in per-case hash-named staging dirs (no collision), but syscheck/pre_process/post_process hash identically across these benchmarks. Shard 1 builds those shared targets first and drops a marker file; other shards wait for it, after which their builds no-op in the shared dirs. This avoids two cmake/ninja invocations ever running concurrently in the same staging directory.

Unchanged

No shard set = exactly the old behavior (all cases, in-job clean), so the Phoenix pre-build path and any direct invocation are unaffected.
The Frontier (non-AMD) path doesn't use the pre-build mechanism at all — it builds inside the GPU run job via run_case_optimization.sh — so it is left alone.

Validation

bash -n + shellcheck clean on both scripts; yaml.safe_load and ./mfc.sh precheck pass.
Shard partition logic reproduced in a standalone shell test against the real benchmark list: 1/2 -> {5eq_rk3_weno3_hllc, ibm, viscous_weno5_sgb_acoustic}, 2/2 -> {hypo_hll, igr}; union covers all 5 cases with zero overlap; malformed shard specs (0/2, 3/2, a/b, 12) are rejected.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses CI timeouts in the Frontier AMD case-optimization pipeline by sharding the slow AMD flang pre-build step into two concurrent SLURM submissions.

Changes:

Update the Frontier AMD pre-build workflow step to submit two shards concurrently and wait for both to complete.
Add shard-aware case selection to the prebuild script, plus coordination to avoid concurrent builds in shared staging directories.
Adjust log printing and artifact collection to include per-shard output files.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
.github/workflows/test.yml	Runs a one-time clean, submits two concurrent pre-build shards for Frontier AMD, and archives shard-specific logs.
.github/scripts/prebuild-case-optimization.sh	Adds optional `i/N` sharding and a marker-based synchronization scheme for shared build targets.

+shard="${job_shard:-}"
+if [ -n "$shard" ]; then
+    shard_idx="${shard%%/*}"
+    shard_count="${shard##*/}"
+    case "${shard_idx}${shard_count}" in
+        ''|*[!0-9]*) echo "ERROR: bad shard '$shard' (expected i/N)"; exit 1 ;;
+    esac
+    if [ "$shard" != "$shard_idx/$shard_count" ] || [ "$shard_idx" -lt 1 ] || [ "$shard_idx" -gt "$shard_count" ]; then
+        echo "ERROR: bad shard '$shard' (expected i/N with 1 <= i <= N)"; exit 1
+    fi
+fi


+if [ -n "$shard" ] && [ "$shard_count" -gt 1 ]; then
+    shared_marker="build/.prebuild-shared-targets-done"
+    set -- benchmarks/*/case.py
+    first_case="$1"
+    if [ "$shard_idx" -eq 1 ]; then
+        echo "=== Shard 1/$shard_count: building shared targets ==="
+        ./mfc.sh build -i "$first_case" -t syscheck pre_process post_process --case-optimization $gpu_opts -j 8
+        touch "$shared_marker"
+    else
+        echo "=== Shard $shard_idx/$shard_count: waiting for shard 1 to build shared targets ==="
+        waited=0
+        until [ -f "$shared_marker" ]; do
+            if [ "$waited" -ge 5400 ]; then
+                echo "ERROR: timed out waiting for $shared_marker"; exit 1
+            fi
+            sleep 30
+            waited=$((waited + 30))
+        done
+    fi
+fi


+        echo "=== Shard 1/$shard_count: building shared targets ==="
+        ./mfc.sh build -i "$first_case" -t syscheck pre_process post_process --case-optimization $gpu_opts -j 8
+        touch "$shared_marker"
+    else
+        echo "=== Shard $shard_idx/$shard_count: waiting for shard 1 to build shared targets ==="
+        waited=0
+        until [ -f "$shared_marker" ]; do
+            if [ "$waited" -ge 5400 ]; then
+                echo "ERROR: timed out waiting for $shared_marker"; exit 1
+            fi
+            sleep 30
+            waited=$((waited + 30))
+        done


Four non-AMD Frontier jobs (CCE gpu-omp x2, gpu-acc, cpu) died uniformly at ~51-55 min in yesterday's run; login-node build contention makes the 60-minute budget too tight on bad days.

sbryngelson · 2026-06-12T12:24:36Z

Added: the Frontier dependency-build step timeout is doubled to 120 minutes — four non-AMD Frontier jobs (including cpu) died uniformly at ~51-55 minutes in run 27371326575, consistent with login-node contention rather than code. With the prebuild sharding, this makes #1582 the complete Frontier-timeout fix.

It ran the full suite in one 1:59-walltime SLURM job (job 80982103050 died at the limit); the AMD gpu jobs were already sharded 2-way - the cpu job now matches. The job name gains the [i/2] suffix; if any branch-protection required-check pins the old unsharded name, update it.

sbryngelson · 2026-06-12T12:37:33Z

Third fix added: the Frontier AMD cpu test job ran the full suite in a single 1:59-walltime SLURM job (run 27398258302 died at the limit in Test) while the AMD gpu jobs were already sharded — it now gets the same 2-way sharding. Note the check name changes to include [1/2]/[2/2]; adjust branch protection if it pinned the old name.

Copilot review fixes: shard parts validated independently and the full i/N shape enforced (1/, /2, and bare 12 now rejected); shard 1 clears both markers at start so stale state from reruns cannot skip the wait; a failure marker written via ERR trap lets waiting shards fail fast instead of burning the 90-minute timeout.

sbryngelson · 2026-06-12T12:57:53Z

All three review findings addressed: shard validation now checks each part independently and enforces the full i/N shape (1/, /2, and slashless forms rejected); shard 1 clears both coordination markers at its own start, so stale state from reruns or manual invocations cannot skip the wait; and a failure marker written via ERR trap makes waiting shards fail fast with a pointer to shard 1's log instead of waiting out the 90-minute timeout. 22 standalone test cases over the validation and marker protocol, bash -n and shellcheck clean.

codecov · 2026-06-12T15:11:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.94%. Comparing base (ac5774e) to head (32b22ad).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1582   +/-   ##
=======================================
  Coverage   60.94%   60.94%           
=======================================
  Files          82       82           
  Lines       19922    19922           
  Branches     2924     2924           
=======================================
  Hits        12141    12141           
  Misses       5805     5805           
  Partials     1976     1976

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ci: shard the case-optimization pre-build into parallel SLURM jobs

d0b6d52

Copilot AI review requested due to automatic review settings June 12, 2026 12:13

sbryngelson mentioned this pull request Jun 12, 2026

CI: compiler cache (ccache) for case-optimization builds #1583

Closed

Copilot AI reviewed Jun 12, 2026

View reviewed changes

ci: double the Frontier dependency-build step timeout to 120 minutes

0567d2f

Four non-AMD Frontier jobs (CCE gpu-omp x2, gpu-acc, cpu) died uniformly at ~51-55 min in yesterday's run; login-node build contention makes the 60-minute budget too tight on bad days.

Copilot started reviewing on behalf of sbryngelson June 12, 2026 12:26 View session

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: shard the AMD case-optimization pre-build#1582

CI: shard the AMD case-optimization pre-build#1582
sbryngelson wants to merge 4 commits into
MFlowCode:masterfrom
sbryngelson:ci-shard-caseopt-prebuild

sbryngelson commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

sbryngelson commented Jun 12, 2026

Problem

Fix

Unchanged

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

sbryngelson commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants