Skip to content

Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572

Open
sbryngelson wants to merge 6 commits into
MFlowCode:masterfrom
sbryngelson:phase3-riemann
Open

Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572
sbryngelson wants to merge 6 commits into
MFlowCode:masterfrom
sbryngelson:phase3-riemann

Conversation

@sbryngelson

Copy link
Copy Markdown
Member

Stacked on #1551#1552#1553#1555#1556 — its own content is the top 4 commits. Completes the refactoring roadmap's Phase 3 (hot-path decomposition), the final phase of the series.

Summary

  1. Per-solver benchmark coverage. 5eq_rk3_weno3_hll and 5eq_rk3_weno3_lf join the benchmark suite (previously only HLLC and hypo-HLL had solver-specific coverage), so the performance gates below actually measure every solver the changes touch.

  2. Tier-1 shared device helpers (m_riemann_state.fpp, all GPU_ROUTINE(seq, cray_inline)): mixture-properties aggregation (14 call sites across HLL/HLLC/LF), interface-Reynolds averaging (10 sites), and the HLLC star-momentum flux expression (6 sites). Net −50 LOC from the hot loops. Each conversion verified bitwise-identical on live lanes by independent review; blocks whose arithmetic genuinely differs between solvers were deliberately left inline rather than homogenized. Incidental fix: dead viscous lanes previously stored 2/(1/uninitialized) into Re_avg — now a defined value (fixes the UB/FP-trap hazard).

  3. Hypoelastic interface energy unified under maintainer rulings (recorded in-repo): one shared helper with the verysmall gate (HLL's G > 1000 stability floor and its TODO take out comment retired), unconditional damage scaling, and the dimension-correct shear_indices doubling test; the dead hypoelastic energy blocks in HLLC (×2) and LF deleted outright (the case validator restricts hypoelasticity to HLL). Net −56 LOC. Wave-speed estimates untouched everywhere.

Performance gate (<5% ceiling, maintainer-set)

Simulation grind-time ratios vs. the pre-series baseline, all benchmark cases:

stage range watched case
Tier-1 helpers 0.99–1.04 5eq_rk3_weno3_hllc 1.02 (faster)
Hypoelastic unification 0.98–1.03 hypo_hll 1.03 (faster)

(Ratios are speedups: >1 is faster. Pre/post exec scatter is sub-second I/O noise, documented in the working notes.)

Behavior notes

Docs

gpuParallelization.md gains a "Writing GPU_ROUTINE Device Helpers" section (the macro was previously undocumented): the FLOP threshold, the seq+cray_inline idiom and why Cray requires it, the caller-loads/helper-computes pattern, AMD case-optimization dimensioning, and declare-scoping — with a real helper from this PR as the worked example.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is the final stage of a stacked refactor series that decomposes the Riemann-solver hot paths into shared GPU device helpers, tightens multi-rank MPI broadcast correctness via registry-driven generation, and updates build/docs infrastructure to support these changes and performance gating.

Changes:

  • Extracts shared, inlined GPU_ROUTINE(seq, cray_inline) device helpers (mixture aggregation, interface Reynolds averaging, hypoelastic energy, HLLC star momentum flux) and reduces solver hot-loop duplication.
  • Moves MPI broadcast lists toward registry-driven generation while preserving manual “irregular” residues (struct roots, nested member patterns), addressing prior multi-rank divergence/UB hazards.
  • Refactors CMake build plumbing into modular .cmake includes and documents the GPU helper pattern.

Reviewed changes

Copilot reviewed 42 out of 45 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/simulation/m_riemann_state.fpp Adds shared GPU device helpers and consolidates per-sweep Riemann state logic.
toolchain/mfc/params/generators/fortran_gen.py Implements generated MPI broadcast include emission and supporting classification/type-mapping helpers.
cmake/MFCTargets.cmake Centralizes target setup/linking logic for MPI/OpenACC/OpenMP/GPU builds.
cmake/Fypp.cmake Wires Fypp preprocessing with explicit generated-include dependencies.
cmake/ParamsCodegen.cmake Adds build-time (ninja-tracked) generation of Fortran parameter include fragments.
cmake/GPU.cmake Collects compiler/GPU-related flags and IPO/LTO handling into a dedicated module.

Comment thread cmake/MFCTargets.cmake Outdated
if (NOT TARGET OpenMP::OpenMP_Fortran)
message(FATAL_ERROR "OpenMP + Fortran is unsupported.")
endif()
set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY])
Comment thread cmake/MFCTargets.cmake Outdated

foreach (a_target ${IPO_TARGETS})
set_target_properties(${a_target} PROPERTIES Fortran_PREPROCESS ON)
message(STATUS ${CMAKE_Fortran_COMPILER_ID})
Comment thread cmake/MFCTargets.cmake Outdated
Comment on lines +212 to +218
target_compile_options(${ARGS_TARGET}
PRIVATE -gpu=mem:unified:managedalloc -cuda
)
# "This option must appear in both the compile and link lines" -- NVHPC Docs
target_link_options(${ARGS_TARGET}
PRIVATE -gpu=mem:unified:managedalloc -cuda
)
Comment thread cmake/MFCTargets.cmake Outdated
Comment on lines +229 to +231
target_compile_options(${ARGS_TARGET}
PRIVATE -DFRONTIER_UNIFIED)
endif()
Comment thread cmake/MFCTargets.cmake Outdated
Comment on lines +242 to +244
target_compile_options(${ARGS_TARGET}
PRIVATE -DFRONTIER_UNIFIED)
endif()
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 11, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
@sbryngelson

Copy link
Copy Markdown
Member Author

Thanks Copilot — all five findings verified and addressed in b43e445. Context: cmake/MFCTargets.cmake was created by pure motion from the monolithic CMakeLists.txt (the split was verified by byte-identical configure graphs), so all five lines pre-exist on master — but four were worth fixing on substance: the unified-memory flags (the NVHPC managedalloc compile/link pair and both FRONTIER_UNIFIED definitions) now apply to ${a_target} so the NVHPC two-pass IPO lib target receives them consistently, and the leftover debug message(STATUS) is gone. The set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line was deleted outright rather than de-bracketed: set(ENV{...}) only affects the configure-time CMake process environment, so it never influenced the built binaries regardless of the bracket bug.

sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 79.76190% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.24%. Comparing base (ac5774e) to head (fc7f770).

Files with missing lines Patch % Lines
src/simulation/m_riemann_solver_hllc.fpp 58.62% 12 Missing ⚠️
src/simulation/m_riemann_state.fpp 92.10% 0 Missing and 3 partials ⚠️
src/simulation/m_riemann_solver_hll.fpp 81.81% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1572      +/-   ##
==========================================
+ Coverage   60.94%   61.24%   +0.30%     
==========================================
  Files          82       82              
  Lines       19922    19810     -112     
  Branches     2924     2902      -22     
==========================================
- Hits        12141    12133       -8     
+ Misses       5805     5711      -94     
+ Partials     1976     1966      -10     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
…indrical source)

Adds three GPU_ROUTINE(cray_inline) helpers to m_riemann_state and replaces the duplicated inline blocks in the HLL, HLLC, and LF solvers (HLLD untouched): s_accumulate_mixture_properties (14 call sites: hll 2, lf 2, hllc 10 across the 6-eq/4-eq/5-eq-bubbles/5-eq branches), s_compute_interface_reynolds (10 call sites: hll 2, lf 4, hllc 4), and f_compute_hllc_star_momentum_flux (6 call sites: hllc cylindrical and azimuthal geometric source flux).

NOT NFC: calls replace inline arithmetic, so floating-point ordering may shift within golden tolerance. The hllc 5-eq Re_max Reynolds variant is replaced by the shared form: live lanes are arithmetically identical; previously-uninitialized dead lanes (Re_size(i)==0) now get the defined 1/sgm_eps value. The hllc 5-eq branch loads post-limiter alpha into new alpha_lim arrays so alpha_L/R keep their pre-limiter values used downstream. The HLL/LF cylindrical source blocks and the hllc 5-eq-bubbles (1-alpha) Reynolds variant use different arithmetic and stay inline.
…dead solver copies

Implements Task 2 of the Phase-3 Riemann decomposition plan per Spencer's rulings (2026-06-11) on the BLOCKED-ON-PHYSICS adjudication:

1. Energy gate: the helper gates on G > verysmall, not HLL's hard-coded 1000 floor; HLL's TODO asking for its removal is honored and deleted. This is a deliberate behavior change for soft-G (G in (verysmall, 1000]) hypoelastic runs; golden tests are unaffected (test shear moduli are 5e4/1e5).

2. Damage scaling: the (1 - damage) factor applies whenever cont_damage is on (HLL's existing behavior); the helper carries it.

3. Dead copies: the hypoelastic energy-loading blocks in HLLC (both instances) and LF are deleted, since case_validator.py:577 restricts hypoelasticity to the HLL solver. LF's G_L/G_R locals became unused and are removed from the declarations and the private list; HLLC's tau_e/G locals remain in use by its hyperelastic and elastic flux/wave-speed code.

4. Wave-speed corrections are untouched in all solvers; the hyper-tau_e divergence is filed upstream separately.

The helper uses the shear_indices-based energy doubling (HLL's dimension-correct form, not the hard-coded i==2/4/5). The caller loads the SF-indexed stress and damage values and passes them in, keeping tau_e_L/R available for the stress-flux and wave-speed blocks; the mixture moduli G_L/G_R are intent(out) for the elastic wave speeds.
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 12, 2026
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
… HLLC; doc fixes

Guard inventory (every tau_e/G consumer in the two files was classified by its
exact guard before deletion):
- LF: hypoelastic momentum flux (else if hypoelasticity), hypoelastic energy
  flux (else if hypoelasticity), stress evolution flux (if hypoelasticity),
  cylindrical tau_sigmasigma source (if cyl_coord .and. hypoelasticity) - all
  deleted.
- HLLC: stress evolution flux (if hypoelasticity), both 6-eqn and 5-eqn copies
  - deleted. The hyperelasticity-gated tau_e/G/xi_field loads and the
  elasticity-gated wave speeds and elastic momentum/energy fluxes are
  untouched (live under hyperelasticity, which is legal with all solvers).
- HLL is untouched.

case_validator.py prohibits hypoelasticity with any solver but HLL, so these
blocks were unreachable; after PR MFlowCode#1572 deleted the hypoelastic energy blocks
they consumed tau_e_L/R that nothing in these files loads. Per the maintainer
ruling: prefer deleting code - lifting the restriction later means adding the
code back deliberately.

Unused locals removed from LF: tau_e_L/R and flux_tau_L/R (zero refs after
deletion) and xi_field_L/R (orphaned since the solver split, never referenced
in LF) - declarations and GPU private-list entries. HLLC declarations are
unchanged (tau_e/G remain live on the hyper/elasticity paths).

GPU census: LF -7 acc seq-loop directives (2 deleted seq loops x 3 sweep
copies + 1 cyl-only), HLLC -6 (1 seq loop x 3 copies x 2 blocks), OMP path
unchanged (seq loops emit nothing under OMP); all deltas trace to deleted
blocks.

Doc fixes: rescope the cray_inline guidance to cross-module helpers only,
describe the AMD case-opt guard as sitting on the helper's dummy declaration
with matching caller-local guards, and note in
s_compute_hypoelastic_interface_energy that the verysmall gate is the
deliberate ruling that replaced HLL's former G > 1000 stability floor and
retired its TODO.
@sbryngelson

Copy link
Copy Markdown
Member Author

Code review

Found 2 issues, both fixed in fc7f770:

  1. The hypoelastic energy-block deletions left LF and HLLC with hypoelasticity-gated flux/source blocks consuming tau_e_L/R that nothing loads (flagged independently by three review passes). Unreachable through the toolchain (case_validator.py:577 restricts hypoelasticity to HLL), but a latent-UB trap if that restriction is ever lifted. Resolved by completing the dead-copy deletion under the same maintainer ruling: every hypoelasticity-gated block in LF/HLLC removed after a guard-by-guard inventory; all elasticity/hyperelasticity-gated code untouched (with HLLC, elasticity implies hyperelasticity, so the kept consumers are always fed by the kept loads). −60 LOC of dead code, plus orphaned locals.

https://github.com/MFlowCode/MFC/blob/a577f15e09134f4f12bd49646bee9ed8fcab84a0/src/simulation/m_riemann_solver_lf.fpp#L443-L450

  1. The new GPU_ROUTINE documentation overstated cray_inline=True as required "always" while two same-module helpers in the same file correctly omit it, and misplaced the AMD dimensioning guard as "caller-side". Both passages corrected (cray_inline scoped to cross-module helpers; the guard sits on the helper's dummy declaration), and the hypoelastic helper's doc comment now records the deliberate verysmall-gate ruling.

https://github.com/MFlowCode/MFC/blob/a577f15e09134f4f12bd49646bee9ed8fcab84a0/docs/documentation/gpuParallelization.md#L636-L640

Verified sound across five independent review passes: helper algebra bitwise-faithful at all 30 call sites (sampled), no upstream hotfixes lost in any replaced block (history audit vs #1545/#1556 merges), Herschel-Bulkley viscous arguments preserved, cmake a_target fixes correct, all five prior Copilot points confirmed addressed, merge-base current. Known and deliberately preserved pending physics rulings: #1570 (damage-R offset), #1557/#1571 — declared in the PR description.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@sbryngelson

Copy link
Copy Markdown
Member Author

Device validation complete on the current head's content (860ecb3b1): full 594-test suite run on GPU nodes under both backends — OpenACC and OpenMP offload — ALL TESTS PASSED on each, alongside the 4-way CPU golden gate. Every change in this PR now has empirical GPU evidence in addition to the emission-level proofs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants