Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572
Phase-3: Riemann hot-path decomposition into shared GPU device helpers#1572sbryngelson wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR is the final stage of a stacked refactor series that decomposes the Riemann-solver hot paths into shared GPU device helpers, tightens multi-rank MPI broadcast correctness via registry-driven generation, and updates build/docs infrastructure to support these changes and performance gating.
Changes:
- Extracts shared, inlined
GPU_ROUTINE(seq, cray_inline)device helpers (mixture aggregation, interface Reynolds averaging, hypoelastic energy, HLLC star momentum flux) and reduces solver hot-loop duplication. - Moves MPI broadcast lists toward registry-driven generation while preserving manual “irregular” residues (struct roots, nested member patterns), addressing prior multi-rank divergence/UB hazards.
- Refactors CMake build plumbing into modular
.cmakeincludes and documents the GPU helper pattern.
Reviewed changes
Copilot reviewed 42 out of 45 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/simulation/m_riemann_state.fpp |
Adds shared GPU device helpers and consolidates per-sweep Riemann state logic. |
toolchain/mfc/params/generators/fortran_gen.py |
Implements generated MPI broadcast include emission and supporting classification/type-mapping helpers. |
cmake/MFCTargets.cmake |
Centralizes target setup/linking logic for MPI/OpenACC/OpenMP/GPU builds. |
cmake/Fypp.cmake |
Wires Fypp preprocessing with explicit generated-include dependencies. |
cmake/ParamsCodegen.cmake |
Adds build-time (ninja-tracked) generation of Fortran parameter include fragments. |
cmake/GPU.cmake |
Collects compiler/GPU-related flags and IPO/LTO handling into a dedicated module. |
| if (NOT TARGET OpenMP::OpenMP_Fortran) | ||
| message(FATAL_ERROR "OpenMP + Fortran is unsupported.") | ||
| endif() | ||
| set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) |
|
|
||
| foreach (a_target ${IPO_TARGETS}) | ||
| set_target_properties(${a_target} PROPERTIES Fortran_PREPROCESS ON) | ||
| message(STATUS ${CMAKE_Fortran_COMPILER_ID}) |
| target_compile_options(${ARGS_TARGET} | ||
| PRIVATE -gpu=mem:unified:managedalloc -cuda | ||
| ) | ||
| # "This option must appear in both the compile and link lines" -- NVHPC Docs | ||
| target_link_options(${ARGS_TARGET} | ||
| PRIVATE -gpu=mem:unified:managedalloc -cuda | ||
| ) |
| target_compile_options(${ARGS_TARGET} | ||
| PRIVATE -DFRONTIER_UNIFIED) | ||
| endif() |
| target_compile_options(${ARGS_TARGET} | ||
| PRIVATE -DFRONTIER_UNIFIED) | ||
| endif() |
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
|
Thanks Copilot — all five findings verified and addressed in b43e445. Context: cmake/MFCTargets.cmake was created by pure motion from the monolithic CMakeLists.txt (the split was verified by byte-identical configure graphs), so all five lines pre-exist on master — but four were worth fixing on substance: the unified-memory flags (the NVHPC managedalloc compile/link pair and both FRONTIER_UNIFIED definitions) now apply to |
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1572 +/- ##
==========================================
+ Coverage 60.94% 61.24% +0.30%
==========================================
Files 82 82
Lines 19922 19810 -112
Branches 2924 2902 -22
==========================================
- Hits 12141 12133 -8
+ Misses 5805 5711 -94
+ Partials 1976 1966 -10 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
b43e445 to
ecb2baa
Compare
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
ecb2baa to
d3bb0bd
Compare
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
…indrical source) Adds three GPU_ROUTINE(cray_inline) helpers to m_riemann_state and replaces the duplicated inline blocks in the HLL, HLLC, and LF solvers (HLLD untouched): s_accumulate_mixture_properties (14 call sites: hll 2, lf 2, hllc 10 across the 6-eq/4-eq/5-eq-bubbles/5-eq branches), s_compute_interface_reynolds (10 call sites: hll 2, lf 4, hllc 4), and f_compute_hllc_star_momentum_flux (6 call sites: hllc cylindrical and azimuthal geometric source flux). NOT NFC: calls replace inline arithmetic, so floating-point ordering may shift within golden tolerance. The hllc 5-eq Re_max Reynolds variant is replaced by the shared form: live lanes are arithmetically identical; previously-uninitialized dead lanes (Re_size(i)==0) now get the defined 1/sgm_eps value. The hllc 5-eq branch loads post-limiter alpha into new alpha_lim arrays so alpha_L/R keep their pre-limiter values used downstream. The HLL/LF cylindrical source blocks and the hllc 5-eq-bubbles (1-alpha) Reynolds variant use different arithmetic and stay inline.
…dead solver copies Implements Task 2 of the Phase-3 Riemann decomposition plan per Spencer's rulings (2026-06-11) on the BLOCKED-ON-PHYSICS adjudication: 1. Energy gate: the helper gates on G > verysmall, not HLL's hard-coded 1000 floor; HLL's TODO asking for its removal is honored and deleted. This is a deliberate behavior change for soft-G (G in (verysmall, 1000]) hypoelastic runs; golden tests are unaffected (test shear moduli are 5e4/1e5). 2. Damage scaling: the (1 - damage) factor applies whenever cont_damage is on (HLL's existing behavior); the helper carries it. 3. Dead copies: the hypoelastic energy-loading blocks in HLLC (both instances) and LF are deleted, since case_validator.py:577 restricts hypoelasticity to the HLL solver. LF's G_L/G_R locals became unused and are removed from the declarations and the private list; HLLC's tau_e/G locals remain in use by its hyperelastic and elastic flux/wave-speed code. 4. Wave-speed corrections are untouched in all solvers; the hyper-tau_e divergence is filed upstream separately. The helper uses the shear_indices-based energy doubling (HLL's dimension-correct form, not the hard-coded i==2/4/5). The caller loads the SF-indexed stress and damage values and passes them in, keeping tau_e_L/R available for the stress-flux and wave-speed blocks; the mixture moduli G_L/G_R are intent(out) for the elastic wave speeds.
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
d3bb0bd to
a577f15
Compare
Flagged by Copilot review on MFlowCode#1572; all pre-existing on master, carried verbatim by the CMakeLists split. Unified-memory flags (NVHPC managedalloc pair, FRONTIER_UNIFIED x2) now apply to a_target so the NVHPC two-pass IPO lib target gets them too; the set(ENV{OMP_TARGET_OFFLOAD} [MANDATORY]) line is deleted outright - set(ENV) only affects the configure-time process so it never influenced built binaries, and its bracketed value was a literal besides; a leftover debug message(STATUS) removed.
… HLLC; doc fixes Guard inventory (every tau_e/G consumer in the two files was classified by its exact guard before deletion): - LF: hypoelastic momentum flux (else if hypoelasticity), hypoelastic energy flux (else if hypoelasticity), stress evolution flux (if hypoelasticity), cylindrical tau_sigmasigma source (if cyl_coord .and. hypoelasticity) - all deleted. - HLLC: stress evolution flux (if hypoelasticity), both 6-eqn and 5-eqn copies - deleted. The hyperelasticity-gated tau_e/G/xi_field loads and the elasticity-gated wave speeds and elastic momentum/energy fluxes are untouched (live under hyperelasticity, which is legal with all solvers). - HLL is untouched. case_validator.py prohibits hypoelasticity with any solver but HLL, so these blocks were unreachable; after PR MFlowCode#1572 deleted the hypoelastic energy blocks they consumed tau_e_L/R that nothing in these files loads. Per the maintainer ruling: prefer deleting code - lifting the restriction later means adding the code back deliberately. Unused locals removed from LF: tau_e_L/R and flux_tau_L/R (zero refs after deletion) and xi_field_L/R (orphaned since the solver split, never referenced in LF) - declarations and GPU private-list entries. HLLC declarations are unchanged (tau_e/G remain live on the hyper/elasticity paths). GPU census: LF -7 acc seq-loop directives (2 deleted seq loops x 3 sweep copies + 1 cyl-only), HLLC -6 (1 seq loop x 3 copies x 2 blocks), OMP path unchanged (seq loops emit nothing under OMP); all deltas trace to deleted blocks. Doc fixes: rescope the cray_inline guidance to cross-module helpers only, describe the AMD case-opt guard as sitting on the helper's dummy declaration with matching caller-local guards, and note in s_compute_hypoelastic_interface_energy that the verysmall gate is the deliberate ruling that replaced HLL's former G > 1000 stability floor and retired its TODO.
Code reviewFound 2 issues, both fixed in fc7f770:
Verified sound across five independent review passes: helper algebra bitwise-faithful at all 30 call sites (sampled), no upstream hotfixes lost in any replaced block (history audit vs #1545/#1556 merges), Herschel-Bulkley viscous arguments preserved, cmake 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
|
Device validation complete on the current head's content (860ecb3b1): full 594-test suite run on GPU nodes under both backends — OpenACC and OpenMP offload — ALL TESTS PASSED on each, alongside the 4-way CPU golden gate. Every change in this PR now has empirical GPU evidence in addition to the emission-level proofs. |
Summary
Per-solver benchmark coverage.
5eq_rk3_weno3_hlland5eq_rk3_weno3_lfjoin the benchmark suite (previously only HLLC and hypo-HLL had solver-specific coverage), so the performance gates below actually measure every solver the changes touch.Tier-1 shared device helpers (
m_riemann_state.fpp, allGPU_ROUTINE(seq, cray_inline)): mixture-properties aggregation (14 call sites across HLL/HLLC/LF), interface-Reynolds averaging (10 sites), and the HLLC star-momentum flux expression (6 sites). Net −50 LOC from the hot loops. Each conversion verified bitwise-identical on live lanes by independent review; blocks whose arithmetic genuinely differs between solvers were deliberately left inline rather than homogenized. Incidental fix: dead viscous lanes previously stored2/(1/uninitialized)intoRe_avg— now a defined value (fixes the UB/FP-trap hazard).Hypoelastic interface energy unified under maintainer rulings (recorded in-repo): one shared helper with the
verysmallgate (HLL'sG > 1000stability floor and itsTODO take outcomment retired), unconditional damage scaling, and the dimension-correctshear_indicesdoubling test; the dead hypoelastic energy blocks in HLLC (×2) and LF deleted outright (the case validator restricts hypoelasticity to HLL). Net −56 LOC. Wave-speed estimates untouched everywhere.Performance gate (<5% ceiling, maintainer-set)
Simulation grind-time ratios vs. the pre-series baseline, all benchmark cases:
5eq_rk3_weno3_hllc1.02 (faster)hypo_hll1.03 (faster)(Ratios are speedups: >1 is faster. Pre/post exec scatter is sub-second I/O noise, documented in the working notes.)
Behavior notes
verysmall < G ≤ 1000) now receive elastic energy the stability floor previously suppressed (maintainer ruling), and dead-laneRe_avgvalues changed from garbage to defined (consumers gated; results-inert).Re_avglanes).Docs
gpuParallelization.mdgains a "Writing GPU_ROUTINE Device Helpers" section (the macro was previously undocumented): the FLOP threshold, theseq+cray_inlineidiom and why Cray requires it, the caller-loads/helper-computes pattern, AMD case-optimization dimensioning, and declare-scoping — with a real helper from this PR as the worked example.