Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588
Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588sbryngelson wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses an AMD flang OpenMP-offload GPU correctness regression (post-#1556 module split) where Re_size could remain zero on-device in solver translation units, effectively disabling viscous terms without an error. The fix makes Re_size allocatable so it is handled through the OpenMP runtime mapping machinery (consistent with Re_idx / Res_gs) and thus behaves correctly across translation units.
Changes:
- Change
Re_sizefrom a staticinteger, dimension(2)tointeger, allocatable, dimension(:). - Allocate
Re_size(1:2)during global-parameters initialization before first use and keep initialization to zero. - Deallocate
Re_sizein the module finalization path.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1588 +/- ##
=======================================
Coverage 60.94% 60.94%
=======================================
Files 82 82
Lines 19922 19925 +3
Branches 2924 2924
=======================================
+ Hits 12141 12144 +3
Misses 5805 5805
Partials 1976 1976 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
baf0884 to
ec9c371
Compare
…host-capture (firstprivate) The declare-target-roulette case shows two identical declare-target arrays disagreeing in one kernel based only on the push mechanism (enter data map vs update to), verified on ROCm 7.2.0 and AFAR drop 23.2.0. Also corrects declare-target-static-tu's Fix section: making the variable allocatable is NOT reliable in the full app (it only moves the bug -- an inversion), the robust fix is host-capture + firstprivate (MFlowCode/MFC#1588).
dd1eef1 to
203ba70
Compare
…plete Riemann private lists) AMD flang reads the static declare-target Re_size stale across translation units, silently disabling viscosity. Host-capture it into Re_size_loc and firstprivate it into the Riemann kernels for all compilers. That firstprivate clause exposed a latent bug: several per-cell scalars (s_M/s_P/xi_M/xi_P in HLL, c_sum_Yi_Phi/flux_ene_e in HLLC, Gamm_L/Gamm_R/flux_tau_L/flux_tau_R in LF) were omitted from private() and relied on CCE-19's defaultmap(firstprivate:scalar) auto-privatization, which the firstprivate clause disrupts (gross wrong physics on Cray). Complete the private lists -- a no-op on every compiler -- so firstprivate is safe everywhere; no compiler gate needed. Verified on Frontier: AMD-flang 29/29 and Cray-CCE (rs=1/2/5) of the previously-failing cases pass.
203ba70 to
0d995d5
Compare
The host-capture firstprivate(Re_size_loc) added to the LF/HLL/HLLC solvers (the AMD-flang viscosity fix) pushed the longest GPU_PARALLEL_LOOP directive past nvfortran's ~1000-char source-line limit (1011 chars -> 'source line too long' / unbalanced parentheses; AMD and Cray accept the long line, nvfortran caps it). FOLD_DIRECTIVE wraps the assembled directive at whole-clause boundaries with repeated sentinels (!$acc&/!$omp&), so the longest emitted line is prefix+longest-single-clause (817 chars on LF) -- shorter than the single line a build with one fewer clause already compiles. firstprivate is preserved (AMD/Cray correctness) on its own continuation. Validated: nvfortran 25.5 -acc and -mp=gpu compile+run the folded marker-interleaved continuation correctly; the cpp line-markers between continuations are already emitted by master for regular Fortran continuations and compile on Cray/AMD CI. Short directives (<200 chars) are unchanged.
Closes #1587. (The branch name is stale — this no longer makes
Re_sizeallocatable; see below.)Summary
Fixes silently-wrong viscous results on the Frontier AMD-flang OpenMP-offload GPU build — the
2D_viscous_shock_tuberegression introduced by #1556. 3 files, solver-only;Re_sizestays static.Root cause
On AMD flang OpenMP offload, a cross-translation-unit read of the static
declare targetRe_sizeinside a Riemann-solver kernel is unreliable and codegen-dependent: some solver kernels read it as0, soif (Re_size(i) > 0)is false,Re_L/Re_Rcollapse todflt_real, the interface Reynolds number becomes1e16, and viscosity is silently switched off. CPU and NVIDIA (nvfortran) builds are unaffected — verified across all 14 NVHPC CI lanes.Why not "make
Re_sizeallocatable" (the previous version of this PR)That approach only moved the bug. On AMD flang it fixed HLL/HLLC but broke the LF (
riemann_solver=5) and IBM viscous paths — they pass on master and failed with the allocatable change (an inversion: static breaks some solver kernels, allocatable breaks others). The defect is per-kernel codegen, not a clean static-vs-allocatable rule, so no choice of storage class is reliable for every kernel. (Confirmed on hardware: per-solverGPU_UPDATEworkarounds also only fixed some kernels.)The fix
Keep
Re_sizestatic; in each ofs_hll/hllc/lf_riemann_solver, capture it into a host-set localRe_size_locandfirstprivatethat value into the viscous kernels, so they never readRe_sizefrom declare-target device memory:Removing the declare-target read removes the variable the compiler bug acts on, so it's immune regardless of which kernel or solver.
Verification (Frontier AMD flang, gpu-omp)
All five previously-affected viscous tests pass with this fix:
70EC99CEviscous_shock_tube,8EAC3DA71D viscous canary,27B24CD9+4C812637LF-viscous (the two the allocatable approach broke),B0CE19C5IBM-viscous.format+precheckclean.Upstream
This is a workaround for an
amdflangbug, filed with a standalone minimal reproducer: ROCm/llvm-project#2890, llvm/llvm-project#203711 (still reproduces on the newest amdflang, AFAR 23.2.0 / LLVM 23). The always-run viscous-GPU canary in #1590 is what surfaced the inversion before merge.