Skip to content

Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588

Open
sbryngelson wants to merge 3 commits into
masterfrom
fix-amdflang-re-size-allocatable
Open

Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588
sbryngelson wants to merge 3 commits into
masterfrom
fix-amdflang-re-size-allocatable

Conversation

@sbryngelson

@sbryngelson sbryngelson commented Jun 13, 2026

Copy link
Copy Markdown
Member

Closes #1587. (The branch name is stale — this no longer makes Re_size allocatable; see below.)

Summary

Fixes silently-wrong viscous results on the Frontier AMD-flang OpenMP-offload GPU build — the 2D_viscous_shock_tube regression introduced by #1556. 3 files, solver-only; Re_size stays static.

Root cause

On AMD flang OpenMP offload, a cross-translation-unit read of the static declare target Re_size inside a Riemann-solver kernel is unreliable and codegen-dependent: some solver kernels read it as 0, so if (Re_size(i) > 0) is false, Re_L/Re_R collapse to dflt_real, the interface Reynolds number becomes 1e16, and viscosity is silently switched off. CPU and NVIDIA (nvfortran) builds are unaffected — verified across all 14 NVHPC CI lanes.

Why not "make Re_size allocatable" (the previous version of this PR)

That approach only moved the bug. On AMD flang it fixed HLL/HLLC but broke the LF (riemann_solver=5) and IBM viscous paths — they pass on master and failed with the allocatable change (an inversion: static breaks some solver kernels, allocatable breaks others). The defect is per-kernel codegen, not a clean static-vs-allocatable rule, so no choice of storage class is reliable for every kernel. (Confirmed on hardware: per-solver GPU_UPDATE workarounds also only fixed some kernels.)

The fix

Keep Re_size static; in each of s_hll/hllc/lf_riemann_solver, capture it into a host-set local Re_size_loc and firstprivate that value into the viscous kernels, so they never read Re_size from declare-target device memory:

integer, dimension(2) :: Re_size_loc   ! host copy
...
Re_size_loc = Re_size
$:GPU_PARALLEL_LOOP(collapse=3, firstprivate='[Re_size_loc]', private='[...]')
...
do q = 1, Re_size_loc(i)   ! was Re_size(i)

Removing the declare-target read removes the variable the compiler bug acts on, so it's immune regardless of which kernel or solver.

Verification (Frontier AMD flang, gpu-omp)

All five previously-affected viscous tests pass with this fix:
70EC99CE viscous_shock_tube, 8EAC3DA7 1D viscous canary, 27B24CD9 + 4C812637 LF-viscous (the two the allocatable approach broke), B0CE19C5 IBM-viscous. format + precheck clean.

Upstream

This is a workaround for an amdflang bug, filed with a standalone minimal reproducer: ROCm/llvm-project#2890, llvm/llvm-project#203711 (still reproduces on the newest amdflang, AFAR 23.2.0 / LLVM 23). The always-run viscous-GPU canary in #1590 is what surfaced the inversion before merge.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses an AMD flang OpenMP-offload GPU correctness regression (post-#1556 module split) where Re_size could remain zero on-device in solver translation units, effectively disabling viscous terms without an error. The fix makes Re_size allocatable so it is handled through the OpenMP runtime mapping machinery (consistent with Re_idx / Res_gs) and thus behaves correctly across translation units.

Changes:

  • Change Re_size from a static integer, dimension(2) to integer, allocatable, dimension(:).
  • Allocate Re_size(1:2) during global-parameters initialization before first use and keep initialization to zero.
  • Deallocate Re_size in the module finalization path.

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 42.85714% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.94%. Comparing base (ac5774e) to head (ecf7f22).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/simulation/m_riemann_solver_hllc.fpp 50.00% 3 Missing and 2 partials ⚠️
src/simulation/m_riemann_solver_lf.fpp 42.85% 0 Missing and 4 partials ⚠️
src/simulation/m_riemann_solver_hll.fpp 25.00% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1588   +/-   ##
=======================================
  Coverage   60.94%   60.94%           
=======================================
  Files          82       82           
  Lines       19922    19925    +3     
  Branches     2924     2924           
=======================================
+ Hits        12141    12144    +3     
  Misses       5805     5805           
  Partials     1976     1976           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sbryngelson sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from baf0884 to ec9c371 Compare June 13, 2026 19:49
@sbryngelson sbryngelson changed the title Make Re_size allocatable: fix silent viscosity loss on AMD flang GPU Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers) Jun 13, 2026
sbryngelson added a commit to sbryngelson/compiler-bugs that referenced this pull request Jun 13, 2026
…host-capture (firstprivate)

The declare-target-roulette case shows two identical declare-target arrays disagreeing in one kernel based only on the push mechanism (enter data map vs update to), verified on ROCm 7.2.0 and AFAR drop 23.2.0. Also corrects declare-target-static-tu's Fix section: making the variable allocatable is NOT reliable in the full app (it only moves the bug -- an inversion), the robust fix is host-capture + firstprivate (MFlowCode/MFC#1588).
@sbryngelson sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch 2 times, most recently from dd1eef1 to 203ba70 Compare June 14, 2026 02:45
…plete Riemann private lists)

AMD flang reads the static declare-target Re_size stale across translation units, silently disabling viscosity. Host-capture it into Re_size_loc and firstprivate it into the Riemann kernels for all compilers. That firstprivate clause exposed a latent bug: several per-cell scalars (s_M/s_P/xi_M/xi_P in HLL, c_sum_Yi_Phi/flux_ene_e in HLLC, Gamm_L/Gamm_R/flux_tau_L/flux_tau_R in LF) were omitted from private() and relied on CCE-19's defaultmap(firstprivate:scalar) auto-privatization, which the firstprivate clause disrupts (gross wrong physics on Cray). Complete the private lists -- a no-op on every compiler -- so firstprivate is safe everywhere; no compiler gate needed. Verified on Frontier: AMD-flang 29/29 and Cray-CCE (rs=1/2/5) of the previously-failing cases pass.
@sbryngelson sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from 203ba70 to 0d995d5 Compare June 14, 2026 02:59
The host-capture firstprivate(Re_size_loc) added to the LF/HLL/HLLC solvers (the AMD-flang viscosity fix) pushed the longest GPU_PARALLEL_LOOP directive past nvfortran's ~1000-char source-line limit (1011 chars -> 'source line too long' / unbalanced parentheses; AMD and Cray accept the long line, nvfortran caps it). FOLD_DIRECTIVE wraps the assembled directive at whole-clause boundaries with repeated sentinels (!$acc&/!$omp&), so the longest emitted line is prefix+longest-single-clause (817 chars on LF) -- shorter than the single line a build with one fewer clause already compiles. firstprivate is preserved (AMD/Cray correctness) on its own continuation. Validated: nvfortran 25.5 -acc and -mp=gpu compile+run the folded marker-interleaved continuation correctly; the cpp line-markers between continuations are already emitted by master for regular Fortran continuations and compile on Cray/AMD CI. Short directives (<200 chars) are unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Frontier AMD (flang, gpu-omp): viscous_shock_tube fails golden tolerance deterministically on master

2 participants