Frontier AMD (flang, gpu-omp): viscous_shock_tube fails golden tolerance deterministically on master #1587

@sbryngelson

Symptom (reported by @danieljvickers, confirmed in CI run 27417122769 on a CI-files-only branch, i.e. master's code): test 70EC99CE (2D -> Example -> viscous_shock_tube) fails golden tolerance on Frontier with AMD flang + OpenMP offload. The first failing variable misses only marginally (absolute error 1.38e-3 against a 1.0e-3 tolerance); the large relative errors sit on near-zero golden values inside the shock. The failure is deterministic: three repeat runs produce the identical candidate value (1.0118017002832).
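To illustrate why a marginal absolute miss and large relative errors can coexist, here is a minimal sketch that assumes a plain absolute-tolerance check. The function name `compare` and all numeric inputs are illustrative assumptions, not values from the failing run, and the real golden-file criterion may combine absolute and relative tests differently.

```python
def compare(golden: float, candidate: float, abs_tol: float = 1.0e-3):
    """Return (abs_err, rel_err, passed) for one golden/candidate pair.

    Assumption: 'passed' is decided by the absolute error alone.
    """
    abs_err = abs(candidate - golden)
    rel_err = abs_err / abs(golden) if golden != 0.0 else float("inf")
    return abs_err, rel_err, abs_err <= abs_tol


# Illustrative values only (hypothetical, not from the CI log).
# An order-one value whose absolute error sits just over the tolerance:
marginal = compare(golden=1.0, candidate=1.00138)

# A near-zero golden value: tiny absolute error, very large relative error:
near_zero = compare(golden=1.0e-6, candidate=1.0e-4)

print("marginal :", marginal)
print("near_zero:", near_zero)
```

In this sketch the order-one pair fails on absolute error while its relative error stays small, and the near-zero pair shows a relative error of roughly 100x while staying well inside the absolute tolerance.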

Bisect evidence (real Frontier hardware, 3x repeats per ref):

Cross-backend evidence: the same test on the same refs PASSES under gpu-omp on NVIDIA (Phoenix), 3/3 both refs, and all CPU lanes are golden-green — so this is a bit-level FP-evaluation-ordering change specific to the AMD flang toolchain, within tolerance everywhere else and over the line on this one razor-margin case. Not wrong physics.
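As a self-contained illustration of how contraction alone can change bits, the sketch below compares an unfused multiply-add (two roundings) with an emulated fused multiply-add (one rounding). The emulation via exact rational arithmetic and the chosen inputs are assumptions made for the demonstration; it does not reproduce what AMD flang actually emits for the riemann kernels.

```python
from fractions import Fraction


def mul_add_unfused(a: float, b: float, c: float) -> float:
    # The product is rounded to double, then the sum is rounded again.
    return a * b + c


def mul_add_fused(a: float, b: float, c: float) -> float:
    # Emulated FMA: compute a*b + c exactly, then round once to double.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mul_add_unfused(a, b, c)
fused = mul_add_fused(a, b, c)

print("unfused:", unfused)
print("fused  :", fused)
```

Both results are valid under IEEE 754 rules, yet they differ, which is the kind of difference a compiler's inlining and contraction decisions can introduce without any change to the source.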

Leading hypothesis (if the midpoint passes): the #1556 module split introduced cross-module device-helper calls; cray_inline=True covers the Cray compiler, but AMD flang may decline to inline across modules, changing FMA/contraction order. Fix family: per-file inline flags for the riemann modules under LLVMFlang in cmake/MFCTargets.cmake (the per-file -Mnoinline NVHPC precedent exists), a flang-side inline hint in the GPU_ROUTINE macro, or — last resort — a per-backend tolerance for this case.
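For the last-resort option, a per-backend tolerance could be sketched as a small override lookup. Everything here is hypothetical: the table name, the backend label `amd-flang-gpu-omp`, and the placeholder override value are assumptions for illustration and do not reflect the existing test-suite code or a proposed number.

```python
DEFAULT_TOL = 1.0e-3  # tolerance quoted for this case in the symptom above

# Hypothetical override table keyed by (test UUID, backend label).
# The 2.0e-3 entry is a placeholder, not a recommended value.
TOLERANCE_OVERRIDES = {
    ("70EC99CE", "amd-flang-gpu-omp"): 2.0e-3,
}


def tolerance_for(uuid: str, backend: str) -> float:
    """Return the override for (uuid, backend) if present, else the default."""
    return TOLERANCE_OVERRIDES.get((uuid, backend), DEFAULT_TOL)


print(tolerance_for("70EC99CE", "amd-flang-gpu-omp"))
print(tolerance_for("70EC99CE", "nvhpc-gpu-omp"))
```

Keying on both the test and the backend keeps the relaxation from leaking to other lanes, which is why this shape is less risky than loosening the case's tolerance globally.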

Open secondary question: the #1556 pre-merge CI had both Frontier AMD gpu-omp shards green, which implies the 2D -> Example -> * tests were not in those shards' selection — a GPU CI coverage gap worth confirming and possibly its own issue.

Will update with the midpoint attribution and the fix PR. Part of the post-refactor audit series (cf. #1579, #1580).
