Skip to content

bugfix: allowNonConverged warning instead of aborting#4075

Open
MelReyCG wants to merge 6 commits into
developfrom
bugfix/rey/repair-allowNonConverged
Open

bugfix: allowNonConverged warning instead of aborting#4075
MelReyCG wants to merge 6 commits into
developfrom
bugfix/rey/repair-allowNonConverged

Conversation

@MelReyCG

@MelReyCG MelReyCG commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

When max sub-steps are exceeded and allowNonConverged is enabled, the solver now warns (rank 0) and continues instead of aborting (this PR assumes allowNonConverged was intended to let the simulation continue even when not converging).

@MelReyCG MelReyCG self-assigned this Jun 4, 2026
@MelReyCG MelReyCG added ci: run CUDA builds Allows to triggers (costly) CUDA jobs flag: ready for review ci: run integrated tests Allows to run the integrated tests in GEOS CI labels Jun 4, 2026
@MelReyCG MelReyCG changed the title bugfix: allowNonConverged bugfix: allowNonConverged warning instead of aborting Jun 4, 2026
Comment thread src/coreComponents/physicsSolvers/PhysicsSolverBase.cpp Outdated
Co-authored-by: Dickson Kachuma <81433670+dkachuma@users.noreply.github.com>
@rrsettgast

rrsettgast commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

This flag seems like a bad idea. Can someone explain what measure will determine that results are valid after violating the desired tolerance for the governing equations? How critical is the violation? It seems that there can be repeated violations? Can every timestep in a simulation violate and complete? What if the residual is very high? Is there a criteria for termination then?

@joshua-white

joshua-white commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

What was the previous behavior if this option was enabled? How does this change interact with other checks within the physics solvers themselves, e.g.

if( !isConfigurationLoopConverged )
{
GEOS_LOG_RANK_0( "Convergence not achieved." );
if( allowNonConverged )
{
GEOS_LOG_RANK_0( "The accepted solution may be inaccurate." );
}
else
{
GEOS_ERROR( "Nonconverged solutions not allowed. Terminating...", getDataContext() );
}
}

@joshua-white

Copy link
Copy Markdown
Contributor

If we keep this, I would suggest two changes:

  1. Modify the XML key so the user is very clear they are entering a danger zone. I would suggest making the string something like DANGER_ZONE_allowNonConverged so the user has to markup their input deck with an obvious "I know what I am doing" flag.
  2. We could force modify the output file string to automatically append something like myOutputName_UNRELIABLE so that GEOS acknowledges it itself doesn't trust the results.

I can see selected situations where for debugging purposes a user may want to keep the simulation running to see what happens and gather more info, but it has to be very clear that they are taking ownership of any poor quality results.

@dkachuma

dkachuma commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

The concept behind this flag is actually quite old. This PR extends its behaviour so that it also applies to time stepping, effectively disabling the maxSubStep limit when the flag is active. Maybe this can be done differently.

That said, my assumption was that the broader discussion around result validity and the handling or interpretation of tolerance violations had already taken place when the flag was originally introduced at #278.

@MelReyCG

MelReyCG commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the feedback.

On naming/output changes:
Modifying XML/input/output filenames (e.g., _UNRELIABLE suffix or DANGER_ZONE prefix) seems to break consistency with other non-physical options like allowNegativePressure.

About renaming the output file, when?

  • at initialization, we risk disrupting workflows, and we don't know if the simulation had effectively encounter problematic timesteps, but that may be what we want. Does it applies only to the HDF5s? The output folder?
  • when non-convergence occurs, renaming the files / output folder seems tricky, and we cannot wait the end of the simulation because if a crash occurs, the file is not marked as _UNRELIABLE.

I propose to:

  • add a clear warning at startup (as early as possible) when allowNonConverged is enabled.
  • Include a prominent summary table at the end of the simulation, something like this:
  ----------------------------------------------------------------------------------------------------
  |                                   2 NON-CONVERGED TIMESTEPS                                      |
  |--------------------------------------------------------------------------------------------------|
  |                      ERROR: Simulation ended with non-converged timesteps.                       |
  |                Results contain NON-PHYSICAL values and are likely invalid for use.               |
  |   The simulation has not been aborted because 'allowNonConverged' was active on a solver.        |
  |--------------------------------------------------------------------------------------------------|
  |                 Cycle   |                 Time (s)       |                   Residual            |
  |-------------------------|--------------------------------|---------------------------------------|
  |                     18  |                     1.800e+06  |                             6.23e-02  |
  |                     19  |                     1.900e+06  |                             8.34e-05  |
  ----------------------------------------------------------------------------------------------------

(we could add the failing solver name too)

We may also want to add a column in the solver CSV.

On consolidation: I don't see other places where this flag might interact. To me, it seem currently limited to the solver-level logic in PhysicsSolverBase.

After that PR, do we do the same for allowNegative* attibutes?

@joshua-white

Copy link
Copy Markdown
Contributor

Okay, this seems like a reasonable compromise to me, as the final table provides clear and useful info. The name itself should be a red flag for anyone competent (who puts allowNonConverged to true and expects good things to happen?). We could revisit a better handling of this in a follow up PR. Let's just not advertise this flag widely or push XMLs into the repo with it activated, in case someone pulls them as a template?

I am less familiar with that part of the code (so don't trust me on this) but I think allowNegativePressure may be a different beast. Analytical hydrofracture solutions allow for large negative pressures as the tip, even if in reality the liquid would cavitate. @rrsettgast would know better here though.

@joshua-white

Copy link
Copy Markdown
Contributor

I just did a search for allowNonConverged and there is another block in CoupledSolver and then a few smoke tests have it activated (including recent ALM tests)

@MelReyCG

MelReyCG commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Indeed, CoupledSover do use allowNonConverged, but seem to behave as expected (warning instead of crashing). The problem was in an higher layer than in the solvers *Step functions, the execute() was doing a final check after substeps reached the deepest allowed level, and this check was ignoring allowNonConverged.

Do we want to exclude or at least comment these usages of allowNonConverged in the smoke tests? There is currently not even a note on them.

@jhuang2601 jhuang2601 requested a review from bd713 as a code owner June 12, 2026 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci: run CUDA builds Allows to triggers (costly) CUDA jobs ci: run integrated tests Allows to run the integrated tests in GEOS CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants