Promote per-source thermodynamics to canonical deltag (+ retrained dGPredictor-ModelSEED source) #265

Open
freiburgermsu wants to merge 2 commits into ModelSEED:dev from freiburgermsu:promote-thermodynamics-to-canonical-deltag

@freiburgermsu

Member

Summary

Gives 14,141 reactions the canonical free-energy values (deltag/deltagerr/reversibility) they were missing, purely by re-aggregating estimates that already exist in the additive thermodynamics dict. Also adds the retrained dGPredictor source that supplies some of those estimates.

Background: the promotion gap

After the additive-thermodynamics refactor, the Update_Reaction_*_Energies.py scripts write only into each reaction's additive thermodynamics dict and no longer populate the canonical top-level deltag/deltagerr. As a result, thousands of reactions carried a perfectly good computed energy in thermodynamics while their canonical deltag stayed at the 10000000 sentinel (and reversibility at "?"), so they read as thermodynamically undefined even though the value already existed.
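
A minimal sketch of how the gap could be detected. The record layout here is an assumption for illustration only (a `thermodynamics` dict keyed by source name, each entry holding hypothetical `energy` and `error` fields in kcal/mol); the reaction IDs are made up and the actual database schema may differ.

```python
# Sentinel for "no canonical deltag", as quoted in this PR.
SENTINEL = 10000000


def has_promotion_gap(rxn):
    """True if canonical deltag is missing but a per-source energy is stored.

    The "thermodynamics" layout used here is an assumed shape, not the
    confirmed database schema.
    """
    canonical_missing = rxn.get("deltag") in (None, SENTINEL)
    has_estimate = bool(rxn.get("thermodynamics"))
    return canonical_missing and has_estimate


# Made-up reactions for illustration.
reactions = [
    {"id": "rxnA", "deltag": SENTINEL, "reversibility": "?",
     "thermodynamics": {"dGPredictor": {"energy": -8.6, "error": 0.04}}},
    {"id": "rxnB", "deltag": -3.2, "reversibility": "=",
     "thermodynamics": {"Group contribution": {"energy": -3.2, "error": 0.5}}},
    {"id": "rxnC", "deltag": SENTINEL, "reversibility": "?",
     "thermodynamics": {}},
]

gap_ids = [r["id"] for r in reactions if has_promotion_gap(r)]
print(gap_ids)
```

Only the first reaction qualifies: the second already has a canonical value, and the third has no stored estimate to promote.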

What this PR does

Two commits:

  1. dGPredictor-ModelSEED source (existing additive work) — the dGPredictor group-contribution model retrained on ModelSEED structures, recorded as its own per-method entry in thermodynamics. Additive; supplies energies for reactions the other sources miss.

  2. Promotion — new Scripts/Thermodynamics/Promote_Reaction_Thermodynamics_to_Canonical.py re-aggregates the stored per-source estimates into the canonical fields. Pure re-aggregation: no new estimation, no external dependencies.

    • Only reactions with a missing canonical deltag are touched; existing canonical values are never overwritten.
    • Selection: prefer the mechanistic/measurement-anchored tier (eQuilibrator, then Group contribution) over the ML tier (dGPredictor-ModelSEED, dGPredictor); within the chosen tier take the lowest-uncertainty estimate. The within-tier lowest-error rule keeps a wildly uncertain ML outlier (e.g. -100 ± 71 kcal/mol) from being promoted over a tight estimate (-8.6 ± 0.04 kcal/mol).
    • Guards reject implausible magnitudes (|dG| > 1000 kcal/mol) and useless uncertainties (> 100 kcal/mol), leaving those reactions undefined rather than promoting garbage.
    • reversibility is set to the chosen estimate's own direction operator (same heuristic as Estimate_Reaction_Reversibility.py, already stored with each per-source energy).
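
The selection and guard rules above can be sketched roughly as follows. This is not the script's actual code: the shape of `TIERS`, the `energy`/`error` field names, and the choice to apply the guards before picking a tier are all assumptions made for illustration.

```python
# Hypothetical stand-in for the script's TIERS constant: tiers are tried in
# order, and sources within a tier compete on reported uncertainty.
TIERS = [
    ["eQuilibrator", "Group contribution"],     # mechanistic / measurement-anchored
    ["dGPredictor-ModelSEED", "dGPredictor"],   # ML
]
MAX_ABS_DG = 1000.0  # kcal/mol, magnitude guard quoted in this PR
MAX_ERR = 100.0      # kcal/mol, uncertainty guard quoted in this PR


def choose_estimate(thermo):
    """Return (source, estimate) to promote, or None to leave undefined.

    Assumption: estimates failing a guard are dropped before tier selection.
    """
    for tier in TIERS:
        usable = [
            (source, thermo[source])
            for source in tier
            if source in thermo
            and abs(thermo[source]["energy"]) <= MAX_ABS_DG
            and thermo[source]["error"] <= MAX_ERR
        ]
        if usable:
            # Within the first tier that has a usable estimate, lowest error wins.
            return min(usable, key=lambda pair: pair[1]["error"])
    return None


# The outlier case from the description: both estimates are in the ML tier.
ml_only = {
    "dGPredictor-ModelSEED": {"energy": -100.0, "error": 71.0},
    "dGPredictor": {"energy": -8.6, "error": 0.04},
}
# A mechanistic estimate outranks a tighter ML estimate.
mixed = {
    "Group contribution": {"energy": -5.0, "error": 2.0},
    "dGPredictor": {"energy": -8.6, "error": 0.04},
}
# An implausible magnitude is rejected, so nothing is promoted.
garbage = {"dGPredictor": {"energy": -2500.0, "error": 3.0}}

print(choose_estimate(ml_only)[0])
print(choose_estimate(mixed)[0])
print(choose_estimate(garbage))
```

In this sketch the tier order decides first and uncertainty only breaks ties inside a tier, which is why the mixed case picks Group contribution even though the ML estimate reports a smaller error.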

Result

14,141 reactions promoted: Group contribution 1,474 · dGPredictor 8,635 · dGPredictor-ModelSEED 4,032.

Verified: every promoted deltag equals one of that reaction's stored per-source energies; zero pre-existing canonical values, thermodynamics dicts, or other fields changed. (.tsv files update because deltag/deltagerr/reversibility are TSV columns.)
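
A check of this kind could look like the sketch below, which compares reaction records before and after promotion. The dict layout and field names are assumed for illustration, and the function is hypothetical rather than taken from the PR.

```python
SENTINEL = 10000000  # "no canonical deltag" sentinel quoted in this PR


def check_promotion(before, after):
    """Return a list of (reaction id, problem) pairs; empty means all checks pass.

    Both arguments map reaction id -> record, in an assumed layout.
    """
    problems = []
    for rxn_id, old in before.items():
        new = after[rxn_id]
        if new["thermodynamics"] != old["thermodynamics"]:
            problems.append((rxn_id, "thermodynamics dict changed"))
        if old["deltag"] != SENTINEL:
            # Pre-existing canonical values must survive untouched.
            if new["deltag"] != old["deltag"]:
                problems.append((rxn_id, "existing canonical value overwritten"))
        elif new["deltag"] != SENTINEL:
            # A promoted value must equal one of the stored per-source energies.
            stored = [est["energy"] for est in new["thermodynamics"].values()]
            if new["deltag"] not in stored:
                problems.append((rxn_id, "promoted value matches no stored estimate"))
    return problems


# Made-up records for illustration.
before = {
    "rxnA": {"deltag": SENTINEL,
             "thermodynamics": {"dGPredictor": {"energy": -8.6, "error": 0.04}}},
    "rxnB": {"deltag": -3.2,
             "thermodynamics": {"Group contribution": {"energy": -3.2, "error": 0.5}}},
}
good = {
    "rxnA": {"deltag": -8.6, "thermodynamics": before["rxnA"]["thermodynamics"]},
    "rxnB": {"deltag": -3.2, "thermodynamics": before["rxnB"]["thermodynamics"]},
}
bad = {
    "rxnA": {"deltag": -9.0, "thermodynamics": before["rxnA"]["thermodynamics"]},
    "rxnB": {"deltag": -3.2, "thermodynamics": before["rxnB"]["thermodynamics"]},
}

print(check_promotion(before, good))
print(check_promotion(before, bad))
```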

The source-precedence policy is a single editable TIERS constant; ~732 reactions have >50 kcal/mol cross-source disagreement and are worth a curator spot-check (the policy resolves them by tier + lowest error, not averaging).
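
For the curator spot-check, one way to list candidate reactions is sketched below. It assumes "disagreement" means the spread between the highest and lowest stored energy for a reaction, which may not be how the figure above was computed; the data layout and reaction IDs are likewise illustrative.

```python
DISAGREEMENT_THRESHOLD = 50.0  # kcal/mol, threshold quoted in this PR


def energy_spread(thermo):
    """Max minus min stored energy across sources (0.0 with fewer than two).

    Using max-min spread as the disagreement measure is an assumption.
    """
    energies = [est["energy"] for est in thermo.values()]
    return max(energies) - min(energies) if len(energies) > 1 else 0.0


# Made-up per-reaction thermodynamics dicts.
reactions = {
    "rxnA": {"dGPredictor-ModelSEED": {"energy": -100.0, "error": 71.0},
             "dGPredictor": {"energy": -8.6, "error": 0.04}},
    "rxnB": {"Group contribution": {"energy": -3.2, "error": 0.5},
             "dGPredictor": {"energy": -4.0, "error": 0.3}},
    "rxnC": {"dGPredictor": {"energy": -12.0, "error": 1.0}},
}

flagged = sorted(
    rxn_id for rxn_id, thermo in reactions.items()
    if energy_spread(thermo) > DISAGREEMENT_THRESHOLD
)
print(flagged)
```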

🤖 Generated with Claude Code

freiburgermsu and others added 2 commits June 10, 2026 16:25
…urce

Records the dGPredictor group-contribution model retrained on the ModelSEED
compound structures as its own per-method entry, "dGPredictor-ModelSEED", in
each reaction's `thermodynamics` dict. Purely additive: it sits next to the
Group contribution / eQuilibrator / (original KEGG-based) dGPredictor records,
and the original "dGPredictor" entry is left untouched. The canonical
deltag / deltagerr / reversibility are not changed, and no .tsv or compound
files change.

- New staged predictions: Biochemistry/Thermodynamics/dGPredictor/
  modelseed_retrained_dG.json (31,924 reactions, kJ/mol).
- New writer: Scripts/Thermodynamics/Update_Reaction_dGPredictor_ModelSEED_
  Energies.py (kJ->kcal /4.184; operator via reversibility_from_energy).
- 31,924 reactions gain a dGPredictor-ModelSEED record (incl. ~11,400 the
  original KEGG-based dGPredictor could not reach); 24,088 reactions unchanged.
- Verified: every modified reaction differs from dev ONLY by the added
  dGPredictor-ModelSEED key; added values equal dG_mean/4.184; the writer is
  idempotent.
- Docs: sources.yaml, Scripts/Thermodynamics/README.md, Rerun_Thermodynamics.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…141 reactions)

After the additive-thermodynamics refactor, the Update_Reaction_*_Energies.py
scripts write only into each reaction's additive `thermodynamics` dict and no
longer populate the canonical top-level deltag/deltagerr. As a result 14,141
non-EMPTY reactions carried a perfectly good computed energy in `thermodynamics`
while their canonical deltag stayed the 10000000 sentinel (reversibility "?"),
so they read as thermodynamically undefined despite the value already existing.

New Scripts/Thermodynamics/Promote_Reaction_Thermodynamics_to_Canonical.py
re-aggregates those existing per-source estimates into the canonical fields. It
is pure re-aggregation -- no new estimation, no external dependencies:
- Only reactions whose canonical deltag is missing are touched; existing
  canonical values are never overwritten.
- Selection: prefer the mechanistic/measurement-anchored tier (eQuilibrator,
  then Group contribution) over the ML tier (dGPredictor-ModelSEED, dGPredictor);
  WITHIN the chosen tier take the lowest-uncertainty estimate. The within-tier
  lowest-error rule prevents a wildly-uncertain ML outlier (e.g. -100 +/- 71
  kcal/mol) from being promoted over a tight estimate (-8.6 +/- 0.04).
- Guards reject implausible magnitudes (|dG| > 1000 kcal/mol) and useless
  uncertainties (> 100 kcal/mol), leaving those reactions undefined rather than
  promoting garbage.
- deltagerr is set from the chosen source and reversibility is set to that
  estimate's own direction operator (same heuristic as Estimate_Reaction_
  Reversibility, already stored alongside each per-source energy).

Promoted 14,141 reactions: Group contribution 1,474; dGPredictor 8,635;
dGPredictor-ModelSEED 4,032. Verified: every promoted deltag equals one of the
reaction's stored per-source energies; zero pre-existing canonical values,
thermodynamics dicts, or other fields changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>