Promote per-source thermodynamics to canonical deltag (+ retrained dGPredictor-ModelSEED source)#265
Open
freiburgermsu wants to merge 2 commits into
…urce

Records the dGPredictor group-contribution model retrained on the ModelSEED compound structures as its own per-method entry, "dGPredictor-ModelSEED", in each reaction's `thermodynamics` dict. Purely additive: it sits next to the Group contribution / eQuilibrator / (original KEGG-based) dGPredictor records, and the original "dGPredictor" entry is left untouched. The canonical deltag / deltagerr / reversibility are not changed, and no .tsv or compound files change.

- New staged predictions: Biochemistry/Thermodynamics/dGPredictor/modelseed_retrained_dG.json (31,924 reactions, kJ/mol).
- New writer: Scripts/Thermodynamics/Update_Reaction_dGPredictor_ModelSEED_Energies.py (kJ->kcal by dividing by 4.184; operator via reversibility_from_energy).
- 31,924 reactions gain a dGPredictor-ModelSEED record (incl. ~11,400 the original KEGG-based dGPredictor could not reach); 24,088 reactions unchanged.
- Verified: every modified reaction differs from dev ONLY by the added dGPredictor-ModelSEED key; added values equal dG_mean/4.184; the writer is idempotent.
- Docs: sources.yaml, Scripts/Thermodynamics/README.md, Rerun_Thermodynamics.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
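A minimal sketch of the additive write described in this commit, to make the unit conversion and the idempotency claim concrete. The record shape (`deltag`/`deltagerr` keys), the `dg_std_kj` argument, and the function name `add_retrained_record` are assumptions for illustration, not the actual writer script; the direction operator computed via `reversibility_from_energy` is deliberately left out because its signature is not shown here.

```python
KJ_PER_KCAL = 4.184
SOURCE = "dGPredictor-ModelSEED"


def add_retrained_record(reaction, dg_mean_kj, dg_std_kj):
    """Attach the retrained estimate (converted to kcal/mol) under SOURCE.

    Returns True if the reaction was modified, False if the identical
    record was already present (so a second run is a no-op).
    Hypothetical record layout: {"deltag": ..., "deltagerr": ...}.
    """
    record = {
        "deltag": dg_mean_kj / KJ_PER_KCAL,
        "deltagerr": dg_std_kj / KJ_PER_KCAL,
    }
    thermo = reaction.setdefault("thermodynamics", {})
    if thermo.get(SOURCE) == record:
        return False
    # Only the SOURCE key is written; other per-source records and the
    # canonical top-level fields are left untouched.
    thermo[SOURCE] = record
    return True
```

Because only one key of the `thermodynamics` dict is written, re-running the sketch on an already-updated reaction changes nothing, which is the property the commit message reports for the real writer.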
…141 reactions)

After the additive-thermodynamics refactor, the Update_Reaction_*_Energies.py scripts write only into each reaction's additive `thermodynamics` dict and no longer populate the canonical top-level deltag/deltagerr. As a result 14,141 non-EMPTY reactions carried a perfectly good computed energy in `thermodynamics` while their canonical deltag stayed the 10000000 sentinel (reversibility "?"), so they read as thermodynamically undefined despite the value already existing.

New Scripts/Thermodynamics/Promote_Reaction_Thermodynamics_to_Canonical.py re-aggregates those existing per-source estimates into the canonical fields. It is pure re-aggregation -- no new estimation, no external dependencies:

- Only reactions whose canonical deltag is missing are touched; existing canonical values are never overwritten.
- Selection: prefer the mechanistic/measurement-anchored tier (eQuilibrator, then Group contribution) over the ML tier (dGPredictor-ModelSEED, dGPredictor); WITHIN the chosen tier take the lowest-uncertainty estimate. The within-tier lowest-error rule prevents a wildly-uncertain ML outlier (e.g. -100 +/- 71 kcal/mol) from being promoted over a tight estimate (-8.6 +/- 0.04).
- Guards reject implausible magnitudes (|dG| > 1000 kcal/mol) and useless uncertainties (> 100 kcal/mol), leaving those reactions undefined rather than promoting garbage.
- deltagerr is set from the chosen source and reversibility is set to that estimate's own direction operator (same heuristic as Estimate_Reaction_Reversibility, already stored alongside each per-source energy).

Promoted 14,141 reactions: Group contribution 1,474; dGPredictor 8,635; dGPredictor-ModelSEED 4,032.

Verified: every promoted deltag equals one of the reaction's stored per-source energies; zero pre-existing canonical values, thermodynamics dicts, or other fields changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
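The selection rules above can be sketched as follows. This is an illustration under stated assumptions, not the contents of the promotion script: the per-source record layout (`deltag`, `deltagerr`, `reversibility` keys), the function names, applying the guards before the tier choice, and breaking exact ties by listed order are all assumptions. Only the `TIERS` name, the tier membership, the sentinel, and the guard thresholds come from the description.

```python
SENTINEL = 10000000  # canonical "undefined" deltag value
TIERS = [
    ["eQuilibrator", "Group contribution"],     # mechanistic tier
    ["dGPredictor-ModelSEED", "dGPredictor"],   # ML tier
]
MAX_ABS_DG = 1000.0  # kcal/mol, magnitude guard
MAX_ERR = 100.0      # kcal/mol, uncertainty guard


def choose_estimate(thermo):
    """Return (source, record) from the first tier with a usable estimate."""
    for tier in TIERS:
        candidates = []
        for rank, source in enumerate(tier):
            rec = thermo.get(source)
            if rec is None:
                continue
            # Assumption: guards filter candidates before selection.
            if abs(rec["deltag"]) > MAX_ABS_DG or rec["deltagerr"] > MAX_ERR:
                continue
            candidates.append((rec["deltagerr"], rank, source))
        if candidates:
            # Lowest uncertainty wins; listed order breaks exact ties.
            _, _, best = min(candidates)
            return best, thermo[best]
    return None


def promote(reaction):
    """Fill missing canonical fields in place; never overwrite existing ones."""
    if reaction.get("deltag") not in (None, SENTINEL):
        return False
    choice = choose_estimate(reaction.get("thermodynamics", {}))
    if choice is None:
        return False
    _, rec = choice
    reaction["deltag"] = rec["deltag"]
    reaction["deltagerr"] = rec["deltagerr"]
    reaction["reversibility"] = rec["reversibility"]
    return True
```

With the example from the commit message, an ML tier holding `-100 ± 71` and `-8.6 ± 0.04` promotes the `-8.6` estimate, while any usable mechanistic-tier estimate would take precedence over both regardless of its error.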
Summary
Gives 14,141 reactions a canonical free-energy value (`deltag`/`deltagerr`/`reversibility`) that they were missing, purely by re-aggregating estimates that already exist in the additive `thermodynamics` dict, plus the retrained dGPredictor source that supplies some of them.

Background: the promotion gap
After the additive-thermodynamics refactor, the `Update_Reaction_*_Energies.py` scripts write only into each reaction's additive `thermodynamics` dict and no longer populate the canonical top-level `deltag`/`deltagerr`. As a result thousands of reactions carried a perfectly good computed energy in `thermodynamics` while their canonical `deltag` stayed the `10000000` sentinel (and `reversibility` `"?"`), so they read as thermodynamically undefined despite the value already existing.

What this PR does
Two commits:
1. `dGPredictor-ModelSEED` source (existing additive work): the dGPredictor group-contribution model retrained on ModelSEED structures, recorded as its own per-method entry in `thermodynamics`. Additive; supplies energies for reactions the other sources miss.
2. Promotion: new `Scripts/Thermodynamics/Promote_Reaction_Thermodynamics_to_Canonical.py` re-aggregates the stored per-source estimates into the canonical fields. Pure re-aggregation: no new estimation, no external dependencies.
   - Only reactions with a missing canonical `deltag` are touched; existing canonical values are never overwritten.
   - Selection: prefer the mechanistic/measurement-anchored tier (`eQuilibrator` → `Group contribution`) over the ML tier (`dGPredictor-ModelSEED`, `dGPredictor`); within the chosen tier take the lowest-uncertainty estimate. The within-tier lowest-error rule stops a wildly uncertain ML outlier (e.g. `-100 ± 71` kcal/mol) from being promoted over a tight estimate (`-8.6 ± 0.04`).
   - Guards reject implausible magnitudes (`|dG| > 1000` kcal/mol) and useless uncertainties (`> 100` kcal/mol), leaving those reactions undefined rather than promoting garbage.
   - `deltagerr` is set from the chosen source, and `reversibility` is set to the chosen estimate's own direction operator (same heuristic as `Estimate_Reaction_Reversibility.py`, already stored with each per-source energy).

Result
14,141 reactions promoted: Group contribution 1,474 · dGPredictor 8,635 · dGPredictor-ModelSEED 4,032.
Verified: every promoted `deltag` equals one of that reaction's stored per-source energies; zero pre-existing canonical values, `thermodynamics` dicts, or other fields changed. (`.tsv` files update because `deltag`/`deltagerr`/`reversibility` are TSV columns.)

The source-precedence policy is a single editable `TIERS` constant; ~732 reactions have >50 kcal/mol cross-source disagreement and are worth a curator spot-check (the policy resolves them by tier + lowest error, not averaging).

🤖 Generated with Claude Code
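For the curator spot-check mentioned above, one way to list candidate reactions is to measure how far the per-source estimates spread. The PR does not say how "cross-source disagreement" was computed, so the max-minus-min definition below, the record layout (`deltag` key per source), and the function names are assumptions for illustration only.

```python
def cross_source_spread(thermo):
    """Largest gap (kcal/mol) between any two per-source deltag estimates."""
    values = [rec["deltag"] for rec in thermo.values()]
    if len(values) < 2:
        return 0.0  # a single source cannot disagree with itself
    return max(values) - min(values)


def needs_spot_check(reaction, threshold=50.0):
    """True when the per-source estimates differ by more than `threshold`."""
    return cross_source_spread(reaction.get("thermodynamics", {})) > threshold


def spot_check_ids(reactions, threshold=50.0):
    """IDs of reactions a curator may want to inspect, in input order."""
    return [r["id"] for r in reactions if needs_spot_check(r, threshold)]
```

A different disagreement measure (for example, one that weights each estimate by its reported error) would flag a different set, so the count produced by this sketch need not match the ~732 quoted above.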