Backfill related_ingredients via NCBI BioC full-text re-fetch (hCom2, Cheese_Rind) (#30)#158

Merged
realmarcin merged 1 commit into main from backfill/related-ingredients-bioc-refetch
Jun 18, 2026
@realmarcin

Option 1 of the plan: retry, via the NCBI BioC PMC API, the 4 papers for which Europe PMC OA returned 404. Full text was recovered for 2 of the 4, folded into the canonical PMID_<id>.md caches, and backfilled, yielding 16 new CHEBI-grounded ingredients.

| File | New | From full text |
|---|---|---|
| hCom2_Complex_Gut_Microbiome | 10 | arginine deiminase pathway (arginine→ornithine+CO2+ammonium), AA-utilization set (methionine/histidine/isoleucine/valine/tyrosine), bile acid |
| Cheese_Rind | 6 | BCAA degradation (valine/leucine/isoleucine), cysteine/methionine metabolism, methanethiol |

The same strict protocol was kept under hand-authoring (the subagent API was unstable). Excluded: Sigma reagent-catalog hits (acetate/formate as "Sodium acetate Sigma…"), a bibliography reference title containing "butyrate", and generic class terms.

Still unrecoverable (not in BioC PMC either): Maize_Root (PMC5373366), Infant_Gut_Phage (PMC11156429).
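
For readers who want to reproduce the re-fetch, here is a minimal sketch of a BioC PMC lookup. It is not the script used in this PR. The URL path suffix is the one named in the commit message; the host prefix, the `unicode` encoding segment, and the function names `bioc_url`, `passages_text`, and `fetch_full_text` are assumptions made for illustration. The parser accepts either a single BioC collection or a list of collections, since the response shape is also assumed here.

```python
import json
import urllib.request

# Assumed base URL; the path suffix matches the one in the commit message.
BIOC_BASE = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json"


def bioc_url(article_id, encoding="unicode"):
    """Build the BioC JSON URL for a PMID or PMCID (hypothetical helper)."""
    return f"{BIOC_BASE}/{article_id}/{encoding}"


def passages_text(payload):
    """Join the non-empty passage texts of a parsed BioC JSON payload."""
    # Accept either one collection (dict) or a list of collections.
    collections = payload if isinstance(payload, list) else [payload]
    parts = []
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = passage.get("text", "").strip()
                if text:
                    parts.append(text)
    return "\n\n".join(parts)


def fetch_full_text(article_id, timeout=30.0):
    """Return the article's full text, or None if it cannot be recovered."""
    try:
        with urllib.request.urlopen(bioc_url(article_id), timeout=timeout) as resp:
            raw = resp.read().decode("utf-8")
        # A non-JSON body (for example a plain-text error) raises ValueError.
        return passages_text(json.loads(raw)) or None
    except (OSError, ValueError):
        # OSError covers HTTP errors, network failures, and timeouts.
        return None
```

Under these assumptions, `fetch_full_text("PMC5373366")` would return `None` for a paper that BioC PMC does not serve, which is the "still unrecoverable" case listed above.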

Verification

  • 16/16 labels OAK-canonical; 16/16 snippets exact substrings of the enriched caches
  • linkml-validate both → exit 0
  • just validate-products (blocking gate) → exit 0 (5497 OK_CANONICAL)
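
The "exact substring" check in the list above can be sketched as follows. This is an illustrative stand-in, not the repository's validator: the function names `missing_snippets` and `check_cache` are hypothetical, and the sketch assumes each cache is a UTF-8 text file and that matching is case-sensitive with no whitespace normalization.

```python
from __future__ import annotations

from pathlib import Path


def missing_snippets(cache_text: str, snippets: list[str]) -> list[str]:
    """Return the snippets that are NOT exact substrings of the cache text."""
    return [snippet for snippet in snippets if snippet not in cache_text]


def check_cache(cache_path: Path, snippets: list[str]) -> list[str]:
    """Read a cache file (assumed UTF-8) and report snippets it does not contain."""
    return missing_snippets(cache_path.read_text(encoding="utf-8"), snippets)
```

An empty result means every snippet is verbatim in the cache, which corresponds to the 16/16 figure reported here. Keeping the comparison strict, with no case folding or whitespace cleanup, is what makes a paraphrased or re-wrapped snippet fail.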

Adoption: 215 → 217 / 265 (hCom2 + Cheese_Rind newly populated).

🤖 Generated with Claude Code

… Cheese_Rind) (#30)

Option-1 follow-up: the 4 papers that Europe PMC OA couldn't serve were retried
via the NCBI BioC PMC API (bionlp/RESTful/pmcoa.cgi/BioC_json). 2 of 4 recovered;
folded their full text into the canonical PMID_<id>.md caches and backfilled.
16 new CHEBI-grounded ingredients (hand-authored — the subagent API was
unstable; same strict protocol, all snippets verbatim from the enriched caches):

- hCom2_Complex_Gut_Microbiome (PMID_36070752.md created from full text): 10 —
  arginine deiminase pathway (arginine→ornithine + CO2 + ammonium) + community
  amino-acid utilization set (methionine, histidine, isoleucine, valine,
  tyrosine) + bile acid. Reagent-catalog hits (Sigma acetate/formate) and a
  reference-title "butyrate" correctly excluded.
- Cheese_Rind (PMID_25036636.md enriched): 6 — branched-chain AA degradation
  (valine, leucine, isoleucine), cysteine/methionine metabolism, and methanethiol
  (the volatile sulfur aroma product).

Still unrecoverable (not in BioC PMC either): Maize_Root (PMC5373366),
Infant_Gut_Phage (PMC11156429).

Verified: 16/16 labels canonical, 16/16 snippets exact substrings, both pass
linkml-validate, `just validate-products` (blocking gate) exits 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin realmarcin merged commit 9b68f4a into main Jun 18, 2026
3 checks passed
@realmarcin realmarcin deleted the backfill/related-ingredients-bioc-refetch branch June 18, 2026 02:19