Schema: per-taxon abundances + reusable CommonTaxon gene records #161
Merged
Two schema-first additions (datamodel regenerated via `just gen-python`):
1. Separate, optional abundance fields on TaxonomicComposition:
- absolute_abundance (float) + absolute_abundance_unit
- relative_abundance (float, >=0) + relative_abundance_unit
Both independent and optional; the legacy free-text abundance_value is kept
for back-compat and marked deprecated.
2. Reusable per-taxon gene records (new classes CommonTaxon / GenomeRecord /
GeneAnnotation), stored as standalone YAML under kb/taxa/ and referenced from
taxonomy[].common_taxon, so a curated genome + gene set for an organism can be
shared across community records instead of duplicated.
- taxa: NCBITaxon (TaxonDescriptor, id-label gated)
- genome: NCBI Assembly accession (GCF_/GCA_, pattern-validated)
- gene: standardized gene_id CURIE (NCBIGene / UniProtKB / KEGG) + optional
locus_tag, kegg_ortholog (KO), GO function terms, supported roles, and
evidence. Records link to communities via a dedicated id
(CommunityMech:taxon:NNNNNN).
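The ID conventions listed above can be sketched as plain regular-expression checks. This is only an illustration, not the schema's actual validation: the regexes, the `check_record_ids` helper, and the `genomes`, `genes`, and `assembly_accession` keys are assumptions made for the example, and the sample identifiers are placeholders rather than real accessions.

```python
import re

# Assumed patterns; the schema's real regexes are not shown in this PR text.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")            # GCF_/GCA_ + 9 digits + .version
GENE_CURIE_RE = re.compile(r"^(NCBIGene|UniProtKB|KEGG):\S+$")
RECORD_ID_RE = re.compile(r"^CommunityMech:taxon:\d{6}$")   # six digits assumed from NNNNNN


def check_record_ids(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means every id looks well-formed."""
    problems = []
    if not RECORD_ID_RE.match(record.get("id", "")):
        problems.append(f"bad record id: {record.get('id')!r}")
    for genome in record.get("genomes", []):      # hypothetical key name
        accession = genome.get("assembly_accession", "")
        if not ASSEMBLY_RE.match(accession):
            problems.append(f"bad assembly accession: {accession!r}")
    for gene in record.get("genes", []):          # hypothetical key name
        gene_id = gene.get("gene_id", "")
        if not GENE_CURIE_RE.match(gene_id):
            problems.append(f"bad gene_id CURIE: {gene_id!r}")
    return problems


# Placeholder values, chosen only to exercise the patterns.
good = {
    "id": "CommunityMech:taxon:000001",
    "genomes": [{"assembly_accession": "GCF_000000000.1"}],
    "genes": [{"gene_id": "NCBIGene:0000000"}],
}
bad = {
    "id": "taxon:1",
    "genomes": [{"assembly_accession": "GCX_123"}],
    "genes": [{"gene_id": "EntrezGene:42"}],
}
print(check_record_ids(good))
print(check_record_ids(bad))
```

In the repository itself these constraints are enforced by the schema, so a check like this would only be useful as a quick pre-flight before running the validators.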
Wiring + examples:
- `just validate-taxa` validates kb/taxa/ against CommonTaxon (--target-class).
- conf/id_label_targets.yaml: new `taxa_yaml` target so NCBITaxon/GO terms in
kb/taxa/ are canonical-checked by the blocking validate-products gate.
- kb/taxa/: two example records (Shewanella oneidensis MR-1 Mtr pathway;
Geobacter sulfurreducens conductive-pili/cytochromes) — OAK-verified terms,
real assemblies/locus tags; gene EvidenceItems omitted rather than fabricated.
- Demonstrated common_taxon links in the Shewanella-Geobacter exoelectrogenic
community.
- tests/test_taxon_records.py (5 tests): records validate, schema slots/classes
present, community linkage, separate-abundance smoke test.
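The separate-abundance behaviour exercised by the smoke test can be sketched with a standard-library dataclass. This is a hypothetical stand-in, not the generated datamodel: the `TaxonAbundance` class name and the `__post_init__` check are assumptions, while the field names, the `>= 0` constraint, and the example units come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonAbundance:
    """Hypothetical stand-in for the abundance slots on TaxonomicComposition."""

    absolute_abundance: Optional[float] = None
    absolute_abundance_unit: Optional[str] = None
    relative_abundance: Optional[float] = None
    relative_abundance_unit: Optional[str] = None
    abundance_value: Optional[str] = None  # legacy free text, deprecated

    def __post_init__(self) -> None:
        # Only relative_abundance carries a lower bound in the PR description.
        if self.relative_abundance is not None and self.relative_abundance < 0:
            raise ValueError("relative_abundance must be >= 0")


# The two measurements are independent: either one can be set without the other.
only_relative = TaxonAbundance(relative_abundance=0.35, relative_abundance_unit="fraction")
only_absolute = TaxonAbundance(absolute_abundance=2.0e6, absolute_abundance_unit="cells/mL")
```

Keeping the two values in separate fields means a record can report a sequencing-derived fraction without implying anything about cell counts, and vice versa.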
Verified: `just validate-taxa`, validate-strict, `just validate-products`
(5505 OK_CANONICAL, 0 errors), `just lint`, and full pytest (193 passed) all pass.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Schema-first expansion (two requested capabilities).

1. Separate community-member abundances

`TaxonomicComposition` gains four optional, independent fields:
- `absolute_abundance` (float) + `absolute_abundance_unit` (e.g. cells/mL, reads)
- `relative_abundance` (float ≥ 0) + `relative_abundance_unit` (default: fraction 0–1)

The legacy free-text `abundance_value` is kept for back-compat and marked deprecated (it was being misused for role text).

2. Reusable per-taxon gene records (`kb/taxa/`)

New classes CommonTaxon / GenomeRecord / GeneAnnotation. Each `kb/taxa/` record holds an NCBITaxon-grounded organism + reference genome(s) + the genes that support its community role/interactions, and is referenced from `taxonomy[].common_taxon` so the curated gene set is shared across communities instead of duplicated.

Standardized IDs (per your choices):
- Taxa: NCBITaxon
- Genome: NCBI Assembly accession, `GCF_`/`GCA_` (pattern-validated)
- Gene: `gene_id` CURIE accepting NCBIGene / UniProtKB / KEGG, plus optional `locus_tag`, `kegg_ortholog` (KO), GO function `go_terms`, `supports_roles`, `supports_interaction`, evidence
- Record id: `CommunityMech:taxon:NNNNNN`

Wiring + examples
- `just validate-taxa` (linkml-validate `--target-class CommonTaxon`)
- `conf/id_label_targets.yaml`: new `taxa_yaml` target → NCBITaxon/GO terms in `kb/taxa/` are canonical-checked by the blocking `validate-products` gate
- `kb/taxa/`: two example records (Shewanella oneidensis MR-1 Mtr pathway; Geobacter sulfurreducens conductive-pili/cytochromes); gene EvidenceItems omitted rather than fabricated
- `common_taxon` links demonstrated in the Shewanella–Geobacter exoelectrogenic community
- `tests/test_taxon_records.py` (5 tests)

Verification
- `just gen-python` (datamodel regenerated)
- `just validate-taxa`, `just validate-all`/strict, `just validate-products` (5505 OK_CANONICAL, 0 errors)
- `just lint` (black/ruff/mypy) clean; 193 pytest passed

🤖 Generated with Claude Code