Schema: per-taxon abundances + reusable CommonTaxon gene records #161

Merged
realmarcin merged 1 commit into main from feat/taxon-abundance-and-gene-records on Jun 18, 2026

@realmarcin

Contributor

Schema-first expansion (two requested capabilities).

1. Separate community-member abundances

TaxonomicComposition gains four optional, independent fields:

  • absolute_abundance (float) + absolute_abundance_unit (e.g. cells/mL, reads)
  • relative_abundance (float ≥ 0) + relative_abundance_unit (default: fraction 0–1)

The legacy free-text abundance_value is kept for back-compat and marked deprecated (it was being misused for role text).
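
As a rough illustration of how the four new fields sit next to the deprecated one, here is a minimal Python sketch. It is not the generated datamodel: the dataclass, the `taxon_label` field, the `DEFAULT_RELATIVE_UNIT` constant, and the `effective_relative_unit` helper are all hypothetical, and the numbers are made up. Only the field names, the float types, and the `relative_abundance >= 0` bound come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the "fraction 0-1" default is modelled here as a plain string.
DEFAULT_RELATIVE_UNIT = "fraction"


@dataclass
class TaxonomicComposition:
    """Hypothetical stand-in for the generated datamodel class."""

    taxon_label: str  # illustrative only; not a slot named in this PR
    absolute_abundance: Optional[float] = None
    absolute_abundance_unit: Optional[str] = None  # e.g. "cells/mL", "reads"
    relative_abundance: Optional[float] = None
    relative_abundance_unit: Optional[str] = None
    abundance_value: Optional[str] = None  # deprecated legacy free text

    def __post_init__(self) -> None:
        # Only the lower bound is stated in the PR, so only that is checked.
        if self.relative_abundance is not None and self.relative_abundance < 0:
            raise ValueError("relative_abundance must be >= 0")


def effective_relative_unit(entry: TaxonomicComposition) -> str:
    """Return the unit to assume when relative_abundance_unit is unset."""
    return entry.relative_abundance_unit or DEFAULT_RELATIVE_UNIT


# Both measurements can be recorded together, or either one alone.
both = TaxonomicComposition(
    "Shewanella oneidensis MR-1",
    absolute_abundance=1.0e8,  # made-up value
    absolute_abundance_unit="cells/mL",
    relative_abundance=0.6,  # made-up value
)
relative_only = TaxonomicComposition("Geobacter sulfurreducens", relative_abundance=0.4)
```

Keeping each value and its unit as separate optional fields means a record with only read counts, or only a fraction, stays valid without placeholder values.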

2. Reusable per-taxon gene records (kb/taxa/)

New classes CommonTaxon / GenomeRecord / GeneAnnotation. Each kb/taxa/ record holds an NCBITaxon-grounded organism + reference genome(s) + the genes that support its community role/interactions, and is referenced from taxonomy[].common_taxon so the curated gene set is shared across communities instead of duplicated.

Standardized IDs (per your choices):

  • taxa → NCBITaxon (id-label gated)
  • genome → NCBI Assembly GCF_/GCA_ (pattern-validated)
  • gene → gene_id CURIE accepting NCBIGene / UniProtKB / KEGG, plus optional locus_tag, kegg_ortholog (KO), GO function terms (go_terms), supports_roles, supports_interaction, evidence
  • link → dedicated record id CommunityMech:taxon:NNNNNN
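
The ID conventions above could be checked with something like the following sketch. The regular expressions are assumptions, since the schema's actual patterns are not shown in this PR: the assembly pattern follows the usual GCF_/GCA_ accession shape, the record-id pattern reads NNNNNN as six digits, and the gene pattern only checks the CURIE prefix. `check_ids` is a hypothetical helper, and all identifiers in the usage lines are placeholders, not real accessions.

```python
import re

# Assumed patterns; the schema's real expressions may differ.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")
RECORD_ID_RE = re.compile(r"^CommunityMech:taxon:\d{6}$")
GENE_ID_RE = re.compile(r"^(NCBIGene|UniProtKB|KEGG):\S+$")


def check_ids(record_id: str, assembly: str, gene_ids: list) -> list:
    """Return a list of human-readable problems; empty means all IDs look fine."""
    problems = []
    if not RECORD_ID_RE.match(record_id):
        problems.append(f"bad record id: {record_id}")
    if not ASSEMBLY_RE.match(assembly):
        problems.append(f"bad assembly accession: {assembly}")
    for gene_id in gene_ids:
        if not GENE_ID_RE.match(gene_id):
            problems.append(f"bad gene_id CURIE: {gene_id}")
    return problems


# Placeholder identifiers that only exercise the shapes.
ok = check_ids(
    "CommunityMech:taxon:000001",
    "GCF_000000000.1",
    ["NCBIGene:0000000", "UniProtKB:P00000"],
)
bad = check_ids("CommunityMech:taxon:1", "GCB_000000000.1", ["SO_0000"])
```

In this sketch a bare locus tag such as `SO_0000` is rejected as a gene_id because it has no CURIE prefix; it would go in the optional locus_tag field instead.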

Wiring + examples

  • just validate-taxa (linkml-validate --target-class CommonTaxon)
  • conf/id_label_targets.yaml: new taxa_yaml target → NCBITaxon/GO terms in kb/taxa/ are canonical-checked by the blocking validate-products gate
  • 2 example records: S. oneidensis MR-1 (Mtr pathway) and G. sulfurreducens (conductive pili + multiheme cytochromes) — OAK-verified terms, real assemblies/locus tags; gene EvidenceItems omitted rather than fabricated
  • common_taxon links demonstrated in the Shewanella–Geobacter exoelectrogenic community
  • tests/test_taxon_records.py (5 tests)
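
For orientation, a community-linkage check could look roughly like the sketch below. It is not the code in tests/test_taxon_records.py. It assumes that records are plain dicts as loaded from YAML, that `common_taxon` holds the record id string, and that kb/taxa/ records have already been indexed by id. `dangling_common_taxon_links` and all ids shown are hypothetical.

```python
def dangling_common_taxon_links(community: dict, taxa_by_id: dict) -> list:
    """Return common_taxon ids used by the community that have no kb/taxa/ record."""
    missing = []
    for entry in community.get("taxonomy", []):
        ref = entry.get("common_taxon")  # optional; unlinked entries are skipped
        if ref is not None and ref not in taxa_by_id:
            missing.append(ref)
    return missing


# Placeholder data: one resolvable link, one dangling link, one unlinked entry.
taxa_by_id = {"CommunityMech:taxon:000001": {"id": "CommunityMech:taxon:000001"}}
community = {
    "taxonomy": [
        {"common_taxon": "CommunityMech:taxon:000001"},
        {"common_taxon": "CommunityMech:taxon:000099"},
        {"taxon_label": "entry without a shared record"},
    ]
}
missing = dangling_common_taxon_links(community, taxa_by_id)
```

A check like this catches references to records that were renamed or never added, which per-file schema validation would not flag.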

Verification

  • just gen-python (datamodel regenerated)
  • just validate-taxa, just validate-all/strict, just validate-products (5505 OK_CANONICAL, 0 errors)
  • just lint (black/ruff/mypy) clean; 193 pytest passed

🤖 Generated with Claude Code

Two schema-first additions (datamodel regenerated via `just gen-python`):

1. Separate, optional abundance fields on TaxonomicComposition:
   - absolute_abundance (float) + absolute_abundance_unit
   - relative_abundance (float, >=0) + relative_abundance_unit
   Both independent and optional; the legacy free-text abundance_value is kept
   for back-compat and marked deprecated.

2. Reusable per-taxon gene records (new classes CommonTaxon / GenomeRecord /
   GeneAnnotation), stored as standalone YAML under kb/taxa/ and referenced from
   taxonomy[].common_taxon, so a curated genome + gene set for an organism can be
   shared across community records instead of duplicated.
   - taxa: NCBITaxon (TaxonDescriptor, id-label gated)
   - genome: NCBI Assembly accession (GCF_/GCA_, pattern-validated)
   - gene: standardized gene_id CURIE (NCBIGene / UniProtKB / KEGG) + optional
     locus_tag, kegg_ortholog (KO), GO function terms, supported roles, and
     evidence. Records link to communities via a dedicated id
     (CommunityMech:taxon:NNNNNN).

Wiring + examples:
- `just validate-taxa` validates kb/taxa/ against CommonTaxon (--target-class).
- conf/id_label_targets.yaml: new `taxa_yaml` target so NCBITaxon/GO terms in
  kb/taxa/ are canonical-checked by the blocking validate-products gate.
- kb/taxa/: two example records (Shewanella oneidensis MR-1 Mtr pathway;
  Geobacter sulfurreducens conductive-pili/cytochromes) — OAK-verified terms,
  real assemblies/locus tags; gene EvidenceItems omitted rather than fabricated.
- Demonstrated common_taxon links in the Shewanella-Geobacter exoelectrogenic
  community.
- tests/test_taxon_records.py (5 tests): records validate, schema slots/classes
  present, community linkage, separate-abundance smoke test.

Verified: `just validate-taxa`, `just validate-strict`, `just validate-products`
(5505 OK_CANONICAL, 0 errors), `just lint`, and full pytest (193 passed) all pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin realmarcin merged commit 862cfe1 into main Jun 18, 2026
4 checks passed
@realmarcin realmarcin deleted the feat/taxon-abundance-and-gene-records branch June 18, 2026 03:22