iSamples export: carry keyword/descriptor concept URIs (Getty AAT, UBERON…) as IdentifiedConcept — export narrows keywords to text, dropping URIs (blocks #256) #263

@rdhyee


Bounded ask (for Eric / OpenContext + whoever owns the export client) to unblock Flavor B concept-URI search (#256). This is not a schema redesign — the iSamples model already supports it; the export currently throws the URIs away.

The gap (verified in code + data)

The iSamples model already defines keyword/descriptor concepts as full IdentifiedConcepts: IdentifiedConcept requires pid, label, scheme_name, scheme_uri (isamplesorg-metadata/src/schemas/isamples_core.yaml:151), and the keywords, has_context_category, has_material_category, and has_sample_object_type slots all point to IdentifiedConcept.

But the export client narrows keywords to text only:

# export_client/isamples_export_client/duckdb_utilities.py:17
keywords: 'STRUCT(keyword VARCHAR)[]',

→ the IdentifiedConcept's pid (the URI), scheme_name, and scheme_uri are discarded; only the label text survives. So in the exported GeoParquet, keywords is STRUCT(keyword VARCHAR)[] (no identifier), and the external descriptor URIs (Getty AAT, UBERON/OBO) appear in no field we consume.

Where it's lost: before or during the export mapping — specifically the keywords coercion above, and/or the upstream source/index not populating the IdentifiedConcept URIs for these annotations. (Not lost in PQG or the frontend — verified absent there too.)

Evidence (export probe + PQG):

  • The 4 example URIs resolve to 0 IdentifiedConcept nodes in PQG; PQG does carry rich concept edges (keywords 15.06M, has_material_category 8.30M, has_sample_object_type 7.65M, has_context_category 6.53M; 55,893 concepts) — just not these external URIs.
  • In the export: category fields carry only iSamples controlled-vocab URIs; keywords is text-only; the 4 URIs appear nowhere (bucchero/tibia/femur/whorls).
  • OpenContext clearly has the annotations — its API resolves obj={URI} (e.g. UBERON tibia; see https://opencontext.org/about/services).

Eric's guide concepts: getty/aat/300387149 (bucchero), UBERON_0000979 (tibia), UBERON_0000981 (femur), getty/aat/300263796 (whorls).

The ask

Regenerate the export so keyword/descriptor concepts are carried as full IdentifiedConcepts, not flattened to label text. Concretely:

  1. Stop narrowing keywords to STRUCT(keyword VARCHAR)[] — carry the IdentifiedConcept fields: pid (the concept URI, e.g. https://vocab.getty.edu/aat/300387149), label, scheme_name, scheme_uri. (Export-convention field names are fine; the point is the URI must survive.)
  2. Ensure the source/index actually populates those concept URIs for OpenContext's external subject/descriptor annotations (the ones already queryable via obj={URI}). If the URIs aren't in what the export reads, that's the deeper fix.
  3. Keep both the existing iSamples controlled-vocab category URIs and the original external annotations — they serve different queries. Don't force external descriptor URIs into the typed has_material_category/has_context_category/has_sample_object_type slots automatically (those are semantically typed top-level categories); put external subject/descriptor concepts in keywords, and only populate a category slot where the mapping genuinely says the concept is that category.
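A minimal sketch of what item 1 could look like, assuming the export keeps a struct-array column for keywords. The widened type string and the helper `concept_to_keyword_struct` are hypothetical, not the export client's actual code; the field names simply follow the IdentifiedConcept slots named above, and the label stays in `keyword` so existing text consumers would keep working.

```python
# Hypothetical widened DuckDB type for the keywords column (illustrative only;
# the real export may pick different field names).
KEYWORDS_TYPE = (
    "STRUCT(keyword VARCHAR, pid VARCHAR, "
    "scheme_name VARCHAR, scheme_uri VARCHAR)[]"
)


def concept_to_keyword_struct(concept: dict) -> dict:
    """Map an IdentifiedConcept-like dict to a keyword struct, keeping the URI.

    Missing optional fields become None instead of being dropped, so every
    struct in the array has the same shape.
    """
    return {
        "keyword": concept.get("label"),
        "pid": concept.get("pid"),  # concept URI, copied verbatim
        "scheme_name": concept.get("scheme_name"),
        "scheme_uri": concept.get("scheme_uri"),
    }


# Toy input using the bucchero URI quoted in item 1.
example = concept_to_keyword_struct(
    {"pid": "https://vocab.getty.edu/aat/300387149", "label": "bucchero"}
)
```

Whether the label field keeps the name `keyword` or becomes `label` is an export-convention choice; the only property this sketch is meant to show is that `pid` survives the mapping.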

Data we need (per concept)

  • Required: concept URI, preferred label, the relation it came in on (keywords vs has_*_category), scheme_name/scheme_uri if known.
  • Required: preserve exact URI strings — incl. http vs https and CURIE-vs-full-URI for OBO (e.g. keep https://purl.obolibrary.org/obo/UBERON_0000979 exactly).
  • Nice-to-have (do NOT block the first regeneration): altLabels, language tags, broader/narrower hierarchy, provenance of the source OpenContext predicate.
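To make the per-concept requirements concrete, here is a rough sketch of a record type that carries the required fields and deliberately does no URI normalization. `ConceptAnnotation`, `make_annotation`, and `ALLOWED_RELATIONS` are hypothetical names for illustration; nothing here is existing export code.

```python
from dataclasses import dataclass
from typing import Optional

# Relations named in this issue; treating this as the full set is an assumption.
ALLOWED_RELATIONS = {
    "keywords",
    "has_context_category",
    "has_material_category",
    "has_sample_object_type",
}


@dataclass(frozen=True)
class ConceptAnnotation:
    concept_uri: str  # exact string from the source, never rewritten
    label: str
    relation: str
    scheme_name: Optional[str] = None
    scheme_uri: Optional[str] = None


def make_annotation(concept_uri, label, relation,
                    scheme_name=None, scheme_uri=None):
    """Validate required fields without touching the URI string."""
    if not concept_uri or not label:
        raise ValueError("concept_uri and label are required")
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unknown relation: {relation!r}")
    # No lowercasing, stripping, http->https rewriting, or CURIE expansion.
    return ConceptAnnotation(concept_uri, label, relation,
                             scheme_name, scheme_uri)


https_form = make_annotation(
    "https://purl.obolibrary.org/obo/UBERON_0000979", "tibia", "keywords")
http_form = make_annotation(
    "http://purl.obolibrary.org/obo/UBERON_0000979", "tibia", "keywords")
# The two variants stay distinct, so exact-URI lookup sees what the source had.
variants_differ = https_form != http_form
```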

Acceptance criteria

  • Exact-URI lookup in the regenerated export (and resulting PQG) returns nonzero linked records for each of the four obj={URI} queries, with counts reconciled against OpenContext's obj= results (not against lexical/free-text matches).
  • (Context only: ~2,693 'bucchero' / ~16,577 'tibia' currently match via free-text in label/description — that is not URI-linked evidence; it just shows the concepts are present in the corpus.)
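One possible shape for the acceptance check, as a sketch only. It assumes the regenerated export can be reduced to (sample pid, concept URI) pairs and that expected counts are obtained separately from OpenContext's obj= results. The sample pids and expected numbers below are made-up toy values, and treating "reconciled" as exact equality is an assumption that may need loosening.

```python
def exact_uri_counts(rows, uris):
    """Count distinct samples linked to each URI by exact string match."""
    seen = {uri: set() for uri in uris}
    for sample_pid, concept_uri in rows:
        if concept_uri in seen:  # exact match only, no normalization
            seen[concept_uri].add(sample_pid)
    return {uri: len(pids) for uri, pids in seen.items()}


def reconcile(actual, expected):
    """Return human-readable failures; an empty list means the check passes."""
    failures = []
    for uri, want in expected.items():
        got = actual.get(uri, 0)
        if got == 0:
            failures.append(f"{uri}: no linked records")
        elif got != want:
            failures.append(f"{uri}: export has {got}, OpenContext reports {want}")
    return failures


TIBIA = "https://purl.obolibrary.org/obo/UBERON_0000979"
FEMUR = "https://purl.obolibrary.org/obo/UBERON_0000981"

# Toy rows: placeholder sample pids, one duplicate link, one unrelated URI.
rows = [
    ("sample:1", TIBIA),
    ("sample:2", TIBIA),
    ("sample:2", TIBIA),
    ("sample:3", "https://example.org/other"),
]
counts = exact_uri_counts(rows, [TIBIA, FEMUR])
# Toy expectations, not real OpenContext numbers.
failures = reconcile(counts, {TIBIA: 2, FEMUR: 1})
```

Counting distinct sample pids rather than raw rows keeps duplicate links from inflating the totals; free-text matches never enter the count because only the URI column is compared.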

Downstream (our side)

Once the URIs are in the export → PQG carries them as IdentifiedConcept nodes → we build sample_concepts.parquet (pid, source, concept_uri, concept_label, relation_type; design: pqg/docs/FLAVOR_B_CONCEPT_INDEX_DESIGN.md) → the explorer's described-by=<uri> semi-joins it for cross-domain concept-URI search (#256), with the 4 queries as acceptance tests.

No hard dependency to start: we can build the projection + described-by rail now against the 55,893 iSamples-vocab IdentifiedConcepts already in the data (~37.5M relation rows pre-dedupe); the external (Getty/UBERON) URIs light up once the regenerated export lands.
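The projection and the described-by semi-join described above can be sketched in plain Python. This is illustrative only: the real pipeline goes through PQG and Parquet, the input record shape (a dict with `pid`, `source`, and per-relation lists of IdentifiedConcept-like dicts) is an assumption, and the function names are hypothetical. The output columns follow the sample_concepts.parquet layout listed above.

```python
RELATIONS = (
    "keywords",
    "has_context_category",
    "has_material_category",
    "has_sample_object_type",
)


def project_sample_concepts(records):
    """Flatten records to rows of
    (pid, source, concept_uri, concept_label, relation_type)."""
    rows = []
    for rec in records:
        for relation in RELATIONS:
            for concept in rec.get(relation) or []:
                uri = concept.get("pid")
                if not uri:
                    continue  # text-only keyword: nothing to index by URI
                rows.append({
                    "pid": rec["pid"],
                    "source": rec.get("source"),
                    "concept_uri": uri,
                    "concept_label": concept.get("label"),
                    "relation_type": relation,
                })
    return rows


def described_by(records, sample_concepts, uri):
    """Semi-join: keep records that have at least one row for `uri`."""
    matching = {r["pid"] for r in sample_concepts if r["concept_uri"] == uri}
    return [rec for rec in records if rec["pid"] in matching]


BUCCHERO = "https://vocab.getty.edu/aat/300387149"
# Toy records with placeholder pids.
records = [
    {"pid": "sample:1", "source": "OPENCONTEXT",
     "keywords": [{"pid": BUCCHERO, "label": "bucchero"}]},
    {"pid": "sample:2", "source": "OPENCONTEXT",
     "keywords": [{"label": "bucchero"}]},  # label only, as in today's export
]
sample_concepts = project_sample_concepts(records)
hits = described_by(records, sample_concepts, BUCCHERO)
```

The second toy record mirrors the current situation: the label is present but there is no URI, so it produces no row and is not returned by the semi-join.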

cc Eric (OpenContext). Realistic timeframe: weeks.


Refined after a code-grounded review: confirmed the keywords narrowing in duckdb_utilities.py:17 and the IdentifiedConcept slots in isamples_core.yaml:151.
