Bounded ask (for Eric / OpenContext + whoever owns the export client) to unblock Flavor B concept-URI search (#256). This is not a schema redesign — the iSamples model already supports it; the export currently throws the URIs away.
The gap (verified in code + data)
The iSamples model already defines keyword/descriptor concepts as full IdentifiedConcepts — IdentifiedConcept requires pid, label, scheme_name, scheme_uri (isamplesorg-metadata/src/schemas/isamples_core.yaml:151). And keywords, has_context_category, has_material_category, has_sample_object_type all point to IdentifiedConcept.
But the export client narrows keywords to text only:
# export_client/isamples_export_client/duckdb_utilities.py:17
keywords: 'STRUCT(keyword VARCHAR)[]',
→ the IdentifiedConcept's pid (the URI) + scheme_uri are discarded; only the label text survives. So in the exported GeoParquet, keywords is STRUCT(keyword VARCHAR)[] (no identifier), and the external descriptor URIs (Getty AAT, UBERON/OBO) appear in no field we consume.
Where it's lost: before or during the export mapping, specifically the keywords coercion above and/or the upstream source/index not populating the IdentifiedConcept URIs for these annotations. (PQG and the frontend are not where the URIs get dropped; we verified they are absent there too, which is consistent with a loss further upstream.)
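To make the narrowing concrete, here is a minimal Python sketch. It is not the export client's actual code: the dataclass only mirrors the four IdentifiedConcept slots named above, the helper names narrow_to_keyword_text and carry_full_concept are hypothetical, and the scheme_name/scheme_uri values are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IdentifiedConcept:
    # The four slots the iSamples model requires on a concept.
    pid: str
    label: str
    scheme_name: str
    scheme_uri: str


def narrow_to_keyword_text(concepts):
    """Hypothetical stand-in for the STRUCT(keyword VARCHAR)[] coercion:
    only the label text is kept."""
    return [{"keyword": c.label} for c in concepts]


def carry_full_concept(concepts):
    """Keep every IdentifiedConcept field, so the URI survives."""
    return [asdict(c) for c in concepts]


bucchero = IdentifiedConcept(
    pid="https://vocab.getty.edu/aat/300387149",
    label="bucchero",
    scheme_name="Getty AAT",  # placeholder value
    scheme_uri="https://vocab.getty.edu/aat/",  # placeholder value
)

narrowed = narrow_to_keyword_text([bucchero])  # label text only, no identifier
full = carry_full_concept([bucchero])  # pid and scheme_uri preserved
```

Once a record has the narrowed shape, nothing downstream can recover the URI from it, which is why the fix has to happen at or before this step.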
Evidence (export probe + PQG):
- The 4 example URIs resolve to 0 IdentifiedConcept nodes in PQG; PQG does carry rich concept edges (keywords 15.06M, has_material_category 8.30M, has_sample_object_type 7.65M, has_context_category 6.53M; 55,893 concepts) — just not these external URIs.
- In the export: category fields carry only iSamples controlled-vocab URIs; keywords is text-only; the 4 URIs appear nowhere (bucchero/tibia/femur/whorls).
- OpenContext clearly has the annotations: its API resolves obj={URI} (e.g. UBERON tibia; see https://opencontext.org/about/services).
Eric's guide concepts: getty/aat/300387149 (bucchero), UBERON_0000979 (tibia), UBERON_0000981 (femur), getty/aat/300263796 (whorls).
The ask
Regenerate the export so keyword/descriptor concepts are carried as full IdentifiedConcepts, not flattened to label text. Concretely:
- Stop narrowing keywords to STRUCT(keyword VARCHAR)[]; carry the IdentifiedConcept fields instead: pid (the concept URI, e.g. https://vocab.getty.edu/aat/300387149), label, scheme_name, scheme_uri. (Export-convention field names are fine; the point is the URI must survive.)
- Ensure the source/index actually populates those concept URIs for OpenContext's external subject/descriptor annotations (the ones already queryable via obj={URI}). If the URIs aren't in what the export reads, that's the deeper fix.
- Keep both the existing iSamples controlled-vocab category URIs and the original external annotations; they serve different queries. Don't force external descriptor URIs into the typed has_material_category/has_context_category/has_sample_object_type slots automatically (those are semantically typed top-level categories); put external subject/descriptor concepts in keywords, and only populate a category slot where the mapping genuinely says the concept is that category.
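A rough sketch of what the first and third points could look like in Python. The widened type string is one possible shape, not a proposal for exact field names; slots_for_concept and its explicit_category_mapping argument are hypothetical; and leaving the concept in keywords even when a category mapping exists is one reading of "keep both".

```python
# Current declaration reported above, and one possible widened form.
# The widened type is an assumption; export-convention names could differ.
CURRENT_KEYWORDS_TYPE = "STRUCT(keyword VARCHAR)[]"
WIDENED_KEYWORDS_TYPE = (
    "STRUCT(pid VARCHAR, label VARCHAR, "
    "scheme_name VARCHAR, scheme_uri VARCHAR)[]"
)

CATEGORY_SLOTS = frozenset({
    "has_material_category",
    "has_context_category",
    "has_sample_object_type",
})


def slots_for_concept(concept_uri, explicit_category_mapping):
    """Return the slots an external concept should be written to.

    explicit_category_mapping is a hypothetical dict of
    concept URI -> category slot, holding only mappings that genuinely
    assert the concept is that category.
    """
    slots = ["keywords"]  # external descriptors always stay in keywords
    slot = explicit_category_mapping.get(concept_uri)
    if slot is not None:
        if slot not in CATEGORY_SLOTS:
            raise ValueError(f"unknown category slot: {slot}")
        slots.append(slot)
    return slots
```

With an empty mapping every external concept lands in keywords only, so nothing is promoted into a typed category slot by default.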
Data we need (per concept)
- Required: concept URI, preferred label, the relation it came in on (keywords vs has_*_category), scheme_name/scheme_uri if known.
- Required: preserve exact URI strings, including http vs https and CURIE-vs-full-URI for OBO (e.g. keep https://purl.obolibrary.org/obo/UBERON_0000979 exactly).
- Nice-to-have (do NOT block the first regeneration): altLabels, language tags, broader/narrower hierarchy, provenance of the source OpenContext predicate.
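The exact-string requirement can be checked mechanically. A small sketch, assuming we can get the list of concept URIs from the source annotations and the list that ends up in the export; find_altered_uris is a hypothetical helper, not an existing function in either codebase.

```python
def find_altered_uris(source_uris, exported_uris):
    """Return source URIs that do not appear verbatim in the export.

    Comparison is plain string equality: no scheme, case, or CURIE
    normalization is applied on either side.
    """
    exported = set(exported_uris)
    return sorted(uri for uri in set(source_uris) if uri not in exported)


source = ["https://purl.obolibrary.org/obo/UBERON_0000979"]
rewritten = ["http://purl.obolibrary.org/obo/UBERON_0000979"]  # scheme changed
curie = ["UBERON:0000979"]  # collapsed to a CURIE
```

Both find_altered_uris(source, rewritten) and find_altered_uris(source, curie) report the tibia URI as missing, which is the failure an exact-URI lookup would hit downstream.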
Acceptance criteria
- Exact-URI lookup in the regenerated export (and resulting PQG) returns nonzero linked records for each of the four obj={URI} queries, with counts reconciled against OpenContext's obj= results (not against lexical/free-text matches).
- (Context only: ~2,693 'bucchero' / ~16,577 'tibia' currently match via free-text in label/description — that is not URI-linked evidence; it just shows the concepts are present in the corpus.)
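A sketch of how the criterion could be turned into a check, assuming per-URI counts have already been collected from the regenerated export and from OpenContext's obj= queries. The function name and input shapes are hypothetical, and reading "reconciled" as exact equality is an assumption.

```python
def acceptance_failures(export_counts, opencontext_counts, uris):
    """List (uri, reason) pairs for every URI that fails the criteria.

    Both count arguments are dicts of exact URI string -> number of linked
    records. "Reconciled" is read here as exact equality, which is an
    assumption; a tolerance could be substituted.
    """
    failures = []
    for uri in uris:
        exported = export_counts.get(uri, 0)
        expected = opencontext_counts.get(uri)
        if exported == 0:
            failures.append((uri, "no linked records in export"))
        elif expected is None:
            failures.append((uri, "no OpenContext count to reconcile against"))
        elif exported != expected:
            failures.append((uri, f"export {exported} != OpenContext {expected}"))
    return failures
```

An empty return value for the four guide URIs would mean the criterion is met. The free-text match counts are deliberately not an input here.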
Downstream (our side)
Once the URIs are in the export → PQG carries them as IdentifiedConcept nodes → we build sample_concepts.parquet (pid, source, concept_uri, concept_label, relation_type; design: pqg/docs/FLAVOR_B_CONCEPT_INDEX_DESIGN.md) → the explorer's described-by=<uri> semi-joins it for cross-domain concept-URI search (#256), with the 4 queries as acceptance tests.
No hard dependency to start: we can build the projection + described-by rail now against the 55,893 iSamples-vocab IdentifiedConcepts already in the data (~37.5M relation rows pre-dedupe); the external (Getty/UBERON) URIs light up once the regenerated export lands.
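For orientation, the projection and the described-by semi-join could be sketched like this in plain Python. The output columns follow the list above; the input record shape (a dict with pid, source, and one list of concept dicts per relation) is an assumption, and the real design lives in FLAVOR_B_CONCEPT_INDEX_DESIGN.md.

```python
RELATIONS = (
    "keywords",
    "has_material_category",
    "has_context_category",
    "has_sample_object_type",
)


def project_sample_concepts(records):
    """Flatten sample records into one row per (sample, relation, concept)."""
    rows = []
    for rec in records:
        for relation in RELATIONS:
            for concept in rec.get(relation) or []:
                rows.append({
                    "pid": rec["pid"],
                    "source": rec["source"],
                    "concept_uri": concept["pid"],
                    "concept_label": concept["label"],
                    "relation_type": relation,
                })
    return rows


def described_by(samples, sample_concepts, uri):
    """Semi-join: keep samples linked to the exact concept URI."""
    matching = {row["pid"] for row in sample_concepts if row["concept_uri"] == uri}
    return [s for s in samples if s["pid"] in matching]
```

Because the filter only looks at concept_uri, the same code path serves the iSamples-vocab concepts available today and the external Getty/UBERON URIs once they arrive.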
cc Eric (OpenContext). Realistic timeframe: weeks.
Refined after a code-grounded review: confirmed the keywords narrowing in duckdb_utilities.py:17 and the IdentifiedConcept slots in isamples_core.yaml:151.