iSamples export: carry keyword/descriptor concept URIs (Getty AAT, UBERON…) as IdentifiedConcept — export narrows `keywords` to text, dropping URIs (blocks #256)

**Bounded ask** (for Eric / OpenContext + whoever owns the export client) to unblock Flavor B concept-URI search (#256). This is **not** a schema redesign — the iSamples model already supports it; the export currently throws the URIs away.

## The gap (verified in code + data)

The iSamples model already defines keyword/descriptor concepts as full `IdentifiedConcept`s — `IdentifiedConcept` requires `pid`, `label`, `scheme_name`, `scheme_uri` (`isamplesorg-metadata/src/schemas/isamples_core.yaml:151`). And `keywords`, `has_context_category`, `has_material_category`, `has_sample_object_type` all point to `IdentifiedConcept`.

**But the export client narrows `keywords` to text only:**
```
# export_client/isamples_export_client/duckdb_utilities.py:17
keywords: 'STRUCT(keyword VARCHAR)[]',
```
→ the IdentifiedConcept's `pid` (the URI), `scheme_name`, and `scheme_uri` are discarded; only the `label` text survives. So in the exported GeoParquet, `keywords` is `STRUCT(keyword VARCHAR)[]` (no identifier), and the external descriptor URIs (Getty AAT, UBERON/OBO) appear in **no field we consume**.
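
To make the loss concrete, here is a minimal plain-Python sketch of what a text-only struct does to a concept. The `narrow_keywords` helper and the sample record are hypothetical (this is not the export client's code), and the `scheme_name`/`scheme_uri` values are illustrative placeholders.

```python
# Hypothetical illustration only: mimics coercing IdentifiedConcept-shaped
# keywords into STRUCT(keyword VARCHAR)[]. Not the export client's code.

def narrow_keywords(concepts):
    """Keep only the label text, as a text-only struct would."""
    return [{"keyword": c["label"]} for c in concepts]

# Sample concept; scheme_name and scheme_uri are placeholder values.
concept = {
    "pid": "https://vocab.getty.edu/aat/300387149",
    "label": "bucchero",
    "scheme_name": "Getty AAT",
    "scheme_uri": "https://vocab.getty.edu/aat/",
}

narrowed = narrow_keywords([concept])
print(narrowed)  # [{'keyword': 'bucchero'}]
```

After narrowing, nothing in the row can be matched against a concept URI; only a lexical match on the label remains possible.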

**Where it's lost:** before or during the export mapping: specifically the `keywords` coercion above, and/or the upstream source/index not populating the IdentifiedConcept URIs for these annotations. (PQG and the frontend do not introduce the loss; the URIs are verified absent there because they are already missing from the export.)

**Evidence (export probe + PQG):**
- The 4 example URIs resolve to **0** IdentifiedConcept nodes in PQG; PQG *does* carry rich concept edges (keywords 15.06M, has_material_category 8.30M, has_sample_object_type 7.65M, has_context_category 6.53M; 55,893 concepts) — just not these external URIs.
- In the export: category fields carry only iSamples controlled-vocab URIs; `keywords` is text-only; the 4 URIs appear nowhere (bucchero/tibia/femur/whorls).
- OpenContext clearly has the annotations — its API resolves `obj={URI}` (e.g. UBERON tibia; see https://opencontext.org/about/services).

Eric's guide concepts: `getty/aat/300387149` (bucchero), `UBERON_0000979` (tibia), `UBERON_0000981` (femur), `getty/aat/300263796` (whorls).

## The ask

Regenerate the export so keyword/descriptor concepts are carried as **full `IdentifiedConcept`s**, not flattened to label text. Concretely:

1. **Stop narrowing `keywords` to `STRUCT(keyword VARCHAR)[]`** — carry the IdentifiedConcept fields: `pid` (the concept URI, e.g. `https://vocab.getty.edu/aat/300387149`), `label`, `scheme_name`, `scheme_uri`. (Export-convention field names are fine; the point is the URI must survive.)
2. **Ensure the source/index actually populates those concept URIs** for OpenContext's external subject/descriptor annotations (the ones already queryable via `obj={URI}`). If the URIs aren't in what the export reads, that's the deeper fix.
3. **Keep both** the existing iSamples controlled-vocab category URIs **and** the original external annotations — they serve different queries. Don't force external descriptor URIs into the typed `has_material_category`/`has_context_category`/`has_sample_object_type` slots automatically (those are semantically typed top-level categories); put external subject/descriptor concepts in **`keywords`**, and only populate a category slot where the mapping genuinely says the concept is that category.
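
For item 1, one possible shape of the widened column type is sketched below. The field names follow the `IdentifiedConcept` slots named above; `IDENTIFIED_CONCEPT_FIELDS`, `concept_struct_type`, and `COLUMN_TYPES` are hypothetical names, and the actual export convention is for the client owner to decide.

```python
# Sketch of a widened DuckDB column type for `keywords`.
# All names here are hypothetical; only the STRUCT(...)[] syntax is DuckDB's.
IDENTIFIED_CONCEPT_FIELDS = ("pid", "label", "scheme_name", "scheme_uri")

def concept_struct_type(fields=IDENTIFIED_CONCEPT_FIELDS):
    """Return a list-of-struct type string with one VARCHAR per field."""
    columns = ", ".join(f"{name} VARCHAR" for name in fields)
    return f"STRUCT({columns})[]"

COLUMN_TYPES = {"keywords": concept_struct_type()}
print(COLUMN_TYPES["keywords"])
# STRUCT(pid VARCHAR, label VARCHAR, scheme_name VARCHAR, scheme_uri VARCHAR)[]
```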

### Data we need (per concept)
- **Required:** concept **URI**, preferred **label**, the **relation** it came in on (keywords vs has_*_category), **scheme_name/scheme_uri** if known.
- **Required:** preserve **exact URI strings** — incl. `http` vs `https` and CURIE-vs-full-URI for OBO (e.g. keep `https://purl.obolibrary.org/obo/UBERON_0000979` exactly).
- **Nice-to-have (do NOT block the first regeneration):** altLabels, language tags, broader/narrower hierarchy, provenance of the source OpenContext predicate.
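
A minimal sketch of the per-concept record implied by the required fields, assuming exact string equality is the only URI comparison we ever perform. `ConceptRecord` and `same_concept` are hypothetical names, not part of any existing schema or client.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConceptRecord:
    """Hypothetical per-concept row carrying the required fields."""
    uri: str                           # stored exactly as received, never normalized
    label: str                         # preferred label
    relation: str                      # e.g. "keywords" or "has_material_category"
    scheme_name: Optional[str] = None  # optional: only if known
    scheme_uri: Optional[str] = None   # optional: only if known

def same_concept(a: str, b: str) -> bool:
    """Exact string comparison: no scheme, case, or CURIE rewriting."""
    return a == b

tibia = ConceptRecord(
    uri="https://purl.obolibrary.org/obo/UBERON_0000979",
    label="tibia",
    relation="keywords",
)

# An http variant or a CURIE form is a different string and does not match.
print(same_concept(tibia.uri, "http://purl.obolibrary.org/obo/UBERON_0000979"))  # False
print(same_concept(tibia.uri, "UBERON:0000979"))  # False
```

Any equivalence between variants would be a separate, explicit mapping layered on top; the export itself should not rewrite the strings.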

## Acceptance criteria
- **Exact-URI lookup** in the regenerated export (and resulting PQG) returns **nonzero linked records** for each of the four `obj={URI}` queries, with counts **reconciled against OpenContext's `obj=` results** (not against lexical/free-text matches).
- (Context only: ~2,693 'bucchero' / ~16,577 'tibia' currently match via free-text in label/description — that is *not* URI-linked evidence; it just shows the concepts are present in the corpus.)
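
The check above could be scripted roughly as follows. This is a sketch that assumes "reconciled" means equal counts (a tolerance could be substituted); `reconcile` is a hypothetical helper, and the counts in the example are dummy values, not measurements.

```python
def reconcile(export_counts, opencontext_counts):
    """Return (uri, reason) pairs for every URI that fails the criteria.

    export_counts:      exact-URI linked-record counts from the export/PQG
    opencontext_counts: counts from OpenContext's obj={URI} queries
    """
    failures = []
    for uri, expected in opencontext_counts.items():
        found = export_counts.get(uri, 0)
        if found == 0:
            failures.append((uri, "no linked records in export"))
        elif found != expected:
            failures.append((uri, f"export has {found}, OpenContext has {expected}"))
    return failures

# Dummy counts for illustration only.
BUCCHERO = "https://vocab.getty.edu/aat/300387149"
TIBIA = "https://purl.obolibrary.org/obo/UBERON_0000979"
opencontext = {BUCCHERO: 3, TIBIA: 5}
export = {BUCCHERO: 3}  # tibia URI missing from the export

failures = reconcile(export, opencontext)
print(failures)
```

An empty result means every queried URI has nonzero, matching counts.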

## Downstream (our side)
Once the URIs are in the export → PQG carries them as `IdentifiedConcept` nodes → we build `sample_concepts.parquet` (`pid, source, concept_uri, concept_label, relation_type`; design: `pqg/docs/FLAVOR_B_CONCEPT_INDEX_DESIGN.md`) → the explorer's `described-by=<uri>` semi-joins it for cross-domain concept-URI search (#256), with the 4 queries as acceptance tests.
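
As a rough plain-Python sketch of that pipeline (the real projection is specified in the design doc; the input record shape, the sample pids, and both function names here are assumptions for illustration):

```python
def project_sample_concepts(records):
    """Flatten records into sample_concepts rows, one per (sample, concept, relation)."""
    rows = []
    for rec in records:
        for relation, concepts in rec["concepts"].items():
            for c in concepts:
                rows.append({
                    "pid": rec["pid"],
                    "source": rec["source"],
                    "concept_uri": c["pid"],      # exact URI string, untouched
                    "concept_label": c["label"],
                    "relation_type": relation,
                })
    return rows

def described_by(sample_pids, rows, uri):
    """Semi-join: keep samples that have at least one row with this exact URI."""
    linked = {r["pid"] for r in rows if r["concept_uri"] == uri}
    return [p for p in sample_pids if p in linked]

# Hypothetical records; the pids are made up.
records = [
    {"pid": "sample:1", "source": "OPENCONTEXT", "concepts": {
        "keywords": [{"pid": "https://vocab.getty.edu/aat/300387149", "label": "bucchero"}]}},
    {"pid": "sample:2", "source": "OPENCONTEXT", "concepts": {"keywords": []}},
]
rows = project_sample_concepts(records)
hits = described_by(["sample:1", "sample:2"], rows, "https://vocab.getty.edu/aat/300387149")
print(hits)  # ['sample:1']
```

The semi-join returns each sample at most once, however many concept rows match.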

**No hard dependency to start:** we can build the projection + `described-by` rail now against the **55,893 iSamples-vocab IdentifiedConcepts already in the data** (~37.5M relation rows pre-dedupe); the external (Getty/UBERON) URIs light up once the regenerated export lands.

cc Eric (OpenContext). Realistic timeframe: weeks.

---
*Refined after a code-grounded review: confirmed the `keywords` narrowing in `duckdb_utilities.py:17` and the `IdentifiedConcept` slots in `isamples_core.yaml:151`.*
