Summary
Today the explorer can filter samples by source / material / context / object_type and by free-text search, but it has no notion of a "collection" — a curated, named grouping such as an OpenContext project (e.g. PKAP — Pyla-Koutsopetria Archaeological Project) or a SESAR campaign. Users naturally want to say "show me PKAP" and then layer material/context facets on top of that collection. This issue captures the data analysis behind that gap and proposes a two-phase implementation.
Why "show me PKAP" doesn't work today
PKAP identity does not live on the MaterialSampleRecord rows the explorer queries. It lives on the SamplingSite entity, reachable only by multi-hop traversal:
MaterialSampleRecord → produced_by → SamplingEvent → sampling_site → SamplingSite(label="PKAP Survey Area")
Evidence from /current/wide.parquet (snapshot 202604):
| Where "PKAP" appears | otype | count |
| --- | --- | --- |
| label = "PKAP Survey Area" | SamplingSite | 1,336 |
| (via traversal) | SamplingEvent | 8,169 |
| in sample label | MaterialSampleRecord | only 166 |
PKAP sample labels look like 4063-34, 4111-PK-10 — no "PKAP" string on the sample row. Consequences:
- Text search for "PKAP" matches only ~166 samples (those with it in the label) out of the 15,446 that actually belong to the collection (confirmed by traversal). Unreliable.
- Geographic deep-link (center on lat=34.987406, lng=33.708047 + sources=OPENCONTEXT + search_scope=area) is location-based, not identity-based: it catches nearby non-PKAP samples and misses any PKAP sample plotted elsewhere.
So there is currently no way to express a collection view through the UI.
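The gap between traversal and text search can be reproduced on toy rows. This is a minimal sketch, not the explorer's code: the row shape (pid, otype, label, plus produced_by and sampling_site as lists of target pids) is an assumption loosely modelled on the edge names in the path above, and the actual wide.parquet columns may differ.

```python
from collections import defaultdict

# Toy rows; field names and values are assumptions for illustration only.
ROWS = [
    {"pid": "site:1", "otype": "SamplingSite", "label": "PKAP Survey Area"},
    {"pid": "site:2", "otype": "SamplingSite", "label": "Other Site"},
    {"pid": "ev:1", "otype": "SamplingEvent", "label": "", "sampling_site": ["site:1"]},
    {"pid": "ev:2", "otype": "SamplingEvent", "label": "", "sampling_site": ["site:2"]},
    {"pid": "s:1", "otype": "MaterialSampleRecord", "label": "4063-34", "produced_by": ["ev:1"]},
    {"pid": "s:2", "otype": "MaterialSampleRecord", "label": "4111-PK-10", "produced_by": ["ev:1"]},
    {"pid": "s:3", "otype": "MaterialSampleRecord", "label": "PKAP find 7", "produced_by": ["ev:1"]},
    {"pid": "s:4", "otype": "MaterialSampleRecord", "label": "X-1", "produced_by": ["ev:2"]},
]


def samples_by_site_label(rows):
    """Map each site label to the sample pids reachable via event -> site."""
    by_pid = {r["pid"]: r for r in rows}
    result = defaultdict(list)
    for row in rows:
        if row["otype"] != "MaterialSampleRecord":
            continue
        for event_pid in row.get("produced_by", []):
            event = by_pid.get(event_pid)
            if event is None:
                continue
            for site_pid in event.get("sampling_site", []):
                site = by_pid.get(site_pid)
                if site is not None:
                    result[site["label"]].append(row["pid"])
    return dict(result)


def samples_by_text(rows, needle):
    """Naive label match, standing in for the explorer's free-text search."""
    needle = needle.lower()
    return [
        r["pid"]
        for r in rows
        if r["otype"] == "MaterialSampleRecord" and needle in r["label"].lower()
    ]


via_traversal = samples_by_site_label(ROWS)["PKAP Survey Area"]
via_text = samples_by_text(ROWS, "PKAP")
print(len(via_traversal), len(via_text))
```

In the toy data three samples belong to the site but only one carries "PKAP" in its label, which mirrors the 15,446 vs. ~166 discrepancy in miniature.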
How a "view" works today (for reference)
Per EXPLORER_STATE.md, a view is fully encoded in the URL:
- ?query params (data/filter state): search, sources (CSV of SESAR,OPENCONTEXT,GEOME,SMITHSONIAN), material / context / object_type (CSV of vocabulary URIs), search_scope (area).
- #hash (camera/selection): lat/lng/alt/heading/pitch, mode (cluster|point), pid, h3.
Facets are additive AND-filters; counts cross-filter. Caveat: in cluster (zoomed-out) mode, material/context/object_type filters do not affect the dots — H3 cluster parquets only carry dominant_source. Those facets only "bite" at neighborhood/point zoom (the existing #facetNote). A collection facet would inherit the same constraint unless collection is also added to the H3 summaries.
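To make the query/hash split concrete, here is a small sketch that parses a view URL into filter state and camera state. The base URL is a placeholder, and the assumption that the hash is encoded as key=value pairs joined by & is mine, not something confirmed by EXPLORER_STATE.md.

```python
from urllib.parse import parse_qs, urlsplit

# Query params documented above as comma-separated lists.
CSV_PARAMS = ("sources", "material", "context", "object_type")


def parse_view(url):
    """Split an explorer URL into filter state (?query) and camera state (#hash)."""
    parts = urlsplit(url)
    filters = {}
    for key, values in parse_qs(parts.query).items():
        value = values[-1]  # last occurrence wins if a param repeats
        filters[key] = value.split(",") if key in CSV_PARAMS else value
    # Assumption: the hash uses the same key=value&key=value encoding as the query.
    camera = {key: values[-1] for key, values in parse_qs(parts.fragment).items()}
    return filters, camera


filters, camera = parse_view(
    "https://example.org/explorer?sources=OPENCONTEXT&search=PKAP&search_scope=area"
    "#lat=34.987406&lng=33.708047&mode=point"
)
print(filters)
print(camera)
```

A collection param would slot into the same structure as one more CSV-valued key, which is why the proposal below treats it like material.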
Data feasibility analysis
Probed /current/wide.parquet:
| Finding | Value | Implication |
| --- | --- | --- |
| Distinct SamplingSite labels | 60,268 | Too many for a flat checkbox list; needs search/autocomplete or a curated high-count subset |
| Samples with a site label | 1.63M of 6.35M (~26%) | collection is a sparse facet: mostly OpenContext-style data; most SESAR/GEOME/Smithsonian samples have no site |
| PKAP samples (via traversal) | 15,446 | vs. 166 by text search |
| Traversal join cost | ~0.2s (after column load) | Cheap to precompute; expensive part is scanning the p__* array columns |
Top collections by sample count (the "useful collections" users want):
Çatalhöyük (145,884), Petra Great Temple (108,846), Polis Chrysochous (52,252), Kenan Tepe (42,294), Poggio Civitate (41,679), Ilıpınar (36,947), Čḯxwicən (29,793), Heit el-Ghurab (28,940), Domuztepe (22,394), Emden (20,238).
Verdict: precompute, don't traverse live
Doing the Sample→Event→Site array-join live in DuckDB-WASM per facet interaction is not viable — it's exactly the list_contains/array-join pattern flagged as the in-browser bottleneck in the Dec-2025 query profiling. The fix is to bake the collection onto samples at build time, reusing existing supplementary files rather than adding a heavyweight new per-sample file:
- Add a collection / site_label (+ site_id) column to the existing sample_facets_v2.parquet during the build (dictionary-encoded over 60K values → negligible size bump). The explorer then filters by collection through the same facet machinery as material: no live traversal, no new fetch.
- Add one small collections.parquet dimension (~60K rows: site_id, label, source, n_samples, centroid_lat, centroid_lng, bbox). Powers (a) Phase B preset URLs (centroid → camera) and (b) a future "browse collections" picker.
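The collections dimension is a plain group-by over samples that already carry a site_id. The sketch below shows the aggregation on toy tuples; the input shape, the bbox ordering (west, south, east, north) and the naive arithmetic-mean centroid are all assumptions, and a real build step would presumably run this in DuckDB or similar rather than in pure Python.

```python
from collections import defaultdict

# Toy (sample pid, site_id, site label, source, lat, lng) tuples.
# Values are illustrative, not taken from the snapshot; None means "no site".
SAMPLES = [
    ("s:1", "site:1", "PKAP Survey Area", "OPENCONTEXT", 34.98, 33.70),
    ("s:2", "site:1", "PKAP Survey Area", "OPENCONTEXT", 35.00, 33.72),
    ("s:3", "site:2", "Other Site", "OPENCONTEXT", 10.0, 20.0),
    ("s:4", None, None, "SESAR", 1.0, 2.0),
]


def build_collections(samples):
    """Aggregate per-sample rows into one row per site, largest collection first."""
    groups = defaultdict(list)
    for _pid, site_id, label, source, lat, lng in samples:
        if site_id is not None:  # samples without a site are not in any collection
            groups[site_id].append((label, source, lat, lng))

    rows = []
    for site_id, members in groups.items():
        lats = [m[2] for m in members]
        lngs = [m[3] for m in members]
        rows.append({
            "site_id": site_id,
            "label": members[0][0],
            "source": members[0][1],
            "n_samples": len(members),
            # Naive mean; ignores antimeridian wrap-around.
            "centroid_lat": sum(lats) / len(lats),
            "centroid_lng": sum(lngs) / len(lngs),
            # Assumed order: (west, south, east, north).
            "bbox": (min(lngs), min(lats), max(lngs), max(lats)),
        })
    return sorted(rows, key=lambda r: (-r["n_samples"], r["site_id"]))


collections = build_collections(SAMPLES)
for row in collections:
    print(row["label"], row["n_samples"], row["bbox"])
```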
Proposed implementation (two phases)
Phase B — curated collection presets (quick win, no rebuild)
A small "Collections" page of named canned URLs (camera + sources + search_scope=area, optionally search). Ships immediately to demo the concept; geographically approximate (see flaws above). Good for the top ~10–20 collections.
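A preset is just a URL assembled from a collection's centroid and source. A rough sketch of such a builder follows; the base URL, the default alt value, the mode=point choice and the key=value hash encoding are placeholders I have assumed for illustration.

```python
from urllib.parse import urlencode


def preset_url(base, centroid_lat, centroid_lng, sources, alt=5000, search=None):
    """Build a canned view URL: filters in the query string, camera in the hash."""
    query = {"sources": ",".join(sources), "search_scope": "area"}
    if search:
        query["search"] = search
    camera = {
        "lat": f"{centroid_lat:.6f}",
        "lng": f"{centroid_lng:.6f}",
        "alt": str(alt),   # hypothetical default altitude
        "mode": "point",   # assumed so that facets apply at this zoom
    }
    # safe="," keeps the CSV separator readable instead of %2C.
    return f"{base}?{urlencode(query, safe=',')}#{urlencode(camera)}"


url = preset_url(
    "https://example.org/explorer", 34.987406, 33.708047, ["OPENCONTEXT"]
)
print(url)
```

Because the preset is location-based, it inherits the limitation already described for geographic deep-links: nearby samples from other collections will still appear.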
Phase A — collection as a first-class dimension (the real fix)
- Build step: compute site_label/site_id per sample via the traversal; write into sample_facets_v2 (+ samples_map_lite for display); emit collections.parquet dimension.
- Explorer: add a collection facet wired exactly like material (URL param ?collection=..., applyQueryToFacetFilters, writeQueryState, cross-filter counts) per the EXPLORER_STATE.md contract. Because 60K values is too many for a flat list, render as a searchable/autocomplete facet (or scope the checkbox list to sites with ≥ N samples, full list searchable).
- Optional: add collection/site_id to the H3 summary parquets so the facet also bites in cluster mode (otherwise it behaves like material: neighborhood-zoom only).
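One way to combine the "≥ N samples" checkbox list with a searchable full list is sketched below. The function name, the threshold and the substring matching are hypothetical; the first three counts come from the top-collections list above and the last entry is a made-up low-count site.

```python
COLLECTIONS = [
    {"label": "Çatalhöyük", "n_samples": 145884},
    {"label": "Petra Great Temple", "n_samples": 108846},
    {"label": "Emden", "n_samples": 20238},
    {"label": "Small Site (toy)", "n_samples": 12},  # invented for the example
]


def facet_options(collections, min_samples=1000, query=""):
    """With no query, list only high-count sites; with a query, search every site."""
    q = query.strip().casefold()
    if q:
        pool = [c for c in collections if q in c["label"].casefold()]
    else:
        pool = [c for c in collections if c["n_samples"] >= min_samples]
    return sorted(pool, key=lambda c: (-c["n_samples"], c["label"]))


default_list = [c["label"] for c in facet_options(COLLECTIONS, min_samples=100000)]
search_hits = [c["label"] for c in facet_options(COLLECTIONS, query="small")]
print(default_list)
print(search_hits)
```

Note that casefold does not strip diacritics, so a query such as "catal" would not find "Çatalhöyük"; an autocomplete over these labels would probably want accent-insensitive matching.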
Open questions
- Facet UX for 60K collections: searchable autocomplete vs. curated high-count subset vs. both?
- Should collection be added to the H3 summaries (cluster-mode honesty) or accept neighborhood-zoom-only like the other facets?
- Naming/identity: facet on site_label (human-readable, but collisions possible) vs. site_id (stable, needs a label join via collections.parquet)?
- Do we surface a dedicated "Collections" landing/browse page, or only the facet?
Analysis performed against https://data.isamples.org/current/wide.parquet (snapshot 202604). Companion local scripts: pkap_samples.py, find_pkap_geos.py.