Add a 'collection' dimension to the explorer (e.g. OpenContext PKAP) — precompute site membership, then facet #243

@rdhyee

Summary

Today the explorer can filter samples by source / material / context / object_type and by free-text search, but it has no notion of a "collection" — a curated, named grouping such as an OpenContext project (e.g. PKAP — Pyla-Koutsopetria Archaeological Project) or a SESAR campaign. Users naturally want to say "show me PKAP" and then layer material/context facets on top of that collection. This issue captures the data analysis behind that gap and proposes a two-phase implementation.

Why "show me PKAP" doesn't work today

PKAP identity does not live on the MaterialSampleRecord rows the explorer queries. It lives on the SamplingSite entity, reachable only by multi-hop traversal:

MaterialSampleRecord → produced_by → SamplingEvent → sampling_site → SamplingSite(label="PKAP Survey Area")

Evidence from /current/wide.parquet (snapshot 202604):

| Where "PKAP" appears | otype | count |
|---|---|---|
| label = "PKAP Survey Area" | SamplingSite | 1,336 |
| (via traversal) | SamplingEvent | 8,169 |
| in sample label | MaterialSampleRecord | only 166 |

PKAP sample labels look like 4063-34, 4111-PK-10 — no "PKAP" string on the sample row. Consequences:

  • Text search for PKAP matches only 166 samples (those with the string in their label) out of the 15,446 that actually belong to the collection (confirmed by traversal), so it is unreliable.
  • Geographic deep-link (center on lat=34.987406, lng=33.708047 + sources=OPENCONTEXT + search_scope=area) is location-based, not identity-based: it catches nearby non-PKAP samples and misses any PKAP sample plotted elsewhere.

So there is currently no way to express a collection view through the UI.
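
The undercount from text search versus traversal can be reproduced on toy data. This is a minimal sketch: the dict layout, the identifiers (s1, ev1, site1) and the label "PKAP find 7" are hypothetical and do not reflect the actual wide.parquet schema; only the edge names (produced_by, sampling_site), the site label, and the two example sample labels come from the analysis above.

```python
# Toy records; identifiers and layout are hypothetical, not the wide.parquet schema.
sites = {"site1": "PKAP Survey Area", "site2": "Other Site"}
events = {"ev1": "site1", "ev2": "site1", "ev3": "site2"}  # event -> sampling_site
samples = [
    {"pid": "s1", "label": "4063-34", "produced_by": "ev1"},
    {"pid": "s2", "label": "4111-PK-10", "produced_by": "ev2"},
    {"pid": "s3", "label": "PKAP find 7", "produced_by": "ev2"},
    {"pid": "s4", "label": "unrelated", "produced_by": "ev3"},
]


def by_text(samples, needle):
    """Samples whose own label contains the search string."""
    return [s["pid"] for s in samples if needle.lower() in s["label"].lower()]


def by_traversal(samples, events, sites, site_label):
    """Samples reached via sample -> produced_by -> event -> sampling_site -> site."""
    return [s["pid"] for s in samples
            if sites[events[s["produced_by"]]] == site_label]


text_hits = by_text(samples, "PKAP")
traversal_hits = by_traversal(samples, events, sites, "PKAP Survey Area")
```

Here text search finds only the one sample that happens to carry the string in its label, while traversal recovers every sample attached to the site.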

How a "view" works today (for reference)

Per EXPLORER_STATE.md, a view is fully encoded in the URL:

  • ?query params (data/filter state): search, sources (CSV of SESAR,OPENCONTEXT,GEOME,SMITHSONIAN), material / context / object_type (CSV of vocabulary URIs), search_scope (area).
  • #hash (camera/selection): lat/lng/alt/heading/pitch, mode (cluster|point), pid, h3.
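
As an illustration of that split, the following sketch builds and re-parses the PKAP geographic deep-link mentioned earlier. The base URL is a placeholder, and the key=value&... encoding of the hash is an assumption; EXPLORER_STATE.md remains the authority on the exact format.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_view_url(base, query, camera):
    """Encode filter state in the query string and camera state in the fragment."""
    # CSV-valued params (sources, material, ...) are joined with commas.
    q = {k: ",".join(v) if isinstance(v, (list, tuple)) else str(v)
         for k, v in query.items()}
    # Assumption: the hash uses the same key=value&... encoding as the query.
    return f"{base}?{urlencode(q, safe=',')}#{urlencode(camera)}"


url = build_view_url(
    "https://example.org/explorer",  # hypothetical base URL
    {"sources": ["OPENCONTEXT"], "search_scope": "area"},
    {"lat": 34.987406, "lng": 33.708047, "mode": "point"},
)

# Round-trip: the two halves of the state come back separately.
parts = urlsplit(url)
query_state = parse_qs(parts.query)
camera_state = parse_qs(parts.fragment)
```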

Facets are additive AND-filters; counts cross-filter. Caveat: in cluster (zoomed-out) mode, material/context/object_type filters do not affect the dots — H3 cluster parquets only carry dominant_source. Those facets only "bite" at neighborhood/point zoom (the existing #facetNote). A collection facet would inherit the same constraint unless collection is also added to the H3 summaries.
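
One plausible reading of "additive AND-filters with cross-filtered counts" is sketched below. It assumes that values selected within a single facet are OR-ed and that a facet's counts ignore its own selection; neither detail is confirmed by the text above, and the row values are toy strings rather than real vocabulary URIs.

```python
from collections import Counter


def matches(row, filters, skip=None):
    """AND across facets; within one facet, any selected value matches (assumed)."""
    return all(row[f] in vals
               for f, vals in filters.items() if vals and f != skip)


def cross_filter_counts(rows, filters, facet):
    """Count values of `facet` over rows passing every *other* facet's filter."""
    return Counter(r[facet] for r in rows if matches(r, filters, skip=facet))


rows = [  # hypothetical toy rows
    {"source": "OPENCONTEXT", "material": "rock"},
    {"source": "OPENCONTEXT", "material": "bone"},
    {"source": "SESAR", "material": "rock"},
]
filters = {"source": {"OPENCONTEXT"}, "material": {"rock"}}

visible = [r for r in rows if matches(r, filters)]
material_counts = cross_filter_counts(rows, filters, "material")
source_counts = cross_filter_counts(rows, filters, "source")
```

A collection facet added to this scheme would simply be one more key in `filters`.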

Data feasibility analysis

Probed /current/wide.parquet:

| Finding | Value | Implication |
|---|---|---|
| Distinct SamplingSite labels | 60,268 | Too many for a flat checkbox list; needs search/autocomplete or a curated high-count subset |
| Samples with a site label | 1.63M of 6.35M (~26%) | collection is a sparse facet: mostly OpenContext-style data; most SESAR/GEOME/Smithsonian samples have no site |
| PKAP samples (via traversal) | 15,446 | vs. 166 by text search |
| Traversal join cost | ~0.2s (after column load) | Cheap to precompute; expensive part is scanning the p__* array columns |

Top collections by sample count (the "useful collections" users want):
Çatalhöyük (145,884), Petra Great Temple (108,846), Polis Chrysochous (52,252), Kenan Tepe (42,294), Poggio Civitate (41,679), Ilıpınar (36,947), Čḯxwicən (29,793), Heit el-Ghurab (28,940), Domuztepe (22,394), Emden (20,238).

Verdict: precompute, don't traverse live

Doing the Sample→Event→Site array-join live in DuckDB-WASM per facet interaction is not viable — it's exactly the list_contains/array-join pattern flagged as the in-browser bottleneck in the Dec-2025 query profiling. The fix is to bake the collection onto samples at build time, reusing existing supplementary files rather than adding a heavyweight new per-sample file:

  1. Add a collection / site_label (+ site_id) column to the existing sample_facets_v2.parquet during the build (dictionary-encoded over 60K values → negligible size bump). The explorer then filters by collection through the same facet machinery as material — no live traversal, no new fetch.
  2. Add one small collections.parquet dimension (~60K rows: site_id, label, source, n_samples, centroid_lat, centroid_lng, bbox). It powers (a) Phase B preset URLs (centroid → camera) and (b) a future "browse collections" picker.
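
The two build outputs above can be sketched in plain Python on toy data. This is illustrative only: the input dicts are hypothetical, the centroid is a naive mean of sample coordinates (an assumption, and one that ignores antimeridian wrap), the bbox ordering (min_lng, min_lat, max_lng, max_lat) is likewise assumed, and the source column is omitted for brevity. A real build would read and write parquet.

```python
from collections import defaultdict


def build_collections(samples, events, sites):
    """Return (per-sample facet rows, collections dimension rows)."""
    facet_rows, groups = [], defaultdict(list)
    for s in samples:
        site_id = events.get(s["produced_by"])  # None when the sample has no site
        facet_rows.append({"pid": s["pid"], "site_id": site_id,
                           "site_label": sites.get(site_id)})
        if site_id is not None:
            groups[site_id].append((s["lat"], s["lng"]))

    dimension = []
    for site_id, pts in groups.items():
        lats = [p[0] for p in pts]
        lngs = [p[1] for p in pts]
        dimension.append({
            "site_id": site_id,
            "label": sites[site_id],
            "n_samples": len(pts),
            "centroid_lat": sum(lats) / len(lats),  # naive mean (assumption)
            "centroid_lng": sum(lngs) / len(lngs),
            "bbox": (min(lngs), min(lats), max(lngs), max(lats)),
        })
    return facet_rows, dimension


# Hypothetical toy inputs: two samples on one site, one sample with no site.
samples = [
    {"pid": "s1", "produced_by": "ev1", "lat": 34.0, "lng": 33.0},
    {"pid": "s2", "produced_by": "ev2", "lat": 36.0, "lng": 34.0},
    {"pid": "s3", "produced_by": None, "lat": 0.0, "lng": 0.0},
]
events = {"ev1": "site1", "ev2": "site1"}
sites = {"site1": "PKAP Survey Area"}

facet_rows, dimension = build_collections(samples, events, sites)
```

Samples without a site keep a null site_label, which matches the sparsity noted in the feasibility table.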

Proposed implementation (two phases)

Phase B — curated collection presets (quick win, no rebuild)

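A small "Collections" page of named canned URLs (camera + sources + search_scope=area, optionally search). Ships immediately to demo the concept, but is only geographically approximate: it inherits the flaws of the geographic deep-link described above (catches nearby samples from other collections, misses members plotted elsewhere). Good for the top ~10–20 collections.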

Phase A — collection as a first-class dimension (the real fix)

  • Build step: compute site_label/site_id per sample via the traversal; write into sample_facets_v2 (+ samples_map_lite for display); emit collections.parquet dimension.
  • Explorer: add a collection facet wired exactly like material (URL param ?collection=..., applyQueryToFacetFilters, writeQueryState, cross-filter counts) per the EXPLORER_STATE.md contract. Because 60K values is too many for a flat list, render as a searchable/autocomplete facet (or scope the checkbox list to sites with ≥ N samples, full list searchable).
  • Optional: add collection/site_id to the H3 summary parquets so the facet also bites in cluster mode (otherwise it behaves like material — neighborhood-zoom only).
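
The "scoped checkbox list, full list searchable" idea for the facet could look roughly like the sketch below. The function name facet_options and the substring-matching rule are hypothetical; the rows mimic collections.parquet and reuse sample counts quoted in this issue.

```python
def facet_options(collections, min_samples, query=""):
    """Default list shows sites with >= min_samples; a query searches the full list."""
    q = query.strip().casefold()
    if q:
        pool = [c for c in collections if q in c["label"].casefold()]
    else:
        pool = [c for c in collections if c["n_samples"] >= min_samples]
    # Largest collections first.
    return sorted(pool, key=lambda c: -c["n_samples"])


collections = [
    {"label": "PKAP Survey Area", "n_samples": 15446},
    {"label": "Çatalhöyük", "n_samples": 145884},
    {"label": "Emden", "n_samples": 20238},
]

default_list = [c["label"] for c in facet_options(collections, 20000)]
search_list = [c["label"] for c in facet_options(collections, 20000, "pkap")]
```

With a threshold of 20,000 the default list omits PKAP, but typing "pkap" still surfaces it because search runs over the full set.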

Open questions

  • Facet UX for 60K collections: searchable autocomplete vs. curated high-count subset vs. both?
  • Should collection be added to the H3 summaries (cluster-mode honesty) or accept neighborhood-zoom-only like the other facets?
  • Naming/identity: facet on site_label (human-readable, but collisions possible) vs. site_id (stable, needs a label join via collections.parquet)?
  • Do we surface a dedicated "Collections" landing/browse page, or only the facet?

Analysis performed against https://data.isamples.org/current/wide.parquet (snapshot 202604). Companion local scripts: pkap_samples.py, find_pkap_geos.py.
