Add a 'collection' dimension to the explorer (e.g. OpenContext PKAP) — precompute site membership, then facet

## Summary

Today the explorer can filter samples by **source / material / context / object_type** and by free-text search, but it has **no notion of a "collection"** — a curated, named grouping such as an OpenContext project (e.g. *PKAP — Pyla-Koutsopetria Archaeological Project*) or a SESAR campaign. Users naturally want to say "show me PKAP" and then layer material/context facets on top of that collection. This issue captures the data analysis behind that gap and proposes a two-phase implementation.

## Why "show me PKAP" doesn't work today

PKAP identity does **not** live on the `MaterialSampleRecord` rows the explorer queries. It lives on the **`SamplingSite`** entity, reachable only by multi-hop traversal:

```
MaterialSampleRecord → produced_by → SamplingEvent → sampling_site → SamplingSite(label="PKAP Survey Area")
```
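
To make the gap concrete, here is a minimal sketch of the traversal next to a plain label search. The rows are toy stand-ins, and the field names (`pid`, `produced_by`, `sampling_site`) are illustrative assumptions rather than the actual `wide.parquet` columns.

```python
# Toy stand-ins for the three entity types; all values are illustrative only.
samples = [
    {"pid": "s1", "label": "4063-34", "produced_by": "e1"},
    {"pid": "s2", "label": "4111-PK-10", "produced_by": "e1"},
    {"pid": "s3", "label": "PKAP find 7", "produced_by": "e2"},
    {"pid": "s4", "label": "KT-0001", "produced_by": "e3"},
]
events = {
    "e1": {"sampling_site": "site-a"},
    "e2": {"sampling_site": "site-a"},
    "e3": {"sampling_site": "site-b"},
}
sites = {
    "site-a": {"label": "PKAP Survey Area"},
    "site-b": {"label": "Kenan Tepe"},
}


def site_label_for(sample):
    """Follow sample -> event -> site and return the site label, if any."""
    event = events.get(sample["produced_by"])
    if event is None:
        return None
    site = sites.get(event["sampling_site"])
    return site["label"] if site else None


# Text search only sees the sample's own label.
by_text = [s["pid"] for s in samples if "PKAP" in s["label"]]

# Traversal sees the site the sample actually belongs to.
by_traversal = [s["pid"] for s in samples if site_label_for(s) == "PKAP Survey Area"]
```

In this toy data the label search finds one sample while the traversal finds three, which mirrors the 166 vs. 15,446 gap in miniature.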

Evidence from `/current/wide.parquet` (snapshot 202604):

| Where "PKAP" appears | otype | count |
|---|---|---:|
| `label = "PKAP Survey Area"` | **SamplingSite** | 1,336 |
| (via traversal) | SamplingEvent | 8,169 |
| in sample `label` | MaterialSampleRecord | **only 166** |

PKAP sample labels look like `4063-34`, `4111-PK-10` — no "PKAP" string on the sample row. Consequences:

- **Text search `PKAP`** matches only the 166 samples that carry the string in their label, out of the **15,446** that actually belong to the collection (confirmed by traversal). That is roughly 1% recall, so text search cannot be used to select the collection.
- **Geographic deep-link** (center on `lat=34.987406, lng=33.708047` + `sources=OPENCONTEXT` + `search_scope=area`) is location-based, not identity-based: it catches nearby non-PKAP samples and misses any PKAP sample plotted elsewhere.

So there is currently **no way to express a collection view through the UI.**

## How a "view" works today (for reference)

Per `EXPLORER_STATE.md`, a view is fully encoded in the URL:

- **`?query` params** (data/filter state): `search`, `sources` (CSV of `SESAR,OPENCONTEXT,GEOME,SMITHSONIAN`), `material` / `context` / `object_type` (CSV of vocabulary URIs), `search_scope` (`area`).
- **`#hash`** (camera/selection): `lat/lng/alt/heading/pitch`, `mode` (`cluster`|`point`), `pid`, `h3`.
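
As a rough illustration of that split, the sketch below separates a view URL into filter state and camera state. It is written in Python for brevity (the explorer itself runs in the browser), the host name is a placeholder, and it assumes the hash is encoded as `key=value` pairs joined by `&`, which the summary above does not spell out.

```python
from urllib.parse import parse_qs, urlsplit

# Query params that the state contract describes as comma-separated lists.
CSV_PARAMS = {"sources", "material", "context", "object_type"}


def parse_view(url):
    """Split a view URL into (filter state, camera/selection state)."""
    parts = urlsplit(url)
    query = {}
    for key, values in parse_qs(parts.query).items():
        value = values[-1]  # last occurrence wins if a key repeats
        query[key] = value.split(",") if key in CSV_PARAMS else value
    # Assumption: the hash uses the same key=value&key=value encoding.
    camera = {key: values[-1] for key, values in parse_qs(parts.fragment).items()}
    return query, camera


# Placeholder host; the parameter values come from the PKAP deep-link example.
view_url = (
    "https://explorer.example.org/?sources=OPENCONTEXT&search_scope=area"
    "#lat=34.987406&lng=33.708047&mode=point"
)
query, camera = parse_view(view_url)
```

A `collection` parameter would slot into the query side of this split, next to `material` and the other facets.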

Facets are additive AND-filters; counts cross-filter. **Caveat:** in cluster (zoomed-out) mode, material/context/object_type filters do **not** affect the dots — H3 cluster parquets only carry `dominant_source`. Those facets only "bite" at neighborhood/point zoom (the existing `#facetNote`). A `collection` facet would inherit the same constraint unless `collection` is also added to the H3 summaries.
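
One common reading of "counts cross-filter" is that each facet's counts are computed with every *other* facet's selection applied, but not its own. The sketch below shows that reading on toy rows; it is an assumption about the behaviour, not a description of the explorer's actual code, and the helper names are hypothetical.

```python
from collections import Counter

# Toy sample rows; the values are illustrative only.
rows = [
    {"source": "OPENCONTEXT", "material": "rock"},
    {"source": "OPENCONTEXT", "material": "bone"},
    {"source": "SESAR", "material": "rock"},
    {"source": "SESAR", "material": "rock"},
]


def matches(row, filters, skip=None):
    """AND across facets; an empty selection means 'no constraint'."""
    return all(
        row[facet] in allowed
        for facet, allowed in filters.items()
        if facet != skip and allowed
    )


def cross_filter_counts(rows, filters, facet):
    """Count values of one facet with every other facet's filter applied."""
    return Counter(r[facet] for r in rows if matches(r, filters, skip=facet))


filters = {"source": {"OPENCONTEXT"}, "material": {"rock"}}
visible = [r for r in rows if matches(r, filters)]
source_counts = cross_filter_counts(rows, filters, "source")
material_counts = cross_filter_counts(rows, filters, "material")
```

Under this reading, a `collection` facet would become one more key in `filters`, and its counts would respond to the material and source selections in the same way.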

## Data feasibility analysis

Probed `/current/wide.parquet`:

| Finding | Value | Implication |
|---|---:|---|
| Distinct `SamplingSite` labels | **60,268** | Too many for a flat checkbox list — needs search/autocomplete or a curated high-count subset |
| Samples with a site label | 1.63M of 6.35M (~26%) | `collection` is a **sparse** facet — mostly OpenContext-style data; most SESAR/GEOME/Smithsonian samples have no site |
| PKAP samples (via traversal) | 15,446 | vs. 166 by text search |
| Traversal join cost | ~0.2s (after column load) | Cheap to **precompute**; expensive part is scanning the `p__*` array columns |

**Top collections by sample count** (the "useful collections" users want):
Çatalhöyük (145,884), Petra Great Temple (108,846), Polis Chrysochous (52,252), Kenan Tepe (42,294), Poggio Civitate (41,679), Ilıpınar (36,947), Čḯxwicən (29,793), Heit el-Ghurab (28,940), Domuztepe (22,394), Emden (20,238).

### Verdict: precompute, don't traverse live

Doing the Sample→Event→Site array-join **live in DuckDB-WASM per facet interaction is not viable** — it's exactly the `list_contains`/array-join pattern flagged as the in-browser bottleneck in the Dec-2025 query profiling. The fix is to **bake the collection onto samples at build time**, reusing existing supplementary files rather than adding a heavyweight new per-sample file:

1. **Add a `collection` / `site_label` (+ `site_id`) column to the existing `sample_facets_v2.parquet`** during the build (dictionary-encoded over 60K values → negligible size bump). The explorer then filters by collection through the **same facet machinery** as material — no live traversal, no new fetch.
2. **Add one small `collections.parquet` dimension** (~60K rows: `site_id, label, source, n_samples, centroid_lat, centroid_lng, bbox`). Powers (a) the Phase B preset URLs (centroid → camera) and (b) a future "browse collections" picker.
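
The two build outputs above can be sketched in plain Python over toy rows. This assumes the traversal has already produced a `pid → site_id` mapping. The mean-of-coordinates centroid and the `(min_lng, min_lat, max_lng, max_lat)` bbox order are simplifying assumptions, and a real build step would presumably write Parquet rather than lists of dicts.

```python
from collections import defaultdict

# Assumed output of the build-time traversal: one site per sample that has one.
sample_site = {"s1": "site-a", "s2": "site-a", "s3": "site-b"}
site_labels = {"site-a": "PKAP Survey Area", "site-b": "Kenan Tepe"}

# Toy stand-in for rows of sample_facets_v2; coordinates are made up.
facet_rows = [
    {"pid": "s1", "source": "OPENCONTEXT", "lat": 34.0, "lng": 33.0},
    {"pid": "s2", "source": "OPENCONTEXT", "lat": 36.0, "lng": 35.0},
    {"pid": "s3", "source": "OPENCONTEXT", "lat": 37.5, "lng": 41.0},
    {"pid": "s4", "source": "SESAR", "lat": 10.0, "lng": 20.0},
]


def bake_collection(rows):
    """Attach site_id / site_label to each sample row (None when no site)."""
    baked = []
    for row in rows:
        site_id = sample_site.get(row["pid"])
        baked.append({**row, "site_id": site_id, "site_label": site_labels.get(site_id)})
    return baked


def build_collections(baked):
    """Aggregate baked rows into one dimension row per site."""
    groups = defaultdict(list)
    for row in baked:
        if row["site_id"] is not None:
            groups[row["site_id"]].append(row)
    dimension = []
    for site_id, members in sorted(groups.items()):
        lats = [m["lat"] for m in members]
        lngs = [m["lng"] for m in members]
        dimension.append({
            "site_id": site_id,
            "label": site_labels[site_id],
            "source": members[0]["source"],
            "n_samples": len(members),
            # Naive mean; ignores antimeridian wrap-around.
            "centroid_lat": sum(lats) / len(lats),
            "centroid_lng": sum(lngs) / len(lngs),
            "bbox": (min(lngs), min(lats), max(lngs), max(lats)),
        })
    return dimension


baked = bake_collection(facet_rows)
dimension = build_collections(baked)
```

Samples without a site keep `None` in both new columns, which matches the sparse-facet finding in the table above.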

## Proposed implementation (two phases)

### Phase B — curated collection presets (quick win, no rebuild)
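A small "Collections" page of named canned URLs (camera + `sources` + `search_scope=area`, optionally `search`). Ships immediately to demo the concept, but it is only geographically approximate: like the geographic deep-link described above, it catches nearby samples from other projects and misses collection members plotted elsewhere. Good for the top ~10–20 collections.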
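
A preset of this kind could be generated from a centroid roughly as follows. This is a sketch: `EXPLORER_BASE` is a placeholder host, `preset_url` is a hypothetical helper, and the `key=value&...` hash encoding is assumed.

```python
from urllib.parse import urlencode

EXPLORER_BASE = "https://explorer.example.org/"  # placeholder, not the real host


def preset_url(centroid_lat, centroid_lng, source, search=None):
    """Build a canned view URL: filter state in the query, camera in the hash."""
    query = {"sources": source, "search_scope": "area"}
    if search:
        query["search"] = search
    camera = {"lat": f"{centroid_lat:.6f}", "lng": f"{centroid_lng:.6f}"}
    return f"{EXPLORER_BASE}?{urlencode(query)}#{urlencode(camera)}"


# Coordinates taken from the PKAP deep-link example earlier in this issue.
pkap = preset_url(34.987406, 33.708047, "OPENCONTEXT")
```

Until `collections.parquet` exists, the centroids for the first handful of presets would have to be entered by hand.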

### Phase A — `collection` as a first-class dimension (the real fix)
- Build step: compute `site_label`/`site_id` per sample via the traversal; write into `sample_facets_v2` (+ `samples_map_lite` for display); emit `collections.parquet` dimension.
- Explorer: add a `collection` facet wired exactly like `material` (URL param `?collection=...`, `applyQueryToFacetFilters`, `writeQueryState`, cross-filter counts) per the `EXPLORER_STATE.md` contract. Because 60K values is too many for a flat list, render as a **searchable/autocomplete** facet (or scope the checkbox list to sites with ≥ N samples, full list searchable).
- Optional: add `collection`/`site_id` to the H3 summary parquets so the facet also bites in cluster mode (otherwise it behaves like material — neighborhood-zoom only).
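
The searchable-facet idea can be sketched as a case-insensitive substring match over the collections dimension, ranked by sample count, with an optional minimum-count cutoff. The labels and counts below come from the top-collections list above; the `site_id` values and the `suggest_collections` helper are hypothetical.

```python
def suggest_collections(collections, text, min_samples=0, limit=10):
    """Return the largest collections whose label contains the typed text."""
    needle = text.casefold()
    hits = [
        c for c in collections
        if c["n_samples"] >= min_samples and needle in c["label"].casefold()
    ]
    # Largest collections first; label breaks ties deterministically.
    hits.sort(key=lambda c: (-c["n_samples"], c["label"]))
    return hits[:limit]


# site_id values are made up; labels and counts are from this issue.
collections = [
    {"site_id": "a", "label": "Çatalhöyük", "n_samples": 145884},
    {"site_id": "b", "label": "Petra Great Temple", "n_samples": 108846},
    {"site_id": "c", "label": "Polis Chrysochous", "n_samples": 52252},
    {"site_id": "d", "label": "Poggio Civitate", "n_samples": 41679},
]

suggestions = [c["label"] for c in suggest_collections(collections, "po")]
```

Calling it with an empty string and a `min_samples` threshold gives the scoped checkbox list, so one function could serve both UX options raised under the open questions.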

## Open questions
- Facet UX for 60K collections: searchable autocomplete vs. curated high-count subset vs. both?
- Should `collection` be added to the H3 summaries (cluster-mode honesty) or accept neighborhood-zoom-only like the other facets?
- Naming/identity: facet on `site_label` (human-readable, but collisions possible) vs. `site_id` (stable, needs a label join via `collections.parquet`)?
- Do we surface a dedicated "Collections" landing/browse page, or only the facet?

---
*Analysis performed against `https://data.isamples.org/current/wide.parquet` (snapshot 202604). Companion local scripts: `pkap_samples.py`, `find_pkap_geos.py`.*

