113 changes: 113 additions & 0 deletions COLLECTION_FACET_CODEX_PROMPT.md
@@ -0,0 +1,113 @@
# Codex prompt — Option A: first-class `collection` facet in the iSamples explorer

> Paste the block below into Codex (run from `~/C/src/iSamples/isamplesorg.github.io`,
> which has `AGENTS.md`; the repo root has `.codex/config.toml` with Playwright MCP).
> Tracks issue isamplesorg/isamplesorg.github.io#243. Plan-first with a sign-off gate.

---

```
GOAL
Add a first-class "collection" dimension to the iSamples interactive explorer
(explorer.qmd) so users can filter samples to a named collection — e.g. the
OpenContext project "PKAP Survey Area" — and layer the existing material /
context / object_type facets on top. Full background, data analysis, and the
two-phase plan are in issue #243.

DO THIS IN TWO STAGES. Stage 1: produce a written implementation plan and STOP
for my sign-off. Stage 2 (only after I approve): implement.

=== KEY DESIGN FACTS (already verified — do not re-derive) ===
- A "collection" is the `label` of a SamplingSite entity. It is NOT on the
MaterialSampleRecord rows; it is reached by traversal:
MaterialSampleRecord.p__produced_by[1] -> SamplingEvent
SamplingEvent.p__sampling_site[1] -> SamplingSite.label
(All within the wide parquet; `otype` column distinguishes entity types.)
- Cardinality: ~60,268 distinct SamplingSite labels; only ~1.63M of 6.35M
samples have a site (sparse facet, mostly OpenContext). PKAP = 15,446 samples.
- Doing this traversal LIVE in DuckDB-WASM per interaction is NOT viable (it is
the array-join pattern profiled as the in-browser bottleneck). MUST precompute.
- Data is served from https://data.isamples.org/ (Cloudflare Worker -> R2).
NEVER reference raw pub-*.r2.dev URLs.

=== HOW FACETS WORK TODAY (anchors in explorer.qmd) ===
- Parquet URL constants: R2_BASE (:683), wide_url=/current/wide.parquet (:690),
facets_url=…sample_facets_v2.parquet (:692), facet_summaries_url (:693),
cross_filter_url (:695), vocab_labels_url (:698), lite_url (:687),
h3_res{4,6,8}_url (:684-686).
- The facet filter predicate (:942):
AND pid IN (SELECT DISTINCT pid FROM read_parquet('${facets_url}')
WHERE <conds>)
i.e. per-sample facet values live in sample_facets_v2.parquet, keyed by pid.
- Facet checkbox lists + counts are rendered by renderFilter(...) (:~1792) from
facet_summaries (value -> count); cross-filtered counts use facet_cross_filter.
- material/context/object_type values are vocabulary URIs labeled via
vocab_labels.parquet. NOTE: a collection's "value" is a SamplingSite identity
(site_id) labeled from the NEW collections dimension below — NOT a vocab URI.
- URL/state contract is normative in EXPLORER_STATE.md. The five query params
today are search, sources, material, context, object_type (+ search_scope).
A new `collection` param must follow the SAME lifecycle as `material`:
applyQueryToFacetFilters (hydrate), handleFacetFilterChange ->
writeQueryState() (write-back), cross-filter count recompute, param removed
when empty. Honor the Quarto `?q=` collision note (use `collection`, not `q`).
- Cluster-mode honesty: H3 summary parquets only carry dominant_source, so
material/context/object_type filters do NOT affect zoomed-out clusters (the
#facetNote). A `collection` facet inherits this unless collection is also
added to the H3 summaries — call this out; do not silently break the note.

=== STEP 0 (do first, report findings) ===
Locate the build pipeline that PRODUCES the supplementary parquets
(sample_facets_v2, samples_map_lite, h3_summary_res{4,6,8}, facet_summaries,
facet_cross_filter) and uploads them to R2. They are NOT in this repo's
scripts/. Search the sibling repos and data dirs:
~/C/src/iSamples/{isamples-python,pqg,isamplesorg.github.io-duckdb-spike}
~/Data/iSample/ (esp. pqg_refining/)
and any notebooks. Also read workers/data-isamples-org/README.md for the R2
serving/versioning layer. Report exactly how each file is built and uploaded,
or state that a build path must be created from scratch.

=== STAGE 1 DELIVERABLE: a written plan covering ===
1. Build: a new script (e.g. scripts/build_collections.py) that, from
/current/wide.parquet, computes per-sample (pid -> site_id, site_label) via
the traversal, and emits:
a) collections.parquet — dimension, ~60K rows:
site_id, label, source, n_samples, centroid_lat, centroid_lng,
bbox(min/max lat/lng). Powers the "search the long tail" half of the UX
and the Featured-Collections presets (collections.qmd).
b) an added `site_id` (+ maybe site_label) column on sample_facets_v2
(regenerate as v3 if v2's builder is unavailable; keep pid as the key so
the :942 predicate extends with one more AND condition).
c) collection rows in facet_summaries (site_id -> count) so the checkbox
list + counts render via the existing machinery. Decide whether to add
collection to facet_cross_filter now or defer (note the consequence).
Define a stable site_id (hash of label, or the SamplingSite pid). Specify
versioned filenames + the /current alias, consistent with existing files.
2. Explorer wiring (explorer.qmd), mirroring `material` exactly:
- new collection facet container + a `?collection=` URL param on the
EXPLORER_STATE.md lifecycle.
- DUAL UX (my decision): top-N collections (>= a sample-count threshold) as
checkboxes reusing renderFilter; PLUS a type-to-search input over
collections.parquet for the long tail (60K). Specify how a search-selected
collection becomes an active filter value alongside the checkboxes.
- extend the :942 predicate (or facets subquery) with the collection
condition; ensure cross-filter counts and #facetNote stay correct.
3. data.qmd + collections.qmd updates: document collections.parquet; once the
facet exists, upgrade the Featured-Collections preset links from
geographic-only to a real &collection=<site_id> filter.
4. Test plan: extend tests/ (pytest + Playwright). At minimum a Playwright check
that ?collection=<PKAP site_id> yields the PKAP sample set and that layering
?material=… narrows it; reproducible DuckDB snippets for the counts.
5. Risks / migration: snapshot-version coupling (site_id stability across
rebuilds), the sparse-facet UX for non-collection sources, cluster-mode
honesty, and file-size deltas.

=== CONSTRAINTS ===
- Read AGENTS.md, ../CLAUDE.md, EXPLORER_STATE.md before planning.
- explorer.qmd is ~3,500 lines of working OJS/JS — make INCREMENTAL, additive
changes mirroring existing facet code; do not refactor working paths.
- Quarto OJS gotcha: cells use `name = value`, NOT top-level const/let/var.
- Static site, no hot reload: note where `quarto preview` + browser refresh is
needed to verify.
- Verify against https://data.isamples.org/ only; never raw pub-*.r2.dev.
- STOP after the Stage 1 plan and wait for my approval before writing code.
```
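
The precompute described under KEY DESIGN FACTS can be sketched offline as a two-hop lookup. This is a minimal illustration, not the project's build script. It assumes rows are plain dictionaries, that `p__produced_by` and `p__sampling_site` hold lists of `row_id` values, and that a `source` field is available on the site row. The `collection_id` helper (SHA-256 of source and label, truncated to 16 hex characters) is hypothetical; the real hash function is not specified here.

```python
import hashlib
from collections import Counter


def collection_id(source: str, label: str) -> str:
    """Hypothetical stable id: first 16 hex chars of SHA-256(source, NUL, label)."""
    digest = hashlib.sha256(f"{source}\x00{label}".encode("utf-8")).hexdigest()
    return digest[:16]


def first(refs):
    """First element of a list-valued reference (DuckDB's refs[1]), or None."""
    return refs[0] if refs else None


def sample_to_collection(rows):
    """Map sample pid -> (collection_id, label) via sample -> event -> site.

    Samples with no event, no site, or an unlabeled site are skipped,
    which is why the resulting facet is sparse.
    """
    by_row_id = {r["row_id"]: r for r in rows}
    out = {}
    for r in rows:
        if r["otype"] != "MaterialSampleRecord":
            continue
        event = by_row_id.get(first(r.get("p__produced_by")))
        if event is None:
            continue
        site = by_row_id.get(first(event.get("p__sampling_site")))
        if site is None or not site.get("label"):
            continue
        out[r["pid"]] = (collection_id(site["source"], site["label"]), site["label"])
    return out


# Toy rows standing in for the wide parquet (all values are made up).
rows = [
    {"row_id": 1, "otype": "SamplingSite", "source": "OPENCONTEXT", "label": "PKAP Survey Area"},
    {"row_id": 2, "otype": "SamplingEvent", "p__sampling_site": [1]},
    {"row_id": 3, "otype": "MaterialSampleRecord", "pid": "s1", "p__produced_by": [2]},
    {"row_id": 4, "otype": "MaterialSampleRecord", "pid": "s2", "p__produced_by": [2]},
    {"row_id": 5, "otype": "MaterialSampleRecord", "pid": "s3", "p__produced_by": []},
]

mapping = sample_to_collection(rows)
# Per-collection sample counts, i.e. the n_samples column of the dimension table.
n_samples = Counter(cid for cid, _ in mapping.values())
```

Emitting `mapping` as a per-sample table and `n_samples` as the dimension keeps the browser query to a simple `pid` lookup instead of the array join.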
1 change: 1 addition & 0 deletions EXPLORER_STATE.md
@@ -34,6 +34,7 @@ citations.
| `material` | DOM `#materialFilterBody` checkboxes | omitted (= no filter) | CSV of full URIs | `applyQueryToFacetFilters()` at end of `facetFilters` (`:1061`) | `writeQueryState()` from `handleFacetFilterChange` (`:1642`) | none — checkbox `value` already constrained by render | empty checked set ⇒ param removed (`:459`) |
| `context` | DOM `#contextFilterBody` checkboxes | omitted | CSV of full URIs | same as `material` | same as `material` | none | same |
| `object_type` | DOM `#objectTypeFilterBody` checkboxes | omitted | CSV of full URIs | same as `material` | same as `material` | none | same |
| `collection` | DOM `#collectionFilterBody` checkboxes | omitted (= no filter) | CSV of `collection_id`s (16-hex) | `applyQueryToFacetFilters()` (after the `facetFilters` cell renders top-N ∪ URL ids) | `writeQueryState()` from `handleFacetFilterChange` | none | #243. Values are collection ids from `collections.parquet`, NOT vocab URIs. Filters via a 2nd subquery in `facetFilterSQL()` against `sample_collections.parquet`. NOT cross-filtered (no cross_filter cache); counts shown are the collection's static total. The `#collectionSearch` box adds long-tail rows beyond the top-N checkboxes |
| ~~`view`~~ | _removed in mockup-v1 (#200)_ | — | — | — | — | — | The Globe/Table toggle is gone — the samples table is now permanent below the globe. `writeQueryState()` does `params.delete('view')` to canonicalize legacy bookmarks. See §6 "Mockup-v1 addendum" |
| `search_scope` | local closure `_searchScope` in `zoomWatcher` | omitted (= `world`) | `area` only; absent ⇒ world | `_searchScope` hydrated at top of `zoomWatcher` from `params.get('search_scope')` | `persistSearchScope()` from `doSearch()` and button clicks | exact match `'area'` | sidebar `#sampleSearchSidebar` Enter always submits `world`, never `area` — see §6 mockup-v1 addendum |
| `page` | inner closure `let page = 0` in `tableView` | not in URL | — | — | resets to 0 on `refreshTable()`; ±1 on prev/next | clamped to `[0, totalPages-1]` | **#163 item 6** — table page is intentionally not URL state today; if/when added, must coexist with the cross-filter contract below |
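
The `collection` row's hydrate and write-back rules (CSV of 16-hex ids, param removed when empty) can be sketched as two pure functions. This is an illustration in Python rather than the explorer's JavaScript; the function names are hypothetical, and the 16-hex validation is an assumption added for the sketch (the table lists validation as "none").

```python
import re
from urllib.parse import parse_qs, urlencode

HEX16 = re.compile(r"[0-9a-f]{16}")


def hydrate_collection(query: str) -> list[str]:
    """Parse a query string (no leading '?') into de-duplicated 16-hex ids."""
    raw = parse_qs(query).get("collection", [""])[0]
    seen, ids = set(), []
    for part in raw.split(","):
        part = part.strip()
        if HEX16.fullmatch(part) and part not in seen:
            seen.add(part)
            ids.append(part)
    return ids


def write_collection(query: str, ids: list[str]) -> str:
    """Return the query string with `collection` set to ids, or removed if empty."""
    params = {k: v[0] for k, v in parse_qs(query).items()}
    params.pop("collection", None)
    if ids:
        params["collection"] = ",".join(ids)
    # Keep commas literal so the CSV stays readable in the address bar.
    return urlencode(params, safe=",")
```

Round-tripping through these two functions leaves other params untouched and drops `collection` entirely once the checked set is empty.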
2 changes: 2 additions & 0 deletions _quarto.yml
@@ -14,6 +14,8 @@ website:
text: Home
- href: explorer.qmd
text: Interactive Explorer
- href: collections.qmd
text: Collections
- text: How to Use
menu:
- text: Overview
62 changes: 62 additions & 0 deletions collections.qmd
@@ -0,0 +1,62 @@
---
title: "Featured Collections"
subtitle: "Jump straight to well-known sample collections on the interactive globe"
toc: true
categories: [explore, collections]
---

::: {.callout-note}
**Identity-based collection filtering** (issue
[#243](https://github.com/isamplesorg/isamplesorg.github.io/issues/243)). Each
link applies the explorer's `collection` facet (`&collection=<id>`) so you see
*exactly* that collection's samples — not just whatever is near a location — and
flies the globe to the collection's centroid. From there, layer on the Material,
Sampled Feature, or Specimen Type facets to narrow further.
:::

## How to use these

1. Click **Open in Explorer** — the `collection` facet filters to exactly that
collection's samples and the globe flies to its centroid in point mode.
2. **Layer on facets**: open the *Material*, *Sampled Feature*, or *Specimen
Type* panels and check values to narrow within the collection.
3. **Find any collection** — in the explorer, open the **Collection** panel and
type in its search box; the top ~100 collections also appear as checkboxes.
4. **Share what you see** — the URL captures the full view (`collection` +
other facets + camera), so you can bookmark or send any state you reach.

## Featured collections

These are the largest OpenContext project areas in the current snapshot
(`202604`). PKAP, the example dissected below, is listed first; the other rows
are in descending order of sample count.

| Collection | Source | Samples | Link |
|---|---|---:|---|
| **PKAP — Pyla-Koutsopetria Survey Area** (Cyprus) | OpenContext | 15,446 | [Open in Explorer](explorer.html?collection=dd74c71982da0e21#v=1&lat=34.9836&lng=33.7071&alt=40000&mode=point) |
| Çatalhöyük (Turkey) | OpenContext | 145,884 | [Open in Explorer](explorer.html?collection=20365f0e3b27dc8e#v=1&lat=37.6682&lng=32.8272&alt=40000&mode=point) |
| Petra Great Temple (Jordan) | OpenContext | 108,846 | [Open in Explorer](explorer.html?collection=1ef8673aa89023c1#v=1&lat=30.3287&lng=35.4421&alt=40000&mode=point) |
| Polis Chrysochous (Cyprus) | OpenContext | 52,252 | [Open in Explorer](explorer.html?collection=756f324a7d902068#v=1&lat=35.0349&lng=32.4218&alt=40000&mode=point) |
| Kenan Tepe (Turkey) | OpenContext | 42,294 | [Open in Explorer](explorer.html?collection=732469b20b632815#v=1&lat=37.8307&lng=40.8137&alt=40000&mode=point) |
| Poggio Civitate (Italy) | OpenContext | 41,679 | [Open in Explorer](explorer.html?collection=a5e653d3b3704b95#v=1&lat=43.1529&lng=11.4016&alt=40000&mode=point) |
| Ilıpınar (Turkey) | OpenContext | 36,947 | [Open in Explorer](explorer.html?collection=2308de8c25a27090#v=1&lat=40.4683&lng=29.3091&alt=40000&mode=point) |
| Čḯxwicən (Washington, USA) | OpenContext | 29,793 | [Open in Explorer](explorer.html?collection=84eb590024898ba9#v=1&lat=48.1315&lng=-123.4628&alt=40000&mode=point) |
| Heit el-Ghurab / Giza (Egypt) | OpenContext | 28,940 | [Open in Explorer](explorer.html?collection=cb1775e663696ce6#v=1&lat=29.9711&lng=31.1413&alt=40000&mode=point) |
| Domuztepe (Turkey) | OpenContext | 22,394 | [Open in Explorer](explorer.html?collection=d452bbb04ea0d100#v=1&lat=37.3226&lng=37.0349&alt=40000&mode=point) |
| Forcello Bagnolo San Vito (Italy) | OpenContext | 18,573 | [Open in Explorer](explorer.html?collection=c59e2c8620cde574#v=1&lat=45.0897&lng=10.8754&alt=40000&mode=point) |
| Chogha Mish (Iran) | OpenContext | 16,827 | [Open in Explorer](explorer.html?collection=49e189be61689b3d#v=1&lat=32.2240&lng=48.5559&alt=40000&mode=point) |

## What a preset URL is made of

```
explorer.html
?collection=dd74c71982da0e21 # the collection facet (PKAP Survey Area)
#v=1 # hash schema version
&lat=34.9836&lng=33.7071 # camera target (collection centroid)
&alt=40000 # 40 km altitude → point mode
&mode=point # force individual sample dots
```

The `collection` value is a stable id (a hash of source + collection name) from
`collections.parquet`. To build your own view, apply any combination of facets
and camera in the explorer, then copy the browser's URL — every part of the
state is encoded there.
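
As a rough sketch, a preset link of the shape shown above can be assembled with the standard library. The `preset_url` helper and its defaults are hypothetical; only the parameter names and the example values come from this page.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def preset_url(collection_id, lat, lng, alt_m=40000, mode="point"):
    """Build explorer.html?collection=<id>#v=1&lat=..&lng=..&alt=..&mode=.."""
    query = urlencode({"collection": collection_id})
    fragment = urlencode(
        {
            "v": 1,                 # hash schema version
            "lat": f"{lat:.4f}",    # four decimals, as in the table above
            "lng": f"{lng:.4f}",
            "alt": alt_m,           # metres
            "mode": mode,
        }
    )
    return f"explorer.html?{query}#{fragment}"


url = preset_url("dd74c71982da0e21", 34.9836, 33.7071)

# Reading it back: the facet lives in the query, the camera in the fragment.
parts = urlsplit(url)
facet = parse_qs(parts.query)
camera = parse_qs(parts.fragment)
```

The facet goes in the query string and the camera in the hash, so the two halves can be changed independently.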
1 change: 1 addition & 0 deletions data.qmd
@@ -57,6 +57,7 @@ cite `https://data.isamples.org/<file>`.
| Aggregate map clusters by zoom | [`h3_summary_res{4,6,8}.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res4.parquet) | ≤ 2.4 MB each |
| Filter by material / context / object-type | [`sample_facets_v2.parquet`](https://data.isamples.org/isamples_202601_sample_facets_v2.parquet) | 63 MB |
| Walk relationships (graph queries) | [`isamples_202512_narrow.parquet`](https://data.isamples.org/isamples_202512_narrow.parquet) | 820 MB |
| Browse / filter by collection (e.g. an OpenContext project) | [`collections.parquet`](https://data.isamples.org/isamples_202604_collections.parquet) + [`sample_collections.parquet`](https://data.isamples.org/isamples_202604_sample_collections.parquet) | 3 MB + 13 MB |
| Translate vocabulary URIs to human-readable labels | [`vocab_labels.parquet`](https://data.isamples.org/vocab_labels.parquet) | 58 KB |
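
A collection filter against `sample_collections.parquet` might be expressed as a `pid IN (...)` subquery, in the same style as the existing facet predicate. The sketch below only builds the SQL text. The column names `pid` and `collection_id` are assumptions about that file's schema, and `collection_predicate` is a hypothetical helper, not the explorer's `facetFilterSQL()`.

```python
import re

SAMPLE_COLLECTIONS_URL = (
    "https://data.isamples.org/isamples_202604_sample_collections.parquet"
)
HEX16 = re.compile(r"[0-9a-f]{16}")


def collection_predicate(collection_ids, url=SAMPLE_COLLECTIONS_URL):
    """Return an 'AND pid IN (...)' fragment, or '' when nothing is selected.

    Ids are checked against a 16-hex pattern before being interpolated,
    so no quoting or escaping of user input is needed.
    """
    if not collection_ids:
        return ""
    bad = [c for c in collection_ids if not HEX16.fullmatch(c)]
    if bad:
        raise ValueError(f"not a 16-hex collection id: {bad[0]!r}")
    id_list = ", ".join(f"'{c}'" for c in collection_ids)
    return (
        "AND pid IN (SELECT DISTINCT pid "
        f"FROM read_parquet('{url}') "
        f"WHERE collection_id IN ({id_list}))"
    )


pkap = collection_predicate(["dd74c71982da0e21"])
```

The returned fragment is meant to be appended to an existing `WHERE` clause; an empty selection contributes nothing, matching the "omitted = no filter" convention.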

## 3. Copy-pasteable DuckDB snippets