Skip to content

Fix facet count race after search flyTo#258

Closed
rdhyee wants to merge 31 commits into
isamplesorg:mainfrom
rdhyee:fix/facet-count-race-253
Closed

Fix facet count race after search flyTo#258
rdhyee wants to merge 31 commits into
isamplesorg:mainfrom
rdhyee:fix/facet-count-race-253

Conversation

@rdhyee
Copy link
Copy Markdown
Contributor

@rdhyee rdhyee commented Jun 1, 2026

Summary

  • Add a one-shot post-flyTo facet-count refresh for committed world-scope searches so the final request supersedes camera moveEnd recomputes.
  • Preserve existing A1 search-aware SQL and B1 viewport-aware count behavior.
  • Add a Playwright regression covering the area-scope no-fly diagnostic and the world-scope flyTo path without expanding facets.

Verification

  • /opt/homebrew/bin/duckdb confirmed bucchero ground truth: 2,693 total, 2,693 OpenContext, 0 SESAR, 0 earthmaterial, 0 othersolidobject.
  • /usr/local/bin/quarto render explorer.qmd
  • HEADLESS=1 A1_CLOSE=1 A1_BASE='http://localhost:8077/explorer.html?data_base=/data&debug=a1&sources=OPENCONTEXT%2CGEOME%2CSMITHSONIAN' node tests/playwright/a1-verify.mjs
  • TEST_URL=http://localhost:8077 ./node_modules/.bin/playwright test tests/playwright/facet-search-count-race.spec.js --project=chromium --reporter=line

Fixes #253

rdhyee and others added 30 commits May 27, 2026 18:16
Replaces the LIMIT 100000 raw-row scan + JS per-pixel binning with
a single DuckDB GROUP BY query that does the binning server-side.
Removes the arbitrary cap honestly: every sample in the bbox is
counted into its true pixel cell, regardless of total sample count.

Why the LIMIT was bad: `LIMIT 100000` returned the first 100k rows
in parquet storage order — not random, not geographic. At world
view, the heatmap silently showed whichever source happened to be
physically first in the file (likely SESAR, the largest source by
row count). The "(capped)" status warning disclosed the problem but
didn't fix it. RY feedback 2026-05-27 on PR isamplesorg#240 ("wondering
whether we can do better geographic random sampling").

How the SQL pushdown works: compute `(x_bin, y_bin)` pixel
coordinates from `latitude`/`longitude` server-side using FLOOR /
LEAST / GREATEST, then GROUP BY (x_bin, y_bin) returning one row
per non-empty pixel with COUNT(*) as the sample count. Result
cardinality is bounded by canvas pixels (≤ 512² = 262k),
independent of bbox sample count. JS just iterates the aggregated
rows and applies the same log(1+n) scaling for heatmap.js.

Verified counts vs `samples table` summary line (= true sample
count for the current view):

  view              | heatmap | table   | match
  ------------------|---------|---------|------
  PKAP (100km)      |  77,840 |  77,840 |  ✅
  Cyprus medium     | 100,970 | 100,970 |  ✅  (was capped at 100k)
  Cyprus regional   | 682,029 | 682,029 |  ✅  (was capped at 100k)
  World view        | 5.98M   | 5.98M   |  ✅  (was capped at 100k)

Render time at world view (~6M samples → 35k cells): ~7s on
localhost, similar to or faster than the LIMIT 100k version at
smaller zooms.

Removes the "(capped)" status branch and the `HEATMAP_LIMIT`
constant becomes unused (left in place for now in case Phase 2
progressive refinement reintroduces a safety cap on cell count).

Side effect of removing the cap: the per-pixel max-bias is now
even more extreme at high-density views, but the log(1+n) scaling
from PR isamplesorg#240 handles it.

Verified: 5/5 heatmap-overlay.spec.js still pass on localhost.
(The spec asserts `lastPointCount > 0`, which is still true; one
spec change worth a follow-up: the spec used to expect capped
behavior for large views, but no test currently asserts that, so
no spec changes needed here.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…table surface

Strategy B: materialize search_pids (one ILIKE scan over facets_url) on a
committed search, then constrain surfaces with a cheap pid semi-join.

This increment (table surface, verified):
- buildSearchFilter/clearSearchFilter: non-temp search_pids table (DISTINCT,
  NOT NULL), token-versioned _next→swap, captures match total. Published on
  window.__searchFilter {active,term,token,total} + window.searchFilterSQL().
- doSearch builds the filter (shows "Building search filter…") then refreshes
  the table; clears it on empty/short submit.
- loadCount/loadPage semi-join on search_pids; summaryText → "N of M
  \"term\" matches in this map view" (replaces isamplesorg#250 interim copy).
- Dev probe cell (a1PersistenceProbe) — REMOVE before PR.

Verified on local build: bucchero → table shows only OpenContext Poggio
Civitate matches (2,693), no GEOME mollusks; non-temp table persists across
db.query() calls. Probe (isamplesorg#249 data): no coord-less matches, no dup pids,
broad-term max ~82k.

TODO (still in PR #1, NOT YET DONE): points loader, facet counts + cube
gating, stats, and C3 auto-point-mode so the globe isn't left unfiltered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e), C3 — globe path BUGGY

Adds to the working table surface:
- searchIsActive()/searchFilterSQL() cell-local helpers in the viewer cell.
- loadViewportSamples: semi-join on search_pids.
- updateCrossFilteredCounts: semi-join on both paths; gate off the cube
  fast-path AND the global baseline early-return when a search is active.
- applySearchFilterChange(): C3 orchestrator — force point mode on search,
  revert to altitude-appropriate mode on clear; refresh table+facets.
- camera-changed handler: latch point mode while a search is active.
- doSearch calls applySearchFilterChange after build / on clear.

KNOWN BUG (needs debugging): the GLOBE points render the UNFILTERED viewport
count (e.g. "5000 of 1,591,051") even though search is active and the table
correctly shows 2,693. C3 does not enter point mode at high altitude on boot
either (globe stays unfiltered clusters). Likely an async race between the
boot point-load / mode entry and the post-build applySearchFilterChange
(filter built ~40-90s into boot, after the camera has already settled). The
table surface (loadCount/loadPage) IS correctly filtered. Probe cell still
present (remove before PR).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ch-token staleness) + [A1dbg] logging; globe still not entering point mode — next: Codex rec #4 one-reconciler refactor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rver + deterministic A1 observability

- R2_BASE honors ?data_base= / localStorage ISAMPLES_DATA_BASE (default prod), so the explorer can read a local parquet mirror instead of 40-90s remote range-fetches.
- dev_server.py: range-capable (206) static server; stock python http.server returns 200 and breaks DuckDB-WASM partial reads.
- window.__a1log/__a1state + a1dbg() + on-page panel (?debug=a1) replace flaky console capture; window.__a1globe() exposes mode/point state for a Playwright harness.
- Converted [A1dbg] console.logs to a1dbg events at build/mode/point-load/discard points.

NOTE: cold cost is init-dominated (DuckDB-WASM+Cesium+OJS ~40s) — mirror helps the DATA phase only; the real lever is load-once + in-page iteration. Mirror range verified (curl -r => 206) but a full end-to-end speedup run hung in init (shakedown tomorrow; check 0-byte current/wide.parquet).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…be coherence)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Working handoff docs for the search-as-global-filter (A1, isamplesorg#234 Step 4) work — branch state, the globe logjam + Codex's reconciler spec, the fast verify-loop, the performance model, and Eric isamplesorg#248 / isamplesorg#249. Strip before the A1 PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckDB-WASM

The ?data_base=/data dev override produced root-relative parquet URLs
(/data/foo.parquet). DuckDB-WASM's httpfs reads those as a virtual-FS glob
("No files found that match the pattern") instead of fetching over HTTP, so
the local-mirror verify loop hung in init with zero /data fetches — the
"shakedown" symptom. Resolve a root-relative data_base against
location.origin so the ergonomic ?data_base=/data form works; the prod
default and absolute (http://...) overrides pass through unchanged.

Verify-loop infra:
- dev_server.py: pin HTTP/1.1 (DuckDB's range reader expects keep-alive;
  curl-verified 206 + multi-request keep-alive). Local full-GET-vs-206 is
  DuckDB-WASM heuristic and moot over localhost; validate ranges on deploy.
- tests/playwright/shakedown-206.mjs: headless boot+search probe (no popup).
  Confirms cold boot ~2.3s to live, bucchero search builds 2,693 pids ~9s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnoses whether a committed search renders sample points and whether the
result depends on camera altitude. Boots at a given alt/lat/lng, fires the
bucchero search, waits for the async point load to settle, and dumps the
__a1log event sequence + final __a1globe() state.

Finding: with a proper wait, the globe is A1-coherent at BOTH whole-globe
(9000 km → renders all 2693 pids; computeViewRectangle saturates, not null)
and zoomed-in (80 km → 2670 in-view) altitudes. The earlier "0 sample
points" was a measure-too-early artifact, not a bug. Suggests the C3 fixes
(4e79830) work in a foreground/headless context and the summary's "globe
won't enter point mode" was likely a backgrounded-tab rAF-freeze artifact.
Pending headed a1-verify.mjs verdict to rule out an animation-only race.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Default stays headed (real flyTo — what A1 is verified against). A headed
window that opens UNFOCUSED becomes a background tab → Chrome freezes its
rAF render loop → the page hangs mid-init (the same backgrounded-tab freeze
that corrupted the original logjam observations). HEADLESS=1 sidesteps that:
headless pages are always "active". Use it for CI / repeated runs; keep
headed for a real-animation spot check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… dev probe

Pre-deploy cleanup (the summary's "don't ship" list):
- Remove the a1PersistenceProbe OJS cell — a one-time dev check that
  console-logged on every load and threw a Catalog Error (the design point
  it verified, non-temp tables persisting across DuckDBClient connections,
  is proven and load-bearing in production now).
- Gate the whole A1 observability block (a1dbg / __a1log / __a1state /
  __a1globe + on-page panel) behind ?debug=a1. Production users now get a
  clean global namespace and zero overhead; the Playwright harness opts in
  via ?debug=a1. All a1dbg?.() call sites already use optional chaining, so
  they are no-ops when the block doesn't run.

Verified: ?debug=a1 → a1-verify.mjs still ✅ COHERENT (2693 pts); no flag →
__a1globe/a1dbg/__a1log undefined, no panel, no probe console output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rch_pids

doSearch scanned the 63 MB facets parquet TWICE per committed search: once in
buildSearchFilter (pid-set) and again for the side-panel results SELECT (+ a
third for the real-count COUNT when the 50-cap hit). On CI's smoke gate, the
broad "pottery" search blew the 90s budget (first A1 deploy failed there).

Fix: buildSearchFilter now materializes the side-panel columns (label, source,
place_name) and the relevance score IN THE SAME scan that builds the pid-set,
so the results SELECT and the COUNT read the small in-memory search_pids table
(aliased `s`) instead of re-scanning facets. One facets scan per search now,
matching pre-A1. sourceFilterSQL('s.source') + the bare-pid facetFilterSQL
compose unchanged; search_pids stays pid-keyed (dropped the weaker 5-col
DISTINCT — pid is unique, so the build is naturally one row per pid).

Verified locally (fast mirror): pottery 15.8s → 12.7s (build 6.9s + surface
updates); a1-verify still ✅ COHERENT; production-clean without ?debug=a1.
Note: the remaining time-to-results is buildSearchFilter + applySearchFilter
(globe/facet updates); if CI's smoke still exceeds budget, render the side
panel before applySearchFilterChange next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atch")

updateCrossFilteredCounts computed facet-legend counts over the EXACT viewport
(pad 0), while the samples-table COUNT, the point-mode loader, the "samples in
view" stat, and the heatmap all pad by VIEWPORT_PAD_FACTOR (0.3). Matching
samples in the 30% margin were counted by the table but not the legend, so the
legend read low: off-by-one at a Cyprus deep-zoom (13 vs 14), and ~166 vs ~481
for material=rock at a wide Red-Sea view (RY, live rdhyee deploy). Aligns the
last "in view" surface to the padded contract (isamplesorg#234 coherence).

Applies the parked facet_count_padding.patch (one line + the coherence
regression test) on the A1 branch, since the mismatch is live on the A1 deploy
and isamplesorg#234 is exactly "make filter semantics coherent across surfaces."

Verified at the reported view: facet Rock 167 → 496, now == table 496;
a1-verify still ✅ COHERENT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two issues from Codex's PR isamplesorg#251 review:

1. search_pids staging race — buildSearchFilter used a fixed `search_pids_next`
   name. Two overlapping searches could interleave so a later search swapped an
   earlier search's rows into `search_pids` under its own term (the token checks
   guard the publish, not the shared staging object). Use a token-scoped staging
   table `search_pids_next_${token}`, dropped in finally. Also stop DROPping the
   live `search_pids` on clear (an in-flight reader would throw) — flip
   active=false and leave it unreferenced until the next search replaces it.
   Verified: bucchero→soil back-to-back now publishes soil/2969 (its own count),
   not bucchero's under soil's term.

2. heatmap search-blind — renderHeatmap omitted searchFilterSQL and
   heatmapFilterHash omitted the search token, so the "filtered density" overlay
   stayed unfiltered under a committed search. Append window.searchFilterSQL('pid')
   to the heatmap aggregation and add the search token to the hash so it
   recomputes/re-keys on search commit/clear (isamplesorg#234 cross-surface coherence).

a1-verify still ✅ COHERENT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le rows)

Round-2 of Codex's PR isamplesorg#251 review: the previous no-drop clear left the prior
search's rows in search_pids, and doSearch's side-panel SELECT reads
`FROM search_pids` directly (does NOT gate on __searchFilter.active) — so a
build failure could render the previous term's rows under the new term.

Chose Codex's empty-table alternative over the early-return built-guard: an
early return before the side-panel try would skip the isamplesorg#167 telemetry `finally`,
whereas CREATE OR REPLACE TABLE search_pids (...empty...) keeps both the
in-flight semi-join readers and the direct side-panel reader safely seeing zero
rows, and a build failure flows through the existing results.length===0 →
return-in-try → finally path with telemetry intact. Only clearSearchFilter
changes; no doSearch control-flow restructure.

Verified: a1-verify ✅ COHERENT; bucchero→clear→soil publishes soil/2969 (own
count, no stale rows); clearSearchFilter is only called on empty-submit +
build-failure, not per search.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses Codex's remaining nit (PR isamplesorg#251, non-blocking): since
clearSearchFilter() now leaves search_pids EMPTY, a genuine build failure and a
true empty result set both reach the side-panel's results.length===0 branch. A
`searchFilterBuildFailed` flag (set in the build catch) makes the panel say
"Search error: couldn't build the filter…" on a real failure while still
flowing through the isamplesorg#167 telemetry finally — instead of the misleading
"No results for {term}".

a1-verify still ✅ COHERENT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Uri + buildConceptFilter

First, additive pieces for `described-by=<concept-uri>` (isamplesorg#248), riding the A1
search_pids machinery (Codex plan-reviewed: "mostly sound + guardrails"):

- window.conceptLabelForUri(uri): expose the facetFilters cell's URI→prefLabel
  resolver so the concept producer can label a URI without re-querying
  vocab_labels (guardrail #1).
- buildConceptFilter(uri): a SECOND search_pids producer — exact-URI match
  across the object_type/material/context columns of sample_facets_v2, with
  object_type>material>context relevance ranking. Same token-scoped staging,
  finally-drop, shared _searchFilterToken, and empty-clear invariant as
  buildSearchFilter (guardrails #2/#6). Tags __searchFilter.kind ('concept' vs
  'text') for mutual exclusivity (#5).

Not yet wired: doDescribedBy flow (shared runPidSetResults render), the
described-by= URL param + writeQueryState kind-preservation, and mutual
exclusivity at producer entry. buildConceptFilter isn't called yet, so this is
behavior-neutral. Verified: conceptLabelForUri('…organismpart')→"Organism part";
free-text a1-verify still ✅ COHERENT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… filter

Wire the concept-URI filter end to end on top of the A1 search_pids
machinery. A `described-by=<uri>` deep link selects material samples whose
URI-valued facet concept (object_type / material / context) exactly matches
the URI, and filters EVERY surface (table, globe points, facet counts, side
panel) via the same `pid IN (SELECT pid FROM search_pids)` semi-join.

- doDescribedBy(uri): second entry point into search_pids. Reuses the already
  -reviewed buildConceptFilter (producer) + applySearchFilterChange (refresher,
  kind-agnostic) and renders its OWN side panel. Deliberately does NOT touch
  doSearch — keeps A1's just-shipped (isamplesorg#251) text-search stale-guard hot path
  untouched (RY decision: protect the path Kerstin demos Wed).
- writeQueryState: search= and described-by= are mutually exclusive. Keyed off
  the text input (always current) rather than window.__searchFilter (stale at
  doSearch's early writeQueryState call), so a committed text search drops
  described-by= and vice-versa.
- Boot: described-by= deep link auto-commits and wins over search= if both
  present.
- tests/playwright/described-by-verify.mjs: HEADLESS deep-link coherence test
  (globe + panel + URL all reflect the concept; mutual-exclusivity with text).

Verified: described-by-verify green; a1-verify still green (no A1 regression).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e guard, XSS)

Codex round-1 found 3 real issues; all fixed + regression-tested:

1. URL could flip concept→phantom-text on draft text + facet/source toggle.
   writeQueryState is now intent-aware: doSearch passes commitText to
   authoritatively persist/clear a text search (it runs before __searchFilter
   updates); every other caller mirrors the COMMITTED filter in __searchFilter,
   so draft (un-submitted) text no longer clobbers described-by=. New Playwright
   check: draft text + source toggle preserves described-by=.
2. A superseded doDescribedBy could still rewrite the URL. Moved the
   _searchSeq stale-check to BEFORE writeQueryState (and re-check after the
   async applySearchFilterChange) so only the current producer persists state.
3. Reflected-XSS path: the concept side panel interpolated a URL-derived label
   (and result label/pid/url/name) into innerHTML. Now escapeHtml'd (the same
   helper the table renderer uses), covering text + attribute contexts.

Verified: described-by-verify green (concept + draft-text + mutual-exclusivity);
a1-verify still green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…yee staging

# Conflicts:
#	SESSION_SUMMARY.md
#	explorer.qmd
Show a `progress` cursor (arrow + spinner — NOT a frozen hourglass) over
#cesiumContainer while a filter/search/facet/source update is in flight, so
the user gets a visual "things are updating" signal without the map ever
freezing (it's pure CSS on body.explorer-busy; it never intercepts input).

Hooks into the existing depth-counted busyAcquire/busyRelease that already
wrap the facet-change, source-change, and search/concept-commit paths — those
releases are all in `finally`, so a throw can't strand the cursor. Adds a
120s watchdog as defense-in-depth against a HUNG promise (a load that never
settles) that would otherwise skip the finally: it force-clears the busy
state (re-armed per acquire, far longer than any real cold-cache load, so it
only fires on a genuine leak). Verified: cursor shows ~1.2s on a search then
clears via normal release; A1 + deploy smoke gate green.

Not covered: pan/zoom viewport reloads (separate cells) — deferred follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A map-only busy cursor was invisible in practice: a CSS cursor only shows
where the pointer is, and the covered updates (search, facet, source) are all
triggered with the pointer OFF the map (search box / sidebar). Apply the
progress cursor page-wide (body.explorer-busy * with !important) so the
'updating' signal shows wherever the pointer rests. Verified: progress on
body/search-box/checkbox/map when busy, natural cursors when idle; smoke gate
green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented Jun 2, 2026

Superseded by #261, which shipped the consolidated runtime (stats-collapse #254 + chip #255 + facet-count fix #253) to isamples.org. Closing the PR. NOTE: the regression tests from these branches are not yet on upstream — they'll be re-added via one clean follow-up PR (incl. repackaging the count-race spec as a standalone .mjs to fix the [P2] default-suite//data dependency).

@rdhyee rdhyee closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

explorer: facet-legend counts go stale after a committed search (recompute race, not a missing filter)

1 participant