Add keyset pagination and time-range filters to the run export api #519
Open
giulianoconte wants to merge 2 commits into
GET /api/exports/runs now accepts optional limit / start / end / cursor params so clients can pull the run corpus in reliable chunks instead of one ~7GB stream that resets before finishing. With no params the response is unchanged (the full export).

Runs are ordered by (submitted_at, _id):
- limit=N bounds a page; when more runs follow, the response carries an X-Next-Cursor header to pass back as cursor=. Ascending order means runs submitted during a long bootstrap sort after the cursor, so a forward pager never misses or double-counts them.
- start/end restrict to a half-open [start, end) submitted_at window for incremental "since my last sync" pulls.

Backed by a new ascending (submitted_at, _id) index so the ordered scan runs from the index with no blocking in-memory sort (it's an index-ordered scan with a fetch+filter on character, not a covered query).

Rate limiting moves to a per-request cost: a bounded page costs 1 against a 120/hour bucket while an unbounded full dump costs 60, preserving the old 2/hour ceiling for the heavy path.

The run-export metric splits so paged pulls don't silently inflate the full-dump count: spire_codex_run_exports_total still counts unbounded dumps, and a new spire_codex_run_export_pages_total counts bounded pages.

The "forward pager never misses a new run" guarantee holds only because every new run is stamped with a real submitted_at and so sorts at the end; a run inserted without one would fall into the leading null block and be silently dropped from the export. _ensure_run_validator enforces that with a $jsonSchema validator (submitted_at must be a BSON date), applied idempotently on first collection access alongside the indexes. Level "moderate" gates new inserts but spares updates to pre-existing legacy null docs; best-effort, so it logs and continues if collMod can't run.
Pure-function pytest coverage of the cursor codec, the keyset/range match builder, ISO parsing, and the rate-limit cost - no database or app needed (the router imports the collection lazily). Adds pytest as a single dev dependency (requirements-dev.txt) plus a minimal pytest.ini; the repo had no test suite. Run with 'pytest' from backend/.
Overview
Add keyset pagination and time-range filters to the run export api.
The bulk `GET /api/exports/runs` export is currently hundreds of thousands of runs. Pulling it in one shot is unreliable - the on-the-fly stream can get reset partway through (observed: Cloudflare closing the HTTP/2 stream before the dump completed), in addition to taking a while. This adds an opt-in way to pull it in resumable, bounded chunks. Calling the endpoint with no parameters results in the same behavior as before (a full export).

Changes:
| File | Change |
|---|---|
| `backend/app/routers/exports.py` | `limit`, `start`, `end`, `cursor` params |
| `backend/app/services/runs_db_mongo.py` | `(submitted_at, _id)` index; `_ensure_run_validator` validates that new run submissions have `submitted_at` |
| `backend/app/metrics.py` | `run_export_pages` counter |
| `backend/tests/test_export_helpers.py`, `backend/pytest.ini`, `backend/requirements-dev.txt` | `pytest` (single dev dependency) + pure-function unit tests for the export pagination helpers |

Behavior
`GET /api/exports/runs` gains four optional query params:

| Param | Description |
|---|---|
| `limit` | Maximum number of runs in a page (1 to 50000). |
| `start` | Inclusive lower bound on `submitted_at` (ISO-8601). |
| `end` | Exclusive upper bound on `submitted_at` (ISO-8601). |
| `cursor` | Opaque value from the previous page's `X-Next-Cursor` header. |

- Ordering is `(submitted_at, _id)` ascending, always.
- With `limit`, a full page returns an opaque `X-Next-Cursor` response header. Pass it back as `cursor=` to get the next page; its absence means you've reached the end. Ascending order means runs submitted during a paginated scan sort after the cursor, so a forward pager never misses or double-counts them. In other words, you can page through the whole corpus while clients submit runs concurrently; those runs will be picked up before you reach the end.
- `start`/`end` give a half-open `[start, end)` window on `submitted_at`, for incremental "everything since my last sync" pulls. Combine with `limit` to also bound each page.

Implementation
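As a rough illustration of how a `(submitted_at, _id)` keyset scan with a half-open `[start, end)` window can be expressed as a MongoDB filter, here is a minimal sketch. The function name `build_match_sketch` and the cursor shape (a `(submitted_at, _id)` tuple, with `None` standing for the leading null block) are assumptions made for illustration; the PR's actual `_build_match` may be structured differently.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple


def build_match_sketch(
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    cursor: Optional[Tuple[Optional[datetime], Any]] = None,
) -> Dict[str, Any]:
    """Build a filter dict for one page of an ascending (submitted_at, _id) scan.

    Hypothetical helper: it only builds a dict and never touches a database.
    """
    clauses: List[Dict[str, Any]] = []

    # Half-open [start, end) window on submitted_at.
    window: Dict[str, Any] = {}
    if start is not None:
        window["$gte"] = start
    if end is not None:
        window["$lt"] = end
    if window:
        clauses.append({"submitted_at": window})

    if cursor is not None:
        last_ts, last_id = cursor
        if last_ts is None:
            # Cursor is still inside the leading null block: keep walking the
            # nulls by _id, then continue into every dated run.
            clauses.append({"$or": [
                {"submitted_at": None, "_id": {"$gt": last_id}},
                {"submitted_at": {"$ne": None}},
            ]})
        else:
            # Past the nulls: same timestamp with a larger _id, or any later
            # timestamp.
            clauses.append({"$or": [
                {"submitted_at": last_ts, "_id": {"$gt": last_id}},
                {"submitted_at": {"$gt": last_ts}},
            ]})

    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(build_match_sketch(start=t0, cursor=(t0, "run-42")))
```

Such a filter would be paired with an ascending sort on `(submitted_at, _id)` and a `limit`, so each page resumes strictly after the last run returned.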
- Index (`runs_db_mongo.py`): adds ascending `(submitted_at, _id)`. The existing `{submitted_at: -1}` index is descending and lacks the `_id` tiebreaker, so it can't serve the ordered keyset scan. With the new index the scan is index-ordered with no blocking in-memory sort (`IXSCAN -> FETCH -> LIMIT`, no SORT stage). The export filters to official characters (`$in OFFICIAL_CHARACTERS`); since `character` isn't in the index, it stays a post-fetch filter (the `FETCH` evaluates it), so the query isn't covered. That's fine today because official-character runs dominate the corpus, so few fetched docs are rejected. The submit path does store modded-character runs (they're excluded only at read/snapshot time), so if modded runs ever became a large fraction of stored runs, a `(character, submitted_at, _id)` index would make `character` an indexed predicate (`SORT_MERGE` over the official values, the predicate fully covered by the index, no post-fetch rejection). I kept the simpler index since I assume the fraction is negligible currently.
- Rate limiting uses a per-request `cost`: the route is `120/hour`, where a bounded (`limit`) page costs 1 and the full export costs 60 - so the old 2/hour ceiling for the full export is preserved.
- Legacy runs with a null `submitted_at` (older runs that predate upload-time tracking and haven't been stamped) sort as a leading block ordered by `_id`; the cursor handles that block explicitly, so a full bootstrap (no `start`/`end`) still returns them. A `start`/`end` window filters on `submitted_at` and therefore excludes them - intended for incremental sync, documented on the endpoint. Null volume doesn't degrade paging (the non-sparse index keys nulls too) - see Verification.
- Metrics: `spire_codex_run_exports_total` still counts unbounded dumps; new `spire_codex_run_export_pages_total` counts bounded pages.
- Validator (`runs_db_mongo.py`): the export's "a forward pager never misses a new run" guarantee holds only because every new run sorts at the end - i.e. is stamped with a real `submitted_at`. A run inserted without `submitted_at` sorts into the leading null block. So `_ensure_run_validator` attaches a `$jsonSchema` validator requiring `submitted_at` to be a BSON `date`. `validationLevel: "moderate"` means the validator gates new inserts but spares updates to pre-existing legacy docs with a null `submitted_at`. Note that a bulk re-import of untimestamped legacy runs would now be rejected. You should be able to get around it by adding `submitted_at` in the import or setting `validationAction: "warn"` for that window.
- Cursor: an opaque token encoding `submitted_at|_id`.

Verification
Tested manually against a local instance (Docker, in-memory slowapi), seeded with runs. To reproduce:
Results:
- Splitting the range into `[min, mid)` and `[mid, max]` yields disjoint windows whose counts sum to the whole (e.g. 146 + 147 = 293), and `[mid, mid)` is empty - the run sitting exactly on the boundary lands in exactly one window, never both or neither.
- The query plan for a page is `IXSCAN -> FETCH -> LIMIT`, no SORT stage (step 6).
- Null volume: the plan is a `SORT_MERGE` over two `IXSCAN`s, examining `docsExamined ~= 2x limit` (sub-ms) - bounded by page size, not null count; no collection scan, no in-memory sort.
- Invalid `cursor`/`start` -> 400, `limit` outside `[1, 50000]` -> 422, future-dated window -> empty 200. Validation runs before rate-limit consumption, so malformed requests don't burn budget.

Notes
- Pagination style. The rest of the API uses offset pagination (`page` + `limit`) and returns its metadata in the JSON body (`total`/`page`/`per_page`/`has_next`); this is the only endpoint using a keyset `cursor` and a data-bearing `X-Next-Cursor` response header. I diverged because: (1) the response is a gzipped JSONL stream, which has no JSON envelope to carry `next`/`has_next` without either wrapping (and breaking) the stream or violating the "no params = unchanged body" guarantee; and (2) offset pagination means a deep `skip`/`OFFSET` walks and discards that many index entries per page (O(offset)) and double-counts or skips under concurrent inserts, whereas keyset is O(page-size) and insert-stable, which is what a full-corpus export needs. Let me know if you want to change the approach.
- This PR makes the existing `{submitted_at: -1}` index redundant. `(submitted_at, _id)` ascending is a superset: it serves the keyset export and everything that uses `{submitted_at: -1}` today - the unfiltered newest-first runs list (`sort=date`), the admin "last submission" `find_one`, and the 24h-count range - because MongoDB serves a sort and its reverse from one index and range filters ignore direction. I kept the previous index in this PR so it's additive-only. You could have just one `(submitted_at: 1, _id: 1)` or `(submitted_at: -1, _id: -1)` index to serve current use cases and this PR. The `{username|character|user_id: 1, submitted_at: -1}` compounds have different prefixes and are unaffected. The one thing such an index can't serve is a mixed-direction sort like `submitted_at: 1, _id: -1`, which afaict nothing in the codebase needs currently.
- Assumption: `submitted_at` is stored as a BSON `Date` everywhere. This repo flags ETL type drift elsewhere (e.g. `win`/`was_abandoned` as 0/1 vs bool). If any legacy ETL'd docs hold `submitted_at` as a string, BSON type-bracketing would make the ordered scan and `$gt`/`$ne` comparisons silently skip them. I couldn't check against prod. I think you can check this with a count of `submitted_at` values grouped by BSON type. (Expect only `date` and `null`. A `string` bucket means we'd want a normalization pass first.)
- New runs must carry `submitted_at`. The live submit path always fills `submitted_at`, so new runs get added at the end. If a future bug or query ever inserted runs without `submitted_at`, they'd be silently missed by in-progress paginated scans - nulls sort first, so they land behind any pager that has advanced past the null block (and the `start`/`end` window excludes nulls entirely). They would only be picked up if you rescanned the null block from the beginning. To mitigate this:
  - The `$jsonSchema` validator (see "Implementation") makes a buggy null/missing insert a loud failure, applied automatically on deploy. If you'd rather ease it in, set `validationAction: "warn"` to log violations without rejecting.
  - Alert if `count_documents({submitted_at: null})` trends up; under the invariant it should be flat or shrinking. A second signal even with the validator on.
- Rate-limit `cost` keys on `limit` presence, so a windowed-but-unbounded pull (`start`/`end`, no `limit`) still costs 60. Defensible (it could return the whole corpus), and the guidance is simply "always pass `limit` for the cheap path."
- CORS: `X-Next-Cursor` isn't in `expose_headers`, so a browser `fetch` can't read it. The current consumer is a C# `HttpClient` (unaffected); if your frontend ever consumes the export, add `expose_headers=["X-Next-Cursor"]`.
- Tests (`backend/tests/test_export_helpers.py`). The repo had no test suite, so this adds `pytest` as a single dev dependency (`requirements-dev.txt` + a minimal `pytest.ini`); run with `pytest` from `backend/`. The added tests are pure-function unit tests - no database, no app, no CI needed (the router imports `_get_collection` lazily, so the helpers import clean): cursor encode/decode round-trip (including the null-`submitted_at` block) and malformed-cursor rejection (a real base64 error and a valid-base64-but-no-separator token, both -> 400), `_build_match` keyset/range clause construction (no-params, half-open window, null-block continuation, past-nulls continuation), ISO parsing + its 400, and the rate-limit `cost`. For more test coverage we need a Mongo-backed test fixture, which this PR does not introduce. That would allow endpoint behavior tests, e.g. paginate == full export and range partitioning.
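To make the cursor round-trip and malformed-cursor cases concrete, here is a minimal sketch of a base64 `submitted_at|_id` codec. The names `encode_cursor` and `decode_cursor`, the use of URL-safe base64, and the choice of an empty timestamp to mark the null block are all assumptions for illustration, not the PR's actual helpers; in the route, the `ValueError` raised here would be what gets mapped to a 400.

```python
import base64
from datetime import datetime, timezone
from typing import Optional, Tuple

SEP = "|"  # separator between the timestamp and the _id (assumed layout)


def encode_cursor(submitted_at: Optional[datetime], run_id: str) -> str:
    """Pack (submitted_at, _id) into an opaque token; empty timestamp = null block."""
    ts = "" if submitted_at is None else submitted_at.isoformat()
    raw = f"{ts}{SEP}{run_id}".encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(token: str) -> Tuple[Optional[datetime], str]:
    """Unpack a token; every failure mode surfaces as ValueError."""
    try:
        # binascii.Error, UnicodeEncodeError and UnicodeDecodeError are all
        # ValueError subclasses, so one except clause covers bad base64 and
        # bad text alike.
        raw = base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
    except ValueError as exc:
        raise ValueError("malformed cursor") from exc

    ts, sep, run_id = raw.partition(SEP)
    if not sep or not run_id:
        # Valid base64, but no separator (or no _id) inside.
        raise ValueError("malformed cursor")

    # datetime.fromisoformat raises ValueError on a bad timestamp.
    return (datetime.fromisoformat(ts) if ts else None), run_id


if __name__ == "__main__":
    stamp = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
    token = encode_cursor(stamp, "abc123")
    print(token, decode_cursor(token))
```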