diff --git a/CHANGELOG.md b/CHANGELOG.md index 445c1a2..4f786d9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,6 +15,7 @@ Pre-1.0 note: while `pg_durable` is in major version `0`, minor releases may inc - **Instance/node ID collision hardening (#129):** `df.start()` now reserves IDs with `INSERT ... ON CONFLICT DO NOTHING RETURNING id` and re-rolls the random 8-hex value on collision — instances arbitrate on the `df.instances` primary key (`id`), nodes on the new composite `PRIMARY KEY (instance_id, id)` — replacing the previous `SELECT EXISTS` pre-check. Doing the conflict check at the index level (rather than a pre-check `SELECT`) closes a TOCTOU window and, for instances, an RLS blind spot where the pre-check could not see another role's rows. `df.nodes` now uses the composite `PRIMARY KEY (instance_id, id)` instead of a global single-column key, so the random 8-hex node ID is no longer the sole cross-instance collision guard. The `update-node-status` activity now scopes its `df.nodes` update by `instance_id` (a required activity-input field) and asserts it affects exactly one row. IDs stay `VARCHAR(8)` HEX; the `0.2.3 → 0.2.4` upgrade restructures the `df.nodes` keys in place (#238). - **Breaking for in-flight work:** the new activity-input shape changes the string duroxide records in orchestration history, and duroxide validates activity inputs by exact equality on replay, so any instance left **in flight across the 0.2.3 → 0.2.4 binary upgrade** cannot complete. Drain or cancel in-flight instances before deploying 0.2.4. The in-place `df.nodes` key restructure also takes an `ACCESS EXCLUSIVE` lock whose duration scales with table size — run the upgrade in a maintenance window. See the #129 section of `docs/upgrade-testing.md` for the full drain-before-upgrade contract. - **`df.grant_usage()` / `df.revoke_usage()`:** dropped the explicit per-function `EXECUTE` allowlist. Schema `USAGE` on `df` is the real access gate for ordinary `df.*` functions, so the helpers now grant/revoke schema `USAGE`, the table privileges, and `EXECUTE` only on the sensitive functions (`df.http`, `df.grant_usage`, `df.revoke_usage`). Function signatures are unchanged and existing privileges are unaffected (#242). +- **`df.list_instances()` page-size cap is now a loud error (#146):** `df.list_instances()` previously truncated `limit_count` silently to a fixed ceiling of 10000. It now raises an error when `limit_count` exceeds the new `pg_durable.list_instances_max_limit` GUC (`SUSET` context, default `1000`, range `1`–`1000000`), so an over-cap request fails fast instead of returning a silently short page that is indistinguishable from "no more rows". Both the basic and paginated overloads enforce the cap; clients needing more rows should lower `limit_count` or use the paginated overload (`after_cursor`/`next_cursor`). A superuser can change the cap at runtime without a restart; by default an ordinary caller cannot. ### Removed diff --git a/USER_GUIDE.md b/USER_GUIDE.md index a2b8c2c..4ce1b9b 100644 --- a/USER_GUIDE.md +++ b/USER_GUIDE.md @@ -1466,6 +1466,8 @@ Filters (`status_filter`, `label_filter`) are sticky across pages — keep passi > **Note:** `next_cursor` advances over `df.instances` independently of the per-row execution-metadata lookup. In a brief start-up window a freshly-submitted instance can appear in `df.instances` before its execution metadata is queryable and is omitted from that page; in the rare case where *every* row of a non-final page is omitted, the page returns zero rows (so `next_cursor` can't be read) — retry shortly. +> **Page-size limit:** `limit_count` is capped by the `pg_durable.list_instances_max_limit` GUC (default `1000`). A request for more rows than the cap raises an error instead of being silently truncated — lower `limit_count` or use the paginated overload (`after_cursor`/`next_cursor`) for larger result sets. A superuser can raise the cap at runtime (`ALTER SYSTEM SET pg_durable.list_instances_max_limit = …; SELECT pg_reload_conf();`); by default ordinary callers cannot change it. See [docs/api-reference.md](docs/api-reference.md#pg_durablelist_instances_max_limit). + ### Instance Details ```sql @@ -1939,6 +1941,8 @@ pg_durable.max_user_connections = 10 pg_durable.execution_acquire_timeout = 30 ``` +> **Other GUCs:** `pg_durable.list_instances_max_limit` (SUSET context, default `1000`) caps the per-call page size of `df.list_instances()`. Unlike the connection-limit GUCs above, it is superuser-settable at runtime (no restart) and is not loaded from `postgresql.conf` at startup only. See [docs/api-reference.md](docs/api-reference.md#pg_durablelist_instances_max_limit). + ### Connection Budget Formula To calculate the total connections pg_durable will use: diff --git a/docs/api-reference.md b/docs/api-reference.md index d90671a..332686c 100644 --- a/docs/api-reference.md +++ b/docs/api-reference.md @@ -351,7 +351,7 @@ The two overloads have non-overlapping arities (basic matches 0–2 arguments, p | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `status_filter` | TEXT | `NULL` (basic only) | Only instances with this status (lowercase: `pending`, `running`, `completed`, `failed`, `cancelled`). `NULL` = any. | -| `limit_count` | INTEGER | `100` (basic only) | Max rows per page (must be ≥ 1; capped at 10000). | +| `limit_count` | INTEGER | `100` (basic only) | Max rows per page (must be ≥ 1). A request above `pg_durable.list_instances_max_limit` (default 1000) raises an error instead of being silently truncated — lower `limit_count`, or use the paginated overload (`after_cursor`) for larger result sets. | | `label_filter` | TEXT | — (required to select the paginated overload) | Only instances whose label equals this value (issue #87). `NULL` = any. | | `after_cursor` | TEXT | `NULL` | Opaque keyset cursor from a prior page's `next_cursor`; returns the page that sorts strictly after it (issue #146). `NULL` = first page. | @@ -607,3 +607,31 @@ SHOW pg_durable.enable_superuser_instances; **Security note:** Setting this GUC to `on` in a multi-tenant environment allows any role with `BYPASSRLS` to forge `submitted_by` to a superuser OID and execute arbitrary SQL as superuser. Keep `off` unless you have a specific need and understand the risk. See [docs/superuser_guc.md](superuser_guc.md) for the full threat analysis. +--- + +### pg_durable.list_instances_max_limit + +Maximum number of rows `df.list_instances()` returns in a single call. A request for more rows than this raises an error instead of silently truncating the result, so external clients paginate explicitly (via `after_cursor`/`next_cursor`) rather than relying on a silent cap. + +| Property | Value | +|----------|-------| +| Type | `integer` | +| Default | `1000` | +| Range | `1` – `1000000` | +| Context | `SUSET` (superuser can change at runtime; no restart needed) | + +Both `df.list_instances()` overloads (basic and paginated) enforce this cap. By default an ordinary (non-superuser) caller cannot raise it, so the guardrail holds from a user session — a superuser may delegate that ability with `GRANT SET ON PARAMETER pg_durable.list_instances_max_limit TO `, but without that grant it stays superuser-settable only. + +> **Sizing note:** `df.list_instances()` materializes up to `limit_count` rows per call, so raise the cap only as high as a single response should reasonably hold. For very large exports, prefer paging with `after_cursor`/`next_cursor` over one huge page rather than setting the cap near its maximum. + +```sql +-- Inspect the current cap +SHOW pg_durable.list_instances_max_limit; + +-- Raise it for an admin reporting workload (requires superuser) +ALTER SYSTEM SET pg_durable.list_instances_max_limit = 5000; +SELECT pg_reload_conf(); +``` + +> **Behavior change (v0.2.4):** prior to v0.2.4, `df.list_instances()` silently truncated `limit_count` to 10000. It now raises an error when `limit_count` exceeds this GUC (default 1000). Callers that previously requested very large pages should lower `limit_count` or use the paginated overload (`after_cursor`/`next_cursor`). + diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md index 8510bfe..e690034 100644 --- a/docs/upgrade-testing.md +++ b/docs/upgrade-testing.md @@ -254,6 +254,15 @@ what the upgrade script handles, and any backward compatibility considerations. - **Scenario B2 considerations:** No data migration. The `CREATE FUNCTION` adds catalog metadata only; `df.instances` rows are untouched. The new `created_at`/`completed_at` result columns are read from columns that already exist and are already populated on every prior install. - **Dependent-object note:** Because the upgrade only adds a function (no `DROP FUNCTION`), no customer-owned object that depends on the existing two-argument `df.list_instances` is affected — there is nothing to drop or repoint. +#### `pg_durable.list_instances_max_limit` GUC — page-size cap is now a loud error (issue #146) +- **DDL change:** None. The cap is enforced entirely in the `.so`: a new `pg_durable.list_instances_max_limit` GUC (`SUSET` context, default `1000`, range `1`–`1000000`) is registered in `_PG_init` (`src/lib.rs`) and read on the `df.list_instances()` query path (`src/monitoring.rs`). There are no SQL function, table, or index changes, so this change adds no upgrade-script DDL and has no Scenario A snapshot impact. +- **Behavior change:** Both `df.list_instances()` overloads previously truncated `limit_count` silently to a fixed 10000. They now raise an error when `limit_count` exceeds the GUC (default `1000`), so an over-cap request fails fast instead of returning a silently short page. This is a runtime behavior change to a function that shipped in 0.2.2/0.2.3; it is recorded in `CHANGELOG.md` under the unreleased 0.2.4 changeset. +- **Scenario A considerations:** No schema changes — the `df` schema equivalence contract is unchanged. +- **Scenario B1 considerations:** The new `.so` works against all previous schemas. The guard runs before any SQL is issued and reads no catalog or table state — it only compares the caller's `limit_count` against an in-memory GUC value — so it is correct against a pre-0.2.4 schema that has not run `ALTER EXTENSION UPDATE`. The only visible difference against an old schema is the intended one: a large `limit_count` now errors instead of being silently capped at 10000. +- **Scenario B2 considerations:** No data migration. The GUC has no effect on schema shape or existing data. + +This follows the runtime-only precedent of `pg_durable.enable_superuser_instances` below (DDL change: None, enforced entirely in the `.so`). + ### v0.2.2 → v0.2.3 #### Rename duroxide provider schema to `_duroxide` for fresh installs diff --git a/src/lib.rs b/src/lib.rs index 14c6f8c..482e7e1 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -29,6 +29,16 @@ pub static EXECUTION_ACQUIRE_TIMEOUT: GucSetting = GucSetting::::new(3 /// functions are explicitly desired. See docs/superuser_guc.md. pub static ENABLE_SUPERUSER_INSTANCES: GucSetting = GucSetting::::new(false); +/// Maximum page size (`limit_count`) accepted by `df.list_instances()`. A call +/// requesting more rows than this raises an error instead of silently truncating +/// the result, steering external clients toward keyset pagination +/// (`after_cursor`/`next_cursor`) for large result sets (issue #146). Superuser- +/// settable (Suset) so it can be tuned at runtime without a restart. +/// +/// NOTE: the default (`1000`) and max (`1_000_000`) below are also documented in +/// `docs/api-reference.md` and `USER_GUIDE.md`; update those mirrors if you change them. +pub static LIST_INSTANCES_MAX_LIMIT: GucSetting = GucSetting::::new(1000); + // Module declarations pub mod activities; pub mod client; @@ -136,6 +146,23 @@ pub extern "C-unwind" fn _PG_init() { GucFlags::SUPERUSER_ONLY, ); + // First Suset-context GUC in this extension (the connection-limit and + // enable_superuser_instances GUCs above are Postmaster). Suset is deliberate: + // this guardrail is read on the df.list_instances() query path, so a superuser + // must be able to tune it per-session at runtime. Its long description is + // intentionally populated (unlike the c"" connection GUCs) because the + // raise-instead-of-truncate behavior is not obvious from the name alone. + GucRegistry::define_int_guc( + c"pg_durable.list_instances_max_limit", + c"Maximum number of rows df.list_instances() returns in a single call before raising an error", + c"A call to df.list_instances() with limit_count above this value raises an error instead of silently truncating the result. Clients needing more rows should use the paginated df.list_instances overload (after_cursor/next_cursor). Superusers can change this at runtime without a restart; by default ordinary callers cannot raise it.", + &LIST_INSTANCES_MAX_LIMIT, + 1, + 1_000_000, + GucContext::Suset, + GucFlags::default(), + ); + worker::register_background_worker(); } diff --git a/src/monitoring.rs b/src/monitoring.rs index bcb901f..a707c12 100644 --- a/src/monitoring.rs +++ b/src/monitoring.rs @@ -97,6 +97,29 @@ fn cursor_timestamp_well_formed(ts: &str) -> bool { }) } +/// Enforce the configurable upper bound on a `df.list_instances()` page size. +/// +/// Both `df.list_instances` overloads share this guard. A `limit_count` below 1 is +/// always rejected. When it exceeds `pg_durable.list_instances_max_limit` (default +/// 1000) the call raises a loud error rather than silently truncating to the cap: +/// a silently short page is indistinguishable from "no more rows" and breaks +/// external clients that page by `limit_count`. Such clients should use keyset +/// pagination (`after_cursor`/`next_cursor`) for large result sets (issue #146). +/// The cap is a superuser-settable (Suset) GUC, so it is tunable at runtime without +/// a restart and, by default, cannot be raised by an ordinary caller — the guardrail +/// holds for ordinary user sessions unless a superuser delegates SET on the parameter. +fn enforce_list_instances_limit(limit_count: i32) { + if limit_count < 1 { + pgrx::error!("limit_count must be at least 1"); + } + let max_limit = crate::LIST_INSTANCES_MAX_LIMIT.get(); + if limit_count > max_limit { + pgrx::error!( + "df.list_instances: limit_count ({limit_count}) exceeds pg_durable.list_instances_max_limit ({max_limit}); request fewer rows, or use the paginated df.list_instances overload (after_cursor/next_cursor) for large result sets" + ); + } +} + /// Batch-fetch (function_name, execution_count, output) for a set of instance ids /// from duroxide's published `.get_instance_info(TEXT)` SQL function. /// @@ -214,10 +237,7 @@ pub fn list_instances( name!(output, Option), ), > { - if limit_count < 1 { - pgrx::error!("limit_count must be at least 1"); - } - let limit_count = limit_count.min(10000); + enforce_list_instances_limit(limit_count); let pg_conn_str = postgres_connection_string(); let provider_schema = backend_duroxide_schema(); @@ -325,10 +345,7 @@ pub fn list_instances_paged( name!(next_cursor, Option), ), > { - if limit_count < 1 { - pgrx::error!("limit_count must be at least 1"); - } - let limit_count = limit_count.min(10000); + enforce_list_instances_limit(limit_count); // Decode the opaque keyset cursor up front. A malformed cursor is a client // error (the token must be passed back verbatim from a prior next_cursor), so @@ -636,6 +653,11 @@ pub fn instance_executions( if limit_count < 1 { pgrx::error!("limit_count must be at least 1"); } + // NOTE: df.instance_executions reports a single instance's execution history + // (keyed by instance_id) — a different surface from df.list_instances, with a + // naturally small result set. It intentionally keeps this fixed min(10000) + // clamp rather than the tunable pg_durable.list_instances_max_limit GUC + // (PR5 / #146); give it its own bound if per-instance history ever needs one. let limit_count = limit_count.min(10000); let pg_conn_str = postgres_connection_string(); diff --git a/tests/e2e/sql/05_monitoring_and_explain.sql b/tests/e2e/sql/05_monitoring_and_explain.sql index 62f2654..c68d391 100644 --- a/tests/e2e/sql/05_monitoring_and_explain.sql +++ b/tests/e2e/sql/05_monitoring_and_explain.sql @@ -520,4 +520,113 @@ END $fail$; DROP TABLE _fail_state; RESET SESSION AUTHORIZATION; + +-- ============================================================================ +-- list_instances max-limit GUC (pg_durable.list_instances_max_limit, PR5 / #146) +-- Runs as the postgres superuser (after RESET SESSION AUTHORIZATION) so it can +-- SET the Suset-context GUC. Verifies: (1) default cap raises a loud error on +-- both overloads instead of silently truncating, (2) the cap is tunable at +-- runtime, and (3) an ordinary (non-superuser) caller cannot raise it. +-- ============================================================================ + +-- (1) Default cap (1000): a request above the cap raises on BOTH overloads. +DO $lim_default$ +DECLARE + msg TEXT; +BEGIN + -- Basic overload (2 args). + BEGIN + PERFORM * FROM df.list_instances(NULL, 1001); + RAISE EXCEPTION 'TEST FAILED: list_instances(NULL, 1001) did not raise (expected over-limit error)'; + EXCEPTION WHEN OTHERS THEN + msg := SQLERRM; + IF strpos(msg, 'list_instances_max_limit') = 0 THEN + RAISE EXCEPTION 'TEST FAILED: basic overload error did not mention the GUC: %', msg; + END IF; + END; + + -- Paginated overload (3 args). + BEGIN + PERFORM * FROM df.list_instances(NULL, 1001, NULL); + RAISE EXCEPTION 'TEST FAILED: list_instances(NULL, 1001, NULL) did not raise (expected over-limit error)'; + EXCEPTION WHEN OTHERS THEN + msg := SQLERRM; + IF strpos(msg, 'list_instances_max_limit') = 0 THEN + RAISE EXCEPTION 'TEST FAILED: paginated overload error did not mention the GUC: %', msg; + END IF; + END; + + -- At the default cap (1000): must NOT raise on either overload. This pins the + -- boundary to "> cap" (not ">= cap") so the at-limit page is provably allowed. + PERFORM * FROM df.list_instances(NULL, 1000); + PERFORM * FROM df.list_instances(NULL, 1000, NULL); + + RAISE NOTICE 'TEST PASSED: default list_instances_max_limit raises over-limit error'; +END $lim_default$; + +-- (2) The cap is tunable at runtime: lower it, confirm at-limit passes and +-- over-limit raises with the new ceiling echoed in the message. +SET pg_durable.list_instances_max_limit = 3; +DO $lim_tunable$ +DECLARE + msg TEXT; +BEGIN + -- At the limit: must NOT raise (basic and paginated overloads). + PERFORM * FROM df.list_instances(NULL, 3); + PERFORM * FROM df.list_instances(NULL, 3, NULL); + + -- Over the (lowered) limit: must raise and echo the configured ceiling (basic). + BEGIN + PERFORM * FROM df.list_instances(NULL, 4); + RAISE EXCEPTION 'TEST FAILED: list_instances(NULL, 4) did not raise with cap = 3'; + EXCEPTION WHEN OTHERS THEN + msg := SQLERRM; + IF strpos(msg, 'list_instances_max_limit (3)') = 0 THEN + RAISE EXCEPTION 'TEST FAILED: over-limit error did not echo the configured cap (3): %', msg; + END IF; + END; + + -- Same guard on the paginated overload: must raise and echo the same ceiling. + BEGIN + PERFORM * FROM df.list_instances(NULL, 4, NULL); + RAISE EXCEPTION 'TEST FAILED: list_instances(NULL, 4, NULL) did not raise with cap = 3'; + EXCEPTION WHEN OTHERS THEN + msg := SQLERRM; + IF strpos(msg, 'list_instances_max_limit (3)') = 0 THEN + RAISE EXCEPTION 'TEST FAILED: paginated over-limit error did not echo the configured cap (3): %', msg; + END IF; + END; + + RAISE NOTICE 'TEST PASSED: list_instances_max_limit is tunable at runtime'; +END $lim_tunable$; +RESET pg_durable.list_instances_max_limit; + +-- (3) Guardrail: a non-superuser cannot raise the cap (Suset context). +SET SESSION AUTHORIZATION df_e2e_user; +DO $lim_guard$ +DECLARE + allowed BOOLEAN := false; + sqlstate_caught TEXT; +BEGIN + BEGIN + PERFORM set_config('pg_durable.list_instances_max_limit', '1000000', false); + allowed := true; -- set_config returned without error => guardrail breached + EXCEPTION + WHEN insufficient_privilege THEN + NULL; -- expected: SQLSTATE 42501, permission denied to set parameter + WHEN OTHERS THEN + sqlstate_caught := SQLSTATE; + IF sqlstate_caught <> '42501' THEN + RAISE EXCEPTION 'TEST FAILED: expected insufficient_privilege (42501) when a non-superuser raises the GUC, got SQLSTATE %: %', sqlstate_caught, SQLERRM; + END IF; + END; + + IF allowed THEN + RAISE EXCEPTION 'TEST FAILED: non-superuser was allowed to raise list_instances_max_limit'; + END IF; + + RAISE NOTICE 'TEST PASSED: non-superuser cannot raise list_instances_max_limit'; +END $lim_guard$; +RESET SESSION AUTHORIZATION; + SELECT 'TEST PASSED' AS result;