Skip to content

Make df.list_instances() page-size cap a loud, tunable error (#146)#280

Open
crprashant wants to merge 1 commit into
microsoft:mainfrom
crprashant:crprashant/list-instances-max-limit-guc
Open

Make df.list_instances() page-size cap a loud, tunable error (#146)#280
crprashant wants to merge 1 commit into
microsoft:mainfrom
crprashant:crprashant/list-instances-max-limit-guc

Conversation

@crprashant

Copy link
Copy Markdown
Contributor

What

Replace the silent page-size clamp in df.list_instances() with a loud, tunable error.

Both overloads of df.list_instances() previously truncated limit_count silently to a fixed ceiling of 10000. A caller asking for more rows got a short page back with no signal that truncation happened — indistinguishable from "no more rows". This PR replaces that clamp with a new GUC and a fail-fast guard:

  • New GUC pg_durable.list_instances_max_limitint, default 1000, range 11000000, SUSET context.
  • A call with limit_count above the cap now raises an error that names the GUC and points the caller at keyset pagination, instead of silently returning fewer rows.
-- default cap is 1000
SELECT * FROM df.list_instances(NULL, 5000);
-- ERROR: df.list_instances: limit_count (5000) exceeds
--        pg_durable.list_instances_max_limit (1000); request fewer rows,
--        or use the paginated df.list_instances overload
--        (after_cursor/next_cursor) for large result sets

-- a superuser can raise the cap at runtime, no restart:
SET pg_durable.list_instances_max_limit = 5000;

Both the basic 2-arg overload and the paginated 4-arg overload (PR #278) share one guard, enforce_list_instances_limit(). df.instance_executions() intentionally keeps its own fixed min(10000) clamp — it reports a single instance's execution history, a different surface with a naturally small result set.

Files: src/lib.rs (GUC), src/monitoring.rs (shared guard + both call sites), docs, e2e.

Why

This is PR 5 of the agreed, incremental monitoring-functions plan on #167, and it lands pinodeca's P0 from #146: a silently short page breaks external clients that page by limit_count, because a truncated page looks identical to the last page. PR #278 already shipped keyset pagination (after_cursor/next_cursor) as the correct path for large result sets; this PR makes an over-cap request fail fast and steer the caller there, rather than returning misleading data.

Design note — SUSET context

The cap is read on the df.list_instances() query path, so a superuser must be able to tune it per session at runtime — hence SUSET rather than Postmaster. SUSET is also the guardrail: by default an ordinary caller cannot raise the cap to bypass it (a superuser can delegate that with GRANT SET ON PARAMETER pg_durable.list_instances_max_limit). The parameter is not SUPERUSER_ONLY, so any caller can still SHOW it to see why a large request errored.

Upgrade & compatibility

  • No SQL/schema change. The GUC is registered in _PG_init at the C level — there is no sql/ upgrade script, no .control/Cargo version bump, and no Scenario A snapshot impact.
  • Scenario B1 (binary backward compat). The new .so works against all prior df schemas. The guard runs before any SQL is issued and reads no catalog or table state — it only compares the caller's limit_count against the in-memory GUC — so it is correct against a pre-0.2.4 schema that has not run ALTER EXTENSION UPDATE. The only visible difference against an old schema is the intended one: a large limit_count now errors instead of being silently capped at 10000.
  • This follows the runtime-only precedent of pg_durable.enable_superuser_instances. test-upgrade is not required (no schema change); it is documented in docs/upgrade-testing.md alongside that precedent.

Behavior change (flagged): the default ceiling drops from 10000 to 1000, and over-cap requests now error instead of truncating. Recorded in CHANGELOG.md under the unreleased 0.2.4 changeset. 0.2.4 is still unreleased, so there is no version bump.

Testing

Full CONTRIBUTING pre-PR workflow, all green:

  • cargo fmt -p pg_durable -- --check — clean
  • cargo build --features pg17 / cargo clippy --features pg17 — clean, no warnings
  • scripts/test-unit.sh194 passed, 0 failed
  • scripts/test-e2e-local.sh38/38. tests/e2e/sql/05_monitoring_and_explain.sql adds a three-part block (run as superuser after RESET SESSION AUTHORIZATION): (1) default cap raises on both overloads and the message names the GUC, with the at-cap page (1000) provably allowed; (2) lowering the cap to 3 allows the at-limit page and raises on 4 echoing (3) for both overloads; (3) a non-superuser set_config(...) on the GUC raises insufficient_privilege (asserted on SQLSTATE 42501).

This branch is based on the latest main (includes #277/#278).

Scope

Truncation-policy GUC only. df.instance_executions() keeps its own fixed clamp (out of this PR's minimal scope; give it its own bound as a follow-up if per-instance history ever needs one).

cc @pinodeca

…ft#146)

Replace the silent limit_count.min(10000) clamp in both df.list_instances() overloads with a loud error when limit_count exceeds the new pg_durable.list_instances_max_limit GUC (SUSET, default 1000, range 1..1000000). An over-cap request now fails fast and points the caller at keyset pagination (after_cursor/next_cursor) instead of returning a silently short page that is indistinguishable from 'no more rows'.

The cap is superuser-settable at runtime without a restart; by default an ordinary caller cannot raise it, so the guardrail holds. df.instance_executions keeps its own fixed min(10000) clamp (a different per-instance-history surface).

No SQL/schema change: the GUC is registered in _PG_init at the C level, so there is no upgrade script and binary backward-compatibility holds trivially against all prior df schemas.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant