Skip to content

[fm] Guard support bundle creation with SitrepGuardedInsert.#10535

Open
mergeconflict wants to merge 21 commits into
mainfrom
mergeconflict/fm-sitrepguardedinsert-support-bundles
Open

[fm] Guard support bundle creation with SitrepGuardedInsert.#10535
mergeconflict wants to merge 21 commits into
mainfrom
mergeconflict/fm-sitrepguardedinsert-support-bundles

Conversation

@mergeconflict

@mergeconflict mergeconflict commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Wire the support bundle resource through SitrepGuardedInsert:

  • impl SitrepGuardedResource for SupportBundle;
  • schema: support_bundle_generation on fm_sitrep and the rendezvous_support_bundle_created marker table (migration fm-bundle-resource-deletion);
  • support_bundle_create's FM path routes through the combinator, surfacing a stale sitrep as Error::Conflict;
  • SitrepBuilder tracks support_bundle_generation, bumping it when the outstanding support-bundle-request set changes; the closed-case carry-forward filter drops fully-satisfied closed cases and keeps those with unsatisfied support bundle requests;
  • fm_rendezvous reads the expected generation and aborts the support bundle loop on a stale mismatch; fm_analysis loads existing markers to drive carry-forward; omdb displays the new status fields and generation.

support_bundle_generation and alert_generation are tracked and guarded independently, so a stale generation for one resource aborts only that resource's rendezvous loop.

Context: #10248, builds on #10564 and #10533.

@mergeconflict mergeconflict self-assigned this Jun 2, 2026
@mergeconflict mergeconflict added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Jun 2, 2026
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from e15c2c1 to e5d67e2 Compare June 2, 2026 19:32
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from 1bc2f60 to 62cc3dd Compare June 2, 2026 19:32
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from e5d67e2 to 809d7d3 Compare June 2, 2026 19:50
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from 62cc3dd to 3d5af68 Compare June 2, 2026 19:50
@AlejandroME AlejandroME added this to the 21 milestone Jun 4, 2026
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from 809d7d3 to 7128ff2 Compare June 4, 2026 20:56
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch 2 times, most recently from a1beef9 to e97748d Compare June 5, 2026 16:38
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from 7128ff2 to 1574845 Compare June 5, 2026 16:38
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from e97748d to f3832d4 Compare June 8, 2026 16:31
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from 1574845 to 4d0f1dd Compare June 8, 2026 16:31
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from f3832d4 to cbc506b Compare June 8, 2026 16:41
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch 2 times, most recently from 85edf6b to a3ab6c2 Compare June 8, 2026 16:52
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from cbc506b to 59680b0 Compare June 8, 2026 16:52
Add the `SitrepGuardedInsert` Diesel combinator and the
`SitrepGuardedResource` trait: a generic primitive for FM rendezvous to
insert a resource row idempotently and guarded against stale-sitrep
execution.

The combinator wraps a caller-supplied resource INSERT in a single CTE
statement that:

  - aborts (StaleSitrep) unless the executor's expected generation still
    equals the latest sitrep's generation column;
  - short-circuits (AlreadyExists) if a creation marker already exists for
    the resource id;
  - on a successful insert, atomically writes a creation marker, gated by
    `WHERE EXISTS (SELECT 1 FROM new_resource)` so a marker is never
    fabricated for a pre-existing row.

All spliced SQL identifiers come from the trait's `&'static str` consts, so
the query is injection-safe. The result is surfaced as a
`SitrepGuardedInsertOutcome` of Created / AlreadyExists / StaleSitrep.
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from 59680b0 to b6f5bfe Compare June 8, 2026 17:57
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from a3ab6c2 to 26cbbe1 Compare June 8, 2026 17:57
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from b6f5bfe to a7d6f65 Compare June 8, 2026 20:08
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from 26cbbe1 to da05cf5 Compare June 8, 2026 20:08
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from a7d6f65 to aee00ec Compare June 8, 2026 20:12
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from da05cf5 to 064090e Compare June 8, 2026 20:12
Wire the alert resource through `SitrepGuardedInsert`:

  - `impl SitrepGuardedResource for Alert`;
  - schema: `alert_generation` on `fm_sitrep` and the
    `rendezvous_alert_created` marker table (migration
    fm-alert-resource-deletion);
  - `alert_create`'s FM path routes through the combinator, surfacing a
    stale sitrep as a monomorphic `Error::Conflict`;
  - `SitrepBuilder` tracks `alert_generation`, bumping it when the
    outstanding alert-request set changes; the closed-case carry-forward
    filter drops fully-satisfied closed cases;
  - fm_rendezvous reads the expected generation and aborts the alert loop
    on a stale mismatch; fm_analysis loads existing markers to drive
    carry-forward; omdb displays the new status fields and generation.
The support-bundle mirror of the preceding alert change: `impl
SitrepGuardedResource for SupportBundle`, the `support_bundle_generation`
column and `rendezvous_support_bundle_created` marker table (migration
fm-bundle-resource-deletion), `support_bundle_create`'s FM path routed
through the combinator inside its transaction, support-bundle generation
tracking and carry-forward, the fm_rendezvous bundle loop, the fm_analysis
bundle-marker lookup, and the omdb display.

`support_bundle_generation` and `alert_generation` are tracked and guarded
independently, so a stale generation for one resource aborts only that
resource's rendezvous loop.
hawkw pushed a commit that referenced this pull request Jun 10, 2026
Add the `SitrepGuardedInsert` Diesel combinator and the
`SitrepGuardedResource` trait: a generic primitive for FM rendezvous to
insert a resource row idempotently and guarded against stale-sitrep
execution.

The combinator wraps a caller-supplied resource INSERT in a single CTE
statement that:

- aborts (StaleSitrep) unless the executor's expected generation still
equals the latest sitrep's generation column;
- short-circuits (AlreadyExists) if a creation marker already exists for
the resource id;
  - on a successful insert, atomically writes a creation marker.

The result is surfaced as a `SitrepGuardedInsertOutcome` of `Created` /
`AlreadyExists` / `StaleSitrep`.

Context: #10248. This PR
was previously #10532 but I made a mess of it. This is used in
#10533 and
#10535 which are split out
in hopes of making the review somewhat less miserable.
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from aee00ec to aab93e8 Compare June 10, 2026 21:35
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-support-bundles branch from 064090e to 5f46981 Compare June 10, 2026 21:35
Rename first_sitrep_with_alert_request_bumps_generation ->
first_sitrep_with_alert_request_bumps_alert_generation so the alert test
reads clearly alongside the support bundle generation tests added later.
@mergeconflict mergeconflict requested a review from smklein June 11, 2026 16:17
Base automatically changed from mergeconflict/fm-sitrepguardedinsert-alerts to main June 18, 2026 03:59
Comment thread nexus/db-model/src/schema_versions.rs Outdated
Comment on lines 128 to 144
/// For [`SupportBundleProvenance::Fm`] the bundle INSERT is gated by
/// [`SitrepGuardedInsert`], which also writes the
/// `rendezvous_support_bundle_created` marker atomically. The
/// transaction encloses the guarded insert and the data-selection
/// inserts, so a marker is never durable without its data-selection
/// rows. If a bundle with the given id already exists, returns
/// [`Error::ObjectAlreadyExists`]. If the latest sitrep's
/// `support_bundle_generation` has advanced past
/// `expected_support_bundle_generation`, returns [`Error::Conflict`]; the
/// caller should skip the request rather than retrying.
///
/// The free-dataset lookup runs before the generation guard, so a stale
/// rendezvous activation that also finds no free dataset reports
/// [`Error::insufficient_capacity`] rather than [`Error::Conflict`].
/// This only affects status reporting and log noise: the guard still
/// prevents a stale activation from inserting anything.
pub async fn support_bundle_create(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. similarly to the suggestion I left on #10533, I wonder if we might want to refactor this as separate support_bundle_create and fm_rendezvous_support_bundle_create methods, rather than passing in a "provenance" type that tells the method which behavior to do --- I can't really imagine having code that wants to be able to call support_bundle_create with either provenance dynamically; we're only doing one code path or the other, so it feels a bit nicer to me for them to just be separate methods.

with that said, I realize that (unlike alerts) the provenance enum was already here, so 🤷‍♀️

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a reason for keeping the switch-on-provenance thing here, but now I honestly can't remember what it was or whether it was a good reason. I'll think about this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I messed with this a bit and decided to stop messing with it. The direction it was going, I had three separate SupportBundleCreateParams structs: one for support_bundle_create, another for fm_rendezvous_support_bundle_create, and a third for their shared helper function (which had some super fun trait bounds for the insert query closure)... It was getting ugly, and then my efforts to de-uglify all pointed in the direction of "hey wouldn't it be nice to just switch on provenance."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The direction it was going, I had three separate SupportBundleCreateParams structs: one for support_bundle_create, another for fm_rendezvous_support_bundle_create, and a third for their shared helper function (which had some super fun trait bounds for the insert query closure)

Maybe there's an obvious reason this won't work that I'm missing, but why can't we just have something like this:

pub async fn support_bundle_create(&self, params: SupportBundleCreateParams) -> ...
pub async fn fm_support_bundle_create(
    &self,
    params: SupportBundleCreateParams,
    case_id: CaseUuid,
    id: SupportBundleUuid,
    expected_generation: Generation,
) -> ...

rather than having three params types? I wouldn't necessarily expect that there has to be a shared helper function if the code is different enough (but there can be if and only if it makes sense).

With that all said, if you don't feel like this is better, it's fine to do the provenance thing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that all said, if you don't feel like this is better, it's fine to do the provenance thing.

I mean, I don't know, It's definitely bit better in some respects, but I don't know how much better. I've had Claude help me throw together a strawman draft PR, #10665, to show what it might look like. The basic issue I keep beating my head against it is that support_bundle_create and fm_rendezvous_support_bundle_create are structurally very similar -- in #10665 they're basically copy-pastes -- but my various attempts to refactor for sharing all ended up being gross and overbuilt. For example, my most recent attempt involved a policy trait comprising a function for "do the fucking insert" and an associated type for error returns; there was a lot of boilerplate there.

What's your feeling?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mergeconflict what say you about #10667? (note: if you merge this there will be some additional cleanup to actually make rendezvous use the new thing, remove all the Provenances, et cetera)

Comment thread nexus/db-queries/src/db/datastore/support_bundle.rs Outdated
Comment thread nexus/db-queries/src/db/sitrep_guard.rs Outdated
Comment thread nexus/fm/src/analysis_input.rs Outdated
Comment thread nexus/fm/src/builder.rs Outdated
Comment thread nexus/fm/src/builder.rs Outdated
Comment on lines +303 to +415
/// Verifies that the bump is relative to the parent sitrep's generation,
/// not restarted from a fixed initial value.
#[test]
fn child_sitrep_with_new_alert_bumps_alert_generation() {
fn child_sitrep_with_both_new_requests_bumps_both_generations() {
// Verifies that the bump is relative to the parent sitrep's generation,
// not restarted from a fixed initial value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why did the comment move?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like the comment has still moved?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're looking at an older commit for some reason, this was fixed in 06f71f5

Comment thread nexus/fm/src/builder.rs Outdated
Comment on lines +325 to +444
/// A closed case with an outstanding alert request becomes "satisfied" via
/// marker presence, carry-forward drops it, and alert_generation bumps even
/// though no open-case builder mutations happened.
#[test]
fn carry_forward_drop_bumps_alert_generation() {
// A closed case with an outstanding alert request becomes "satisfied"
// via marker presence, carry-forward drops it, alert_generation bumps
// even though no open-case builder mutations happened.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why did the comment move?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment has still moved, can we move it back please?

Comment thread nexus/fm/src/builder.rs Outdated
Comment thread nexus/fm/src/builder.rs Outdated
@mergeconflict mergeconflict requested a review from hawkw June 18, 2026 23:37
Comment thread nexus/db-queries/src/db/datastore/support_bundle.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/support_bundle.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/support_bundle.rs Outdated
Comment on lines +2432 to +2437
let query = bundle_marker_dsl::rendezvous_support_bundle_created
.filter(bundle_marker_dsl::support_bundle_id.eq_any(candidates))
.select(bundle_marker_dsl::support_bundle_id);

let explanation = query
.explain_async(&conn)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test feels a bit weird to me: we are constructing a query in the test and calling explain_async on it. This is intended to be an EXPLAIN test for the query used in the fm_rendezvous_existing_support_bundle_markers method, but it's not EXPLAINing the actual query used by that method. At present, the test is constructing the same query, but it could drift. In general, when we write these EXPLAIN tests, we do it by having a fn foo_query(args) -> impl RunnableQuery<Whatever> that both the actual query method and the EXPLAIN test both call to get the query. That way, if the query changes, the EXPLAIN test is still EXPLAINing the actual query we're using in production. Can we do that here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to delete this test (in 8b556e9) rather than fix it, actually, along with the alerts equivalent. I don't think we need an EXPLAIN test for this just to demonstrate that there's no full table scan, which would be caught by pretty much any other test that runs the query. We already have a good expectorate test for the query shape (see sitrep_guarded_insert.sql).

Comment thread nexus/fm/src/diagnosis/physical_disk.rs Outdated
Comment thread nexus/fm/src/builder.rs Outdated
Comment on lines +325 to +444
/// A closed case with an outstanding alert request becomes "satisfied" via
/// marker presence, carry-forward drops it, and alert_generation bumps even
/// though no open-case builder mutations happened.
#[test]
fn carry_forward_drop_bumps_alert_generation() {
// A closed case with an outstanding alert request becomes "satisfied"
// via marker presence, carry-forward drops it, alert_generation bumps
// even though no open-case builder mutations happened.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment has still moved, can we move it back please?

Comment thread nexus/fm/src/builder.rs Outdated
Comment thread nexus/fm/src/builder.rs Outdated
Comment on lines +524 to +527
use nexus_types::fm::case::SupportBundleRequest;
use nexus_types::support_bundle::BundleDataSelection;
use omicron_uuid_kinds::SupportBundleUuid;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: some of these imports could probably be factored out?

Comment thread nexus/src/app/background/tasks/fm_rendezvous.rs
Comment thread nexus/types/src/fm/analysis_reports.rs Outdated

@hawkw hawkw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool i feel pretty good about this

>,
>,
)
-> BoxFuture<'a, Result<SupportBundle, DbError>>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i'm glad this works that's alot less gross

Comment on lines +443 to +446
Err(
e @ (FmSupportBundleCreateError::TooManyBundles
| FmSupportBundleCreateError::Other(_)),
) => {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally i feel like the exhaustive matching here feels unnecessary, but 🤷‍♀️

@mergeconflict mergeconflict enabled auto-merge (squash) June 24, 2026 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants