diff --git a/docs/firecracker-snapshots.md b/docs/firecracker-snapshots.md
new file mode 100644
index 0000000..9e41904
--- /dev/null
+++ b/docs/firecracker-snapshots.md
@@ -0,0 +1,223 @@
+# RFC: Firecracker snapshots for instant bazel-diff starts
+
+**Status:** Draft
+**Audience:** bazel-diff maintainers / contributors
+**Scope decided:** CLI hooks in the Kotlin tool + a Go orchestration tool, capturing a
+*full warm Bazel server*, targeting *self-hosted CI* (we control the host kernel and CPU model).
+
+---
+
+## 1. Motivation
+
+bazel-diff's own JVM CLI starts in well under a second. That is not where the time goes. The
+canonical workflow ([`bazel-diff-example.sh`](../bazel-diff-example.sh)) is:
+
+1. `bazel run :bazel-diff` — build the tool
+2. `git checkout ` → `generate-hashes` → runs `bazel query deps(//...:all-targets)`
+3. `git checkout ` → `generate-hashes` → another `bazel query`
+4. `get-impacted-targets` — cheap, pure JSON diff
+
+The cost is the `bazel query` in steps 2/3, which forces:
+
+- **Bazel server startup** (JVM warmup).
+- **External-repo / bzlmod resolution + repository-cache fetch.** `BazelQueryService` even shells
+  out to `bazel mod dump_repo_mapping` and `bazel mod show_repo`
+  ([`BazelQueryService.kt`](../cli/src/main/kotlin/com/bazel_diff/bazel/BazelQueryService.kt)).
+- **Full Skyframe graph load + package analysis** for `deps(//...)`.
+
+On a large monorepo this is minutes per cold start. A Firecracker microVM snapshot lets us capture
+that warm state once and restore it in ~sub-second, so the PR-time path re-analyzes only the changed
+packages.
+
+**Goal:** instant starts of bazel-diff by restoring a microVM whose Bazel server already has the
+build graph loaded and external repos fetched.
+
+---
+
+## 2. Key architectural split
+
+The Firecracker record/restore itself is a **host-level concern** — it talks to the Firecracker REST
+API over a unix socket (optionally via `jailer`). It is *not* something the Kotlin CLI does.
+The work therefore splits into two pieces:
+
+| Piece | Where | Responsibility |
+| --- | --- | --- |
+| **CLI hooks** | Kotlin (`cli/`) | Make snapshots deterministic and *safe*: warm-then-signal, emit a cache key, bake base hashes. |
+| **Orchestration tool** | Go (`tools/firecracker/`) | Boot/warm/snapshot and restore/checkout/run the microVM via the Firecracker API. |
+
+Consume needs **no new bazel-diff command** — it is the existing `generate-hashes` +
+`get-impacted-targets` run against base hashes baked into the snapshot.
+
+---
+
+## 3. Lifecycle
+
+### Record (per base SHA — on merge to master, or nightly)
+
+```
+host: build read-only rootfs ──► boot Firecracker microVM (TAP net for fetch)
+      (bazel + JDK + git + bazel-diff binary + workspace @ baseSHA)
+  VM: bazel-diff warmup ──► bazel query deps(//...) loads Skyframe + fetches externals
+                        ──► writes /snap/base_hashes.json + /snap/fingerprint.json
+                        ──► exits 0 = "safe to snapshot"
+host: pause VM ──► snapshot {mem_file, vmstate} ──► freeze rootfs as backing image
+      store keyed by fingerprint + baseSHA
+```
+
+### Consume (per PR / target SHA — the hot path)
+
+```
+host: fingerprint(targetEnv) == snapshot.fingerprint? ── no ──► fall back to cold run
+        │ yes
+      restore microVM (COW overlay on disk, UFFD lazy memory load) ~sub-second
+  VM: git checkout ──► warm server does INCREMENTAL re-analysis of changed pkgs
+      bazel-diff generate-hashes (fast — server already warm)
+      bazel-diff get-impacted-targets -sh /snap/base_hashes.json -fh -o
+host: extract impacted targets ──► discard overlay
+```
+
+---
+
+## 4. New CLI surface
+
+Both new subcommands slot into the existing picocli `subcommands` list in
+[`BazelDiff.kt`](../cli/src/main/kotlin/com/bazel_diff/cli/BazelDiff.kt) alongside
+`GenerateHashesCommand` and `GetImpactedTargetsCommand`.
+
+### 4.1 `bazel-diff warmup`
+
+The record-side entrypoint.
+Effectively `generate-hashes` for the base revision, plus:
+
+- Writes base hashes to a known path (`--base-hashes`, default `/snap/base_hashes.json`).
+- Writes the fingerprint file (see §5).
+- Exits `0` **only** once the query has completed and the server is warm + quiesced. The host
+  watches for this clean exit as the "safe to snapshot" signal.
+
+Implementation reuses `GenerateHashesCommand`'s plumbing; warmup is essentially generate-hashes with
+metadata side-effects and a clear success contract.
+
+### 4.2 `bazel-diff fingerprint`
+
+Computes the snapshot **cache key** and writes it as JSON. Used both at record time (to tag the
+snapshot) and at consume time (to validate a candidate snapshot before trusting it). See §5.
+
+### 4.3 Consume
+
+No new command. The orchestrator runs the existing `generate-hashes` for the target revision, then
+`get-impacted-targets -sh /snap/base_hashes.json -fh `.
+
+---
+
+## 5. Correctness — the cache key and the fail-safe
+
+bazel-diff's core promise is *"an incorrect affected set is worse than none."* A restored snapshot
+must produce **the same answer as a cold run**. Two layers of defense:
+
+### 5.1 bazel-diff already re-hashes file content itself
+
+`SourceFileHasher` reads and hashes source file contents independently of the Bazel server. So
+*content* correctness does not depend on the warm server's incrementality — only the **graph
+structure / rule attributes** returned by `bazel query` do, and Bazel's incremental analysis is the
+trusted core there.
+
+### 5.2 The fingerprint (cache key)
+
+A snapshot is only safe to consume when the consuming environment matches the recording environment
+on everything that could change the graph. The fingerprint is a hash over:
+
+- **Bazel version** (already detected in `BazelQueryService.determineBazelVersion`).
+- **`MODULE.bazel.lock`** (bzlmod resolution state).
+- **`.bazelrc`** (and any imported rc files).
+- **bazel-diff version** (`VersionProvider`).
+- **The relevant flag set** — `--useCquery`, `cqueryCommandOptions`, `bazelCommandOptions`,
+  `startupOptions`, `--includeTargetType`, `--targetType`, fine-grained external-repo config, etc.
+  (anything that changes what `generate-hashes` queries or how it hashes).
+
+**Fail-safe rule:** any fingerprint mismatch → do **not** use the snapshot; fall back to a cold run.
+A stale snapshot is never silently trusted.
+
+### 5.3 CI canary
+
+Recommended: a periodic CI job that runs the *snapshot-consumed* result against a *cold* result for
+the same revision pair and asserts set equality. This builds and maintains empirical trust and catches
+any Bazel incremental-analysis edge case (env vars, repository-rule re-trigger conditions, untracked
+files) before it reaches users.
+
+---
+
+## 6. Firecracker specifics (self-hosted)
+
+Controlling the host makes several normally-hard issues tractable:
+
+- **CPU model pinning.** Snapshots only restore on a matching microarchitecture. Pin the CI instance
+  type or set a Firecracker CPU template. (This is exactly the constraint that the cloud-portability
+  option would have made painful — out of scope here.)
+- **Disk.** Read-only backing rootfs + a per-restore copy-on-write overlay so each consumed VM is
+  isolated and disposable.
+- **Memory.** Use diff snapshots + UFFD on-demand page loading; keep the mem file on local NVMe/tmpfs
+  for fast restore.
+- **Clock + network.** Resync the guest clock on resume (a known snapshot gotcha) and re-attach the
+  TAP device. To make consume fully offline, pre-bake full git history into the rootfs so
+  `git checkout ` needs no network.
+- **Isolation.** Run under `jailer` in CI.
+
+---
+
+## 7. Snapshot store layout
+
+Keyed by `fingerprint + baseSHA`:
+
+```
+///
+  mem_file          # guest memory image (diff snapshot)
+  vmstate           # Firecracker microVM state
+  rootfs.backing    # frozen read-only disk image
+  base_hashes.json  # produced by `bazel-diff warmup`
+  metadata.json     # fingerprint, baseSHA, bazel version, created-at, bazel-diff version
+```
+
+Consume resolves a snapshot by: matching fingerprint, then choosing a `baseSHA` that is an ancestor
+of the target SHA (git merge-base), preferring the nearest ancestor to minimize incremental
+re-analysis.
+
+---
+
+## 8. Orchestration tool (Go, `tools/firecracker/`)
+
+Go chosen for the official `firecracker-go-sdk`, clean API access, and a static binary for CI. UX
+mirrors `bazel-diff-example.sh` as the familiar entrypoint.
+
+```
+bazel-diff-snap record --workspace --base-sha --store [firecracker opts]
+bazel-diff-snap consume --workspace --target-sha --store --out
+```
+
+- `record`: build/prepare rootfs → boot VM → run `bazel-diff warmup` + `fingerprint` → pause →
+  snapshot → freeze rootfs → write store entry.
+- `consume`: compute `fingerprint` for target env → resolve compatible snapshot (else cold fall-back)
+  → restore (COW overlay, UFFD) → `git checkout` → `generate-hashes` → `get-impacted-targets` →
+  extract → tear down.
+
+---
+
+## 9. Phasing
+
+1. **CLI hooks** — `fingerprint` + `warmup` subcommands, pure Kotlin, fully unit-testable, no VM
+   required. Lands value independently (the fingerprint is useful for any snapshot/caching scheme).
+2. **Orchestration tool** — `tools/firecracker/` in Go: `record` / `consume`.
+3. **Correctness canary + docs** — snapshot-vs-cold equality check in CI; README section.
+
+---
+
+## 10. Open questions
+
+- **Quiescence detection.** How does `warmup` know the server is fully idle (no background Skyframe
+  work) before exit? Likely "query returned + process exited 0" is sufficient since the query is
+  synchronous, but worth validating.
+- **Snapshot freshness policy.** How far back can a base SHA be before incremental re-analysis stops
+  being worth it vs. cold? Needs measurement; drives record cadence (every merge vs. nightly).
+- **rootfs build pipeline.** Reuse an existing base image + inject workspace, or build per-record?
+  Affects record time and store size.
+- **Flag-set canonicalization.** Exact list of flags that must enter the fingerprint vs. those that
+  are snapshot-neutral — needs an explicit, reviewed allow/deny list to avoid both false mismatches
+  (wasted cold runs) and false matches (incorrect results).
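
To make the flag-set canonicalization question and the §5.2 fail-safe rule concrete, here is a minimal Go sketch of one way the orchestrator side could compute and compare fingerprints. Everything in it is an assumption for illustration: the `FingerprintInput` type, its field names, the `Fingerprint` and `SnapshotUsable` functions, and the choice of SHA-256 over a length-prefixed encoding are hypothetical and are not part of bazel-diff or the proposed CLI. The sketch only shows that sorting flag names before hashing makes the key independent of flag order, and that any mismatch (or a missing fingerprint) should route to a cold run.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// FingerprintInput is a hypothetical container for the inputs listed in §5.2.
// Field names are illustrative only.
type FingerprintInput struct {
	BazelVersion     string
	BazelDiffVersion string
	ModuleLockDigest string            // digest of MODULE.bazel.lock
	BazelrcDigest    string            // digest of .bazelrc plus imported rc files
	Flags            map[string]string // only flags on the reviewed allow list
}

// Fingerprint hashes the inputs in a canonical order. Each key and value is
// length-prefixed so that distinct inputs cannot collapse to the same bytes.
func Fingerprint(in FingerprintInput) string {
	h := sha256.New()
	put := func(key, value string) {
		fmt.Fprintf(h, "%d:%s=%d:%s\n", len(key), key, len(value), value)
	}
	put("bazel", in.BazelVersion)
	put("bazel-diff", in.BazelDiffVersion)
	put("module-lock", in.ModuleLockDigest)
	put("bazelrc", in.BazelrcDigest)

	// Go map iteration order is random, so sort the flag names first.
	names := make([]string, 0, len(in.Flags))
	for name := range in.Flags {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		put("flag:"+name, in.Flags[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// SnapshotUsable applies the fail-safe rule: an empty or mismatched
// fingerprint means the caller must fall back to a cold run.
func SnapshotUsable(recorded, current string) bool {
	return recorded != "" && recorded == current
}

func main() {
	recorded := FingerprintInput{
		BazelVersion:     "7.4.1",
		BazelDiffVersion: "example-version",
		ModuleLockDigest: "lock-digest",
		BazelrcDigest:    "rc-digest",
		Flags:            map[string]string{"--useCquery": "true", "--targetType": "Rule"},
	}
	current := recorded
	fmt.Println("same env usable:", SnapshotUsable(Fingerprint(recorded), Fingerprint(current)))

	current.BazelVersion = "7.4.0" // any drift must invalidate the snapshot
	fmt.Println("drifted env usable:", SnapshotUsable(Fingerprint(recorded), Fingerprint(current)))
}
```

Whether a given flag belongs in `Flags` at all is exactly the allow/deny decision this open question asks for; the sketch takes that list as given.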