manifest/bazel: nested-workspace + Bazel-native Maven extraction#1342

Draft
Simon (simonhj) wants to merge 10 commits into v1.x from simon/bazel-subworkspace-discovery-v1x

@simonhj Simon (simonhj) commented May 28, 2026

Summary

Rewrites socket manifest bazel's Maven extraction pipeline so it (a)
discovers every workspace under the scan root, not just cwd, and (b)
relies on Bazel-native commands for repo enumeration instead of static
Starlark regex parsing.

1. Walk the scan root for every workspace (MODULE.bazel / WORKSPACE /
   WORKSPACE.bazel). Caller-supplied prune policy.
2. Per workspace, discover Maven hubs:
   - Bzlmod: bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
   - WORKSPACE: probe conventional names (@maven, @maven_install,
     @maven_dev, @unpinned_maven, @maven_unpinned). Tri-state
     classifier (populated / empty / not-defined).
3. Per populated candidate, run per-repo metadata cquery:
     attr("tags", "\bmaven_coordinates=", @<repo>//...)
     ∪ attr("maven_coordinates", ".+", @<repo>//...)
     ∪ attr("maven_url", ".+", @<repo>//...)
   --output=jsonproto --keep_going
4. Aggregate across all workspaces; dedup by full Maven coordinate.
5. Write synthesized maven_install.json plus sidecar
   manifest-status.json with per-repo status entries.

Server isolation

Every Bazel invocation runs under a per-CLI-call
--output_user_root=<tempdir>. On per-repo cquery timeout the
orchestrator reaps the server (bazel shutdown) and rm -rfs the
tempdir, then mints a fresh one for subsequent repos. The
finally-block cleans up every tempdir that was minted. A single
hostile repository_rule no longer cascades into the rest of the run.

New modules

  • bazel-workspace-walk.mts — pure-function workspace walker with
    injected prune policy (ignoreDirNames, ignoreDirPrefixes).
    MAX_WALK_DEPTH = 8 (corpus survey: deepest realistic application
    layout is 7; bazel-self test fixtures hit 9). The orchestrator
    composes the codebase-wide IGNORED_DIRS from src/utils/glob.mts
    with Bazel-specific extras (bazel-*, dist*,
    .socket-auto-manifest, plus VCS/IDE dirs).
  • bazel-cquery.mts — per-repo metadata cquery + defensive
    jsonproto parser (dispatches on attribute[].type; accepts both
    camelCase stringValue/stringListValue and snake_case
    string_value/string_list_value; tolerates Bazel 5+ envelope and
    older per-line streamed shapes).

Rewritten modules

  • bazel-repo-discovery.mts — drops the entire Starlark regex
    parser (USE_REPO_RE, MAVEN_INSTALL_NAME_RE,
    parseMavenRepoCandidates, listLegacyStarlarkFiles,
    safeReadFile, parseVisibleRepoCandidates, validateMavenRepo,
    discoverMavenRepos). New primitives: parseShowExtensionOutput,
    classifyProbeResult, probeCandidate, CONVENTIONAL_MAVEN_REPO_NAMES.
  • bazel-query-runner.mts — centralises startup-flag construction
    (--bazelrc / --output_user_root / --output_base). Drops
    buildProbeFor (kind-only probe). Adds
    runBazelModShowMavenExtension and buildMavenProbeFor
    (lightweight presence-check cquery feeding the tri-state
    classifier). parseVisibleRepoCandidates moved to
    bazel-pypi-discovery.mts (its only remaining consumer).
  • extract_bazel_to_maven.mts — wraps the per-workspace algorithm
    in a tree walk. Drops the unsorted_deps.json fast path (the
    metadata cquery returns the same GAVs without depending on
    bazel-out symlinks or generated artefacts) and the lockfile
    merge-back loop (server walker handles it).

Sidecar shape

.socket-auto-manifest/manifest-status.json:

{
  "complete": true,
  "workspaces": [
    {
      "relPath": "",
      "mode": { "bzlmod": true, "workspace": false },
      "repos": [
        { "name": "maven", "status": "ok",      "artifactCount": 118, "durationMs": 28213 },
        { "name": "maven_dev", "status": "empty", "artifactCount": 0,   "durationMs": 102 }
      ]
    }
  ]
}

complete: false fires iff any repo timed out.
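
A minimal sketch of the sidecar shape as TypeScript types, with the `complete` rule spelled out. The type and function names here are illustrative assumptions, not the module's actual exports; the status values are the ones listed elsewhere in this PR (ok / partial / timeout / empty / error).

```typescript
// Hypothetical types mirroring the JSON above; names are illustrative only.
type RepoStatus = 'ok' | 'partial' | 'timeout' | 'empty' | 'error';

interface RepoStatusEntry {
  name: string;
  status: RepoStatus;
  artifactCount: number;
  durationMs: number;
}

interface WorkspaceStatus {
  relPath: string; // '' for the workspace at the scan root
  mode: { bzlmod: boolean; workspace: boolean };
  repos: RepoStatusEntry[];
}

interface ManifestStatus {
  complete: boolean;
  workspaces: WorkspaceStatus[];
}

// `complete` is false if and only if at least one repo timed out.
function buildManifestStatus(workspaces: WorkspaceStatus[]): ManifestStatus {
  const anyTimeout = workspaces.some(ws =>
    ws.repos.some(repo => repo.status === 'timeout'),
  );
  return { complete: !anyTimeout, workspaces };
}
```

Note that under this rule an `error` or `empty` entry alone does not flip `complete`; only a timeout does.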

…ub-workspace discovery

The existing bazel-query discovery path only inspects MODULE.bazel /
WORKSPACE at the invocation cwd. Ruleset repos with per-example
sub-workspaces (rules_kotlin/examples, rules_js/examples, rules_rust,
rules_python) declare additional Maven artifacts in nested MODULE.bazel
projects with their own maven_install.json lockfiles. Those files were
silently dropped, leaving the CLI's SBOM a strict subset of what the
server-side depscan parser already returns from the same tree.

Add a walker that finds every checked-in maven_install.json under cwd
(pruning .git, node_modules, .socket-auto-manifest, and Bazel's
bazel-* convenience symlinks into <output_base>), parses each via the
existing parseUnsortedDepsJson v2-lockfile path, and merges the
artifacts into the SBOM after the bazel-query extraction step. Merge
is keyed by mavenCoordinates so the root workspace's lockfile (which
bazel-query already extracts) does not double-count; conflicting
group:artifact versions across sub-workspaces continue to surface as
the existing loud-failure error in normalizeToMavenInstallJson.

Verified against bazel-bench/oss/rules_kotlin: walker now surfaces all
10 examples/*/maven_install.json files and merges 393 unique artifacts
into the SBOM beyond what the root @kotlin_rules_maven discovery
returns. No regression on tink-java (0 lockfiles) or protobuf (1 root
lockfile, deduped against bazel-query's @maven extraction).
…er walker already covers it

The CLI was walking the tree for **/maven_install.json and **/*_maven_install.json
lockfiles and merging them into its output. The server-side scan walker matches the
same pattern natively via getReportSupportedFiles, so the CLI re-reading these files
duplicated work and produced output that was a strict subset of what the walker
already saw when the scan was uploaded.

Removes:
- bazel-lockfile-discovery.mts (196 lines)
- bazel-lockfile-discovery.test.mts (241 lines)
- extract_bazel_to_maven step 5b (33 lines): the merge-back-into-allArtifacts loop

The .socket-auto-manifest/maven_install.json the CLI emits is still picked up by
the same walker — that composition stays intact. After this change the CLI emits
only what running bazel produces (the complement of the walker's lockfile coverage).
…very

`findWorkspaceRoots` walks the tree from cwd and returns every directory
containing MODULE.bazel / WORKSPACE / WORKSPACE.bazel. Monorepos host
multiple workspace roots (e.g. examples/<name>/MODULE.bazel, mobile/
MODULE.bazel under an otherwise non-Bazel root); the per-workspace
algorithm in the orchestrator runs once per discovered root.

Pruning matches the previous lockfile walker: skip the usual non-workspace
directories (.git, node_modules, .socket-auto-manifest, etc.), Bazel's
`bazel-*` output_base symlinks (so we never recurse into tens of GiB of
generated state), and `dist*` build-output directories. The `MAX_WALK_DEPTH`
and `MAX_WORKSPACE_ROOTS` caps guard against pathological inputs and symlink
loops.

Pure-function module with no Bazel calls; unit tests use a tmpdir
fixture tree and cover the root-only, nested, prune, symlink, and
sort-determinism cases.
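
For illustration, a sketch of the walk over an in-memory tree. The `DirNode` type, the function name, and the option names are assumptions made for this example; the real walker reads the filesystem and also enforces `MAX_WORKSPACE_ROOTS`, which is omitted here.

```typescript
// Hypothetical in-memory directory node; stands in for the real filesystem.
interface DirNode {
  files: string[];
  dirs: { [name: string]: DirNode };
}

interface WalkOptions {
  ignoreDirNames?: ReadonlySet<string>;
  ignoreDirPrefixes?: readonly string[];
  maxDepth?: number;
}

const WORKSPACE_MARKERS = new Set(['MODULE.bazel', 'WORKSPACE', 'WORKSPACE.bazel']);

// Returns the sorted relative paths of every directory holding a workspace marker.
function findWorkspaceRootsSketch(root: DirNode, options: WalkOptions = {}): string[] {
  const names = options.ignoreDirNames ?? new Set<string>();
  const prefixes = options.ignoreDirPrefixes ?? [];
  const maxDepth = options.maxDepth ?? 8;
  const found: string[] = [];

  const visit = (node: DirNode, relPath: string, depth: number): void => {
    if (node.files.some(file => WORKSPACE_MARKERS.has(file))) {
      found.push(relPath);
    }
    if (depth >= maxDepth) {
      return; // depth cap: do not descend further
    }
    for (const name of Object.keys(node.dirs)) {
      const pruned = names.has(name) || prefixes.some(p => name.startsWith(p));
      if (pruned) {
        continue;
      }
      visit(node.dirs[name], relPath === '' ? name : `${relPath}/${name}`, depth + 1);
    }
  };

  visit(root, '', 0);
  return found.sort(); // deterministic output order
}
```

Sorting the result keeps the output stable regardless of directory enumeration order, which is what the sort-determinism test case relies on.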
…+ probe primitives

Drop all static parsing of MODULE.bazel / WORKSPACE / *.bzl sources.
Bazel itself sees those files via `mod show_extension` and `cquery`; the
CLI no longer needs to interpret Starlark.

`parseShowExtensionOutput` consumes the text-format report from
  bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
and returns the hub repos (items annotated with `(imported by ...)`).
Generated per-artifact bullets are skipped; `DEBUG:` / `WARNING:` lines
are tolerated; the parser stops at the next `## ` section header so
multi-extension reports don't cross-contaminate.

`classifyProbeResult` turns a raw probe outcome into a tri-state status:
  - populated: code=0 + non-empty stdout
  - empty:     code=1 + "no targets found beneath"
  - not-defined: code=1 + "No repository visible" / "no such package",
                  or code=0 + empty stdout (WORKSPACE-mode silent miss)
The orchestrator treats `empty` and `not-defined` uniformly as skips; the
distinction is preserved for the sidecar status report.
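
The classification rules above can be sketched as follows. This is a rough illustration, not the shipped `classifyProbeResult`: it assumes the diagnostic messages arrive on stderr, and it treats any unrecognised non-zero outcome as `not-defined`, which the rules above do not specify.

```typescript
type ProbeStatus = 'populated' | 'empty' | 'not-defined';

interface ProbeOutcome {
  code: number;
  stdout: string;
  stderr: string;
}

function classifyProbeSketch(outcome: ProbeOutcome): ProbeStatus {
  if (outcome.code === 0) {
    // Exit 0 with no output is the WORKSPACE-mode silent miss.
    return outcome.stdout.trim() !== '' ? 'populated' : 'not-defined';
  }
  if (outcome.stderr.includes('no targets found beneath')) {
    return 'empty';
  }
  // Covers "No repository visible" and "no such package".
  // Assumption: every other failure is also reported as not-defined.
  return 'not-defined';
}
```

Because the orchestrator skips both `empty` and `not-defined`, the fallback only affects what the status report says, not which repos get extracted.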

`CONVENTIONAL_MAVEN_REPO_NAMES` exposes the names the legacy WORKSPACE
path probes (`maven`, `maven_install`, `maven_dev`, `unpinned_maven`,
`maven_unpinned`). `--bazel-maven-repo=` extras are appended by the
orchestrator (sibling todo).

Deleted exports: `parseMavenRepoCandidates`, `parseVisibleRepoCandidates`,
`validateMavenRepo`, `discoverMavenRepos`. Their replacements live in the
new primitives above; the orchestrator rewrite that wires them up lands
in a follow-up layer. `extract_bazel_to_maven.mts` does not typecheck
in this intermediate state — fixed in the orchestrator commit.

Tests cover the parser fixture (hub vs generated, separator variants,
multi-section reports), the tri-state classifier (every documented
input), and the verbose-logging contract for `probeCandidate`.
…tate probe

bazel-query-runner now centralises startup-flag construction so every
spawn — query, cquery, mod show_extension, mod dump_repo_mapping —
threads `--bazelrc`, `--output_user_root`, and `--output_base`
consistently. The new optional `outputUserRoot` field on
`BazelQueryOptions` is the Maven path's hook for per-invocation server
isolation; the orchestrator (next commit) mkdtemp's a fresh path and
will reap the server via `bazel shutdown` + `rm -rf` on success and on
timeout, so timed-out servers no longer leak across CLI invocations.

Add `runBazelModShowMavenExtension`: invokes
  bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
to enumerate Maven hubs directly from the rules_jvm_external extension
report, replacing the over-enumerating `dump_repo_mapping` surface on
the Maven path. `runBazelModShowVisibleRepos` is kept around for the
legacy PyPI extractor, which has not been rescoped yet.

Replace the Maven-side `buildProbeFor` (which emitted a kind-only
`kind("jvm_import rule|aar_import rule", @repo//:*)` query) with
`buildMavenProbeFor`, a lightweight `cquery '@<name>//...' --output=label
--keep_going` presence check whose result feeds the new tri-state
classifier in bazel-repo-discovery. Kind-only filtering missed
POM-only / native / AAR-without-aar_import artefacts and any future
rules_jvm_external rule shape; the metadata filter is now applied by
the per-repo extraction cquery (next layer), not by the probe.

Update `buildPypiProbeFor`'s return shape to include stderr so it
satisfies the new `RepoProbe` type contract. Move
`parseVisibleRepoCandidates` and the `ValidationResult` type into
bazel-pypi-discovery (their only remaining consumer); the Maven module
no longer carries dump_repo_mapping-shaped code.

Tests cover the new argv shapes for every spawn surface, the
outputUserRoot startup-flag placement (before subcommand), the
Maven probe argv (cquery + @repo//... + --output=label + --keep_going),
and the full result-triple propagation (code/stdout/stderr) that the
tri-state classifier needs.
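
A small sketch of the startup-flag placement described above: startup flags go before the subcommand, everything else after it. The option field names `bazelRc` and `outputBase`, the helper names, and the exact ordering of the trailing arguments are assumptions for illustration; only `outputUserRoot` is named in this commit.

```typescript
// Hypothetical subset of the runner options relevant to startup flags.
interface StartupOptions {
  bazelRc?: string;
  outputUserRoot?: string;
  outputBase?: string;
}

// Startup flags must precede the Bazel subcommand.
function buildBazelArgv(
  startup: StartupOptions,
  subcommand: string[],
  rest: string[],
): string[] {
  const startupFlags: string[] = [];
  if (startup.bazelRc) {
    startupFlags.push(`--bazelrc=${startup.bazelRc}`);
  }
  if (startup.outputUserRoot) {
    startupFlags.push(`--output_user_root=${startup.outputUserRoot}`);
  }
  if (startup.outputBase) {
    startupFlags.push(`--output_base=${startup.outputBase}`);
  }
  return startupFlags.concat(subcommand, rest);
}

// Presence-check probe: cquery + @repo//... + --output=label + --keep_going.
function buildMavenProbeArgv(repoName: string, startup: StartupOptions): string[] {
  return buildBazelArgv(startup, ['cquery'], [
    `@${repoName}//...`,
    '--output=label',
    '--keep_going',
  ]);
}
```

Building the argv in one place is what lets every spawn surface pick up `--output_user_root` without each call site remembering to add it.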
`runMetadataCqueryForRepo` executes the per-repo extraction cquery and
returns a structured outcome (`ok` / `partial` / `timeout` / `empty` /
`error`) so the orchestrator can populate sidecar status without
custom error plumbing per call site. The cquery target expression is
the union of three predicates — `attr("tags", "\bmaven_coordinates=",
...)`, `attr("maven_coordinates", ".+", ...)`, and `attr("maven_url",
".+", ...)`. That matches rules_jvm_external's `jvm_import` /
`aar_import` shapes, Bazel-native `java_library` with direct
`maven_coordinates`, and POM-only / source-jar shapes that carry only
`maven_url`. Word-boundary `\b` in the tags predicate prevents matches
on values like `pre_maven_coordinates=fake`.
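
To make the predicate union and the word-boundary behaviour concrete, here is a rough sketch. The function names are hypothetical, the sample coordinate is made up, and a JavaScript regex stands in for Bazel's own regex matching, so this only illustrates the `\b` semantics rather than reproducing Bazel's evaluation.

```typescript
// Builds the three-predicate union for one repo using Bazel's `union` operator.
function buildMetadataTargetExpression(repoName: string): string {
  const scope = `@${repoName}//...`;
  return [
    `attr("tags", "\\bmaven_coordinates=", ${scope})`,
    `attr("maven_coordinates", ".+", ${scope})`,
    `attr("maven_url", ".+", ${scope})`,
  ].join(' union ');
}

// JavaScript stand-in for the tags predicate.
const TAG_PREDICATE = /\bmaven_coordinates=/;

function tagMatches(serializedTags: string): boolean {
  return TAG_PREDICATE.test(serializedTags);
}

// `[` is a non-word character, so a boundary exists before `maven`.
const matchesRealTag = tagMatches('[maven_coordinates=com.example:lib:1.0]'); // true
// `_` is a word character, so there is no boundary inside `pre_maven`.
const matchesFakeTag = tagMatches('[pre_maven_coordinates=fake]'); // false
```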

`parseCqueryJsonproto` is defensive about the jsonproto encoding:
dispatches on `attribute[].type`, accepts both camelCase
(`stringValue`, `stringListValue`) and snake_case (`string_value`,
`string_list_value`) payload keys, and tolerates both the Bazel 5+
envelope shape (`{ "results": [{ "target": {...} }] }`) and the older
per-line streamed shape. Coordinate extraction prefers the direct
`maven_coordinates` attribute; falls back to scanning `tags` for
`maven_coordinates=G:A:V`. Provenance lands in `sourceRepo` as
`<workspace-rel-path>:<repoName>` (or just `<repoName>` at the root),
so the orchestrator's dedup can attribute artifacts back to their
discovery site.
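
A sketch of the coordinate-extraction preference and the `sourceRepo` format described above. The helper names and the `RawAttribute` type are assumptions for illustration, as is the premise that the `type` discriminators appear as the strings `STRING` and `STRING_LIST` in the jsonproto output.

```typescript
// Hypothetical view of one jsonproto attribute, accepting both key casings.
interface RawAttribute {
  name?: string;
  type?: string;
  stringValue?: string;
  string_value?: string;
  stringListValue?: string[];
  string_list_value?: string[];
}

function readString(attr: RawAttribute): string | undefined {
  if (attr.type !== 'STRING') return undefined;
  return attr.stringValue ?? attr.string_value;
}

function readStringList(attr: RawAttribute): string[] {
  if (attr.type !== 'STRING_LIST') return [];
  return attr.stringListValue ?? attr.string_list_value ?? [];
}

const COORDINATE_TAG_PREFIX = 'maven_coordinates=';

// Prefers the direct attribute; falls back to a `maven_coordinates=` tag.
function extractCoordinate(attributes: RawAttribute[]): string | undefined {
  const direct = attributes.find(a => a.name === 'maven_coordinates');
  const directValue = direct ? readString(direct) : undefined;
  if (directValue) return directValue;

  const tags = attributes.find(a => a.name === 'tags');
  if (!tags) return undefined;
  for (const tag of readStringList(tags)) {
    if (tag.startsWith(COORDINATE_TAG_PREFIX)) {
      return tag.slice(COORDINATE_TAG_PREFIX.length);
    }
  }
  return undefined; // target carries no coordinate; the caller skips it
}

// `<workspace-rel-path>:<repoName>`, or just `<repoName>` at the root.
function formatSourceRepo(workspaceRelPath: string, repoName: string): string {
  return workspaceRelPath === '' ? repoName : `${workspaceRelPath}:${repoName}`;
}
```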

Timeout handling: spawn rejections with `timedOut` / `killed` /
`SIGTERM` / `SIGKILL` map to `status: 'timeout'`. The runner does NOT
delete the outputUserRoot — server lifecycle (reap via
`bazel shutdown` + `rm -rf`) is the orchestrator's concern so that a
single tempdir can hold multiple per-repo runs.
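
As a rough sketch, the timeout mapping might look like the following. The `SpawnFailure` shape and its field names are assumptions modelled on the properties listed above; the actual error object depends on the spawn helper in use.

```typescript
// Hypothetical shape of a rejected spawn; field names are assumptions.
interface SpawnFailure {
  timedOut?: boolean;
  killed?: boolean;
  signal?: string | null;
}

// True when the rejection should be reported as status 'timeout'.
function isTimeoutFailure(failure: SpawnFailure): boolean {
  return (
    failure.timedOut === true ||
    failure.killed === true ||
    failure.signal === 'SIGTERM' ||
    failure.signal === 'SIGKILL'
  );
}
```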

Also widen `ExtractedArtifact.ruleKind` from the literal
`'jvm_import' | 'aar_import'` union to `string`. The legacy text-format
parsers only ever set those two values, but the metadata cquery
returns whatever `ruleClass` Bazel reports (`java_library`,
`kt_jvm_import`, any future rules_jvm_external rule). Existing
consumers only read the field diagnostically; nothing else changes.

Tests cover the parser (envelope, per-line stream, snake_case
fallback, direct-vs-tag preference, missing-coordinate skip, empty
input), the argv builder (target expression union, startup-flag
placement, `--bazel-flag` placement, invocationFlags order), and the
runner's status classification including the spawn-timeout branch.
…thm in a tree walk

`extractBazelToMaven` now walks the scan root for every workspace
(MODULE.bazel / WORKSPACE / WORKSPACE.bazel) and runs the per-workspace
extraction algorithm in each one. Monorepos like rules_kotlin
(examples/<name>/MODULE.bazel) and projects with mobile sub-workspaces
(mobile/MODULE.bazel under a non-Bazel root) are no longer
silently reduced to a root-only scan.

Per workspace:
  1. Detect Bzlmod vs WORKSPACE mode.
  2. Discover candidate Maven hubs:
       - Bzlmod: bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven,
         parsed via parseShowExtensionOutput.
       - WORKSPACE (or Bzlmod fallback): probe the conventional names
         (maven, maven_install, maven_dev, unpinned_maven, maven_unpinned)
         plus any customer-supplied extras via the tri-state classifier.
  3. Per populated candidate: run the metadata cquery
     (`attr("tags", "\bmaven_coordinates=", @<repo>//...)` ∪ direct
     `maven_coordinates` / `maven_url` attrs) and accept the parsed
     artefacts.
  4. Aggregate, then dedup across workspaces by full Maven coordinate.
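
Step 4 can be sketched as below. The `Artifact` shape and the first-seen-wins policy are assumptions for illustration; only the dedup key (the full Maven coordinate) comes from this commit.

```typescript
// Hypothetical minimal artifact record.
interface Artifact {
  coordinate: string; // full Maven coordinate, e.g. group:artifact:version
  sourceRepo: string;
}

// Flattens per-workspace results and keeps the first artifact per coordinate.
function dedupByCoordinate(perWorkspace: Artifact[][]): Artifact[] {
  const seen = new Map<string, Artifact>();
  for (const artifacts of perWorkspace) {
    for (const artifact of artifacts) {
      if (!seen.has(artifact.coordinate)) {
        seen.set(artifact.coordinate, artifact);
      }
    }
  }
  return Array.from(seen.values());
}
```

Because the key includes the version, two workspaces pinning different versions of the same group:artifact both survive this step and are left for later stages to reconcile.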

Server isolation is now invariant: every Bazel invocation runs under a
per-CLI-call --output_user_root=<tempdir>. On per-repo cquery timeout
the orchestrator reaps the server (`bazel shutdown`) and `rm -rf`'s the
tempdir, then mints a fresh one for subsequent repos — a single bad
hub no longer cascades into the rest of the run. The finally-block
cleanup reaps every tempdir that was minted, including the last one.
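
A simplified, synchronous sketch of the reap-and-remint loop. The `IsolationDeps` interface and function names are hypothetical, the collaborators are injected so the control flow can be followed without Bazel, and the fresh root is minted lazily here, which may differ from the real orchestrator.

```typescript
type RepoOutcome = 'ok' | 'partial' | 'timeout' | 'empty' | 'error';

// Hypothetical collaborators, injected for illustration.
interface IsolationDeps {
  mintRoot(): string; // e.g. create a fresh temp directory
  reapRoot(root: string): void; // e.g. bazel shutdown, then remove the directory
  runRepo(repoName: string, root: string): RepoOutcome;
}

function runReposIsolated(
  repoNames: string[],
  deps: IsolationDeps,
): { [repoName: string]: RepoOutcome } {
  const results: { [repoName: string]: RepoOutcome } = {};
  let root: string | undefined = deps.mintRoot();
  try {
    for (const repoName of repoNames) {
      if (root === undefined) {
        root = deps.mintRoot(); // previous root was reaped after a timeout
      }
      const outcome = deps.runRepo(repoName, root);
      results[repoName] = outcome;
      if (outcome === 'timeout') {
        deps.reapRoot(root); // drop the possibly wedged server
        root = undefined;
      }
    }
  } finally {
    if (root !== undefined) {
      deps.reapRoot(root); // the last live root is always cleaned up
    }
  }
  return results;
}
```

In this sketch every minted root is reaped exactly once, either right after the timeout that poisoned it or in the finally block.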

Sidecar `manifest-status.json` lands beside the synthesized
`maven_install.json`. Each entry records the repo's classified status
(ok / partial / timeout / empty / error), artifact count, and duration,
so the server-side can surface partial results to the customer. The
top-level `complete: false` flag fires iff any repo timed out.

Deleted: the unsorted_deps.json fast path (`extractFromOneRepo`,
`bazelExternalDir`, `isForceQueryFallbackEnabled` env knob) — the
metadata cquery returns the same GAVs the fast path used to recover,
without depending on bazel-out symlinks or generated artefacts.
Deleted: the lockfile merge (already done in a previous commit on this
branch); deleted: the kind-only probe and dump_repo_mapping enumeration.

The orchestrator's `ExtractBazelOptions` now accepts
`extraMavenRepoNames` (legacy WORKSPACE non-conventional hub names) and
`perRepoTimeoutMs` (per-repo cquery cap). The CLI flag wiring lands in
a sibling commit; existing call sites continue to pass the same fields
they did before.

Existing `extract_bazel_to_maven.test.mts` is pinned to the old
unsorted_deps fast path and is replaced wholesale in the next commit
(test layer).
…e pipeline

The previous tests pinned the legacy unsorted_deps.json fast path,
kind-only probes, and dump_repo_mapping enumeration. The new tests
mock the orchestrator's three external collaborators —
findWorkspaceRoots, runBazelModShowMavenExtension, runMetadataCqueryForRepo —
and assert on the contract that matters: end-to-end Bzlmod and
WORKSPACE-mode flows, the per-repo cquery loop, cross-workspace
coordinate dedup, the timeout → re-mint loop, sidecar
`manifest-status.json` shape, and `extraMavenRepoNames` threading.

Pure-function `normalizeToMavenInstallJson` keeps a focused trio of
unit tests (dedup, version-conflict, sha256-preservation). The
fixture-driven .socket.facts.json non-emission assertion stays so the
Maven-path-vs-facts-path invariant is exercised.

Also patch the PyPI test mock: parseVisibleRepoCandidates moved from
bazel-repo-discovery to bazel-pypi-discovery in a previous commit, so
the test's vi.mock now mirrors the actual export surface. The probe
fixture grows a `stderr` field to match the new RepoProbe contract.
…GNORED_DIRS

`findWorkspaceRoots` no longer hardcodes the directory-prune set —
callers pass `ignoreDirNames: ReadonlySet<string>` and
`ignoreDirPrefixes: readonly string[]` via options. Neither defaults
to anything; absent means no pruning. This keeps the walker decoupled
from any particular ignore policy and avoids duplicating the
codebase-wide `IGNORED_DIRS` list.

`src/utils/glob.mts` exports `IGNORED_DIRS` so the orchestrator can
compose it with Bazel-specific extras. The orchestrator's composed
set: `IGNORED_DIRS` plus `.hg`, `.idea`, `.pnpm-store`,
`.socket-auto-manifest`, `.svn`, `.vscode`; prefixes `bazel-` and
`dist`.
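
A sketch of how the composed prune policy might be assembled. The `IGNORED_DIRS` value below is a placeholder subset, since its real contents live in `src/utils/glob.mts` and are not listed here; the `shouldPrune` helper is hypothetical.

```typescript
// Placeholder: the real list is exported from src/utils/glob.mts.
const IGNORED_DIRS: readonly string[] = ['.git', 'node_modules'];

// Bazel-specific extras named in this commit.
const BAZEL_EXTRA_DIR_NAMES: readonly string[] = [
  '.hg',
  '.idea',
  '.pnpm-store',
  '.socket-auto-manifest',
  '.svn',
  '.vscode',
];

const ignoreDirNames: ReadonlySet<string> = new Set(
  IGNORED_DIRS.concat(BAZEL_EXTRA_DIR_NAMES),
);
const ignoreDirPrefixes: readonly string[] = ['bazel-', 'dist'];

// Hypothetical helper showing how the two options combine for one directory.
function shouldPrune(dirName: string): boolean {
  return (
    ignoreDirNames.has(dirName) ||
    ignoreDirPrefixes.some(prefix => dirName.startsWith(prefix))
  );
}
```

Note that the `dist` prefix also matches names such as `distribution`, so any directory starting with those four letters is skipped under this policy.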

Also tighten `MAX_WALK_DEPTH` from 16 → 8. Deepest workspace marker
observed across the surveyed OSS corpus is 9 (bazel-self test
fixtures); deepest in realistic application code is 7 (checkmk's
thirdparty layout). The cap gives one level of headroom over the
realistic max while still guarding against pathological symlink loops
that slipped past any prefix prune the caller supplied.

Walker test rewritten against the new injected API: covers the
no-prune-by-default case (`node_modules/MODULE.bazel` surfaces unless
the caller ignores `node_modules`), injected name and prefix prunes,
and the bazel-* symlink case under the prefix injection.
@simonhj Simon (simonhj) force-pushed the simon/bazel-subworkspace-discovery-v1x branch from 20957bc to 23e2f96 Compare May 28, 2026 19:31
No consumer reads it today. The orchestrator still tracks per-repo
timeouts to decide ExtractBazelResult.ok and to reap+remint the
output_user_root, but no longer serialises the per-workspace /
per-repo status report to disk.