engine: rename pageindex→treewalk + auto strategy default (HAL-106) by hallelx2 · Pull Request #35 · hallelx2/vectorless-engine

hallelx2 · 2026-06-06T12:39:28Z

Strips the borrowed PageIndex name from the engine and makes treewalk the canonical strategy.

What's in here

Rename pageindex → treewalk across code, API/docs (openapi.yaml), and config examples (config.example*.yaml), including env-var references. Zero pageindex references remain.
New auto strategy as the default: small docs (leaf content within the call budget) → single-pass; larger/complex docs with valid page metadata → treewalk; non-paged docs fall back to single-pass.
Hardened AutoStrategy selection paths to be nil-safe (pick/Select/SelectWithCost) and added tests for non-paged fallback + nil-safety.
Validation/tests/docs alignment updates for auto (config validation coverage and request-override docs).

Verification

go build ./... — OK
go vet — OK
package tests (retrieval / config / api / handler) — all green

Not in here (tracked separately)

Retiring agentic/chunked-tree + folding the dedicated endpoint → HAL-108
Unifying the section tree + TOC into one canonical map → HAL-109
HyDE → candidate-questions rename → HAL-107

Summary by CodeRabbit

New Features
- Added TreeWalk retrieval strategy with dedicated /v1/answer/treewalk endpoint
- Added auto strategy for dynamic retrieval method selection
Changes
- Replaced PageIndex retrieval configuration with TreeWalk configuration
- Default retrieval strategy changed to "auto"
- Updated environment variable names for retrieval settings

Summary by CodeRabbit

New Features
- Added POST /v1/answer/treewalk with streaming (SSE) and reasoning traces; introduced "treewalk" and a new "auto" retrieval strategy.
Behavior / Request Overrides
- Per-request overrides supported (max_hops, max_pages_per_fetch, model) and stream=true for SSE.
Configuration Changes
- Retrieval config renamed pageindex → treewalk; default strategy now "auto". Env vars moved to VLE_/VLS_RETRIEVAL_TREEWALK_*.
Chores / Bug Fixes
- CI toolchain bumped to Go 1.25.11; improved resource cleanup to ignore close errors where appropriate.

Strip the borrowed PageIndex name — treewalk is now the native identifier for Vectorless's tree-navigation strategy across the strategy id, Go types, config keys, the /v1/answer/treewalk endpoint, and all tests. Pure rename; go build, go vet, and package tests all green. Remaining HAL-106 sub-tasks: promote treewalk to default strategy, dedup the agentic overlap, and fold the dedicated endpoint into /v1/answer.

Smart default selection: route small documents (total leaf content within the call budget) to single-pass for one fast cheap call, and larger or more complex documents to treewalk for agentic navigation. New AutoStrategy in pkg/retrieval with unit tests; wired into config (default value + validation) and both binaries' strategy builders + the selectable-strategy set. go build, go vet, and package tests all green.

sourcery-ai

Sorry @hallelx2, your pull request is larger than the review limit of 150000 diff characters

coderabbitai · 2026-06-06T12:39:39Z

📝 Walkthrough

Walkthrough

Replaces PageIndex with TreeWalk across config, OpenAPI, server/router, handlers, and retrieval strategy; adds AutoStrategy; rewrites endpoint tests to /v1/answer/treewalk; applies many best‑effort Close() changes and serializes pdftable calls.

Changes

TreeWalk migration and Auto strategy

Layer / File(s)	Summary
Config, defaults and OpenAPI `pkg/config/config.go`, `internal/config/config.go`, `internal/config/config_treewalk_test.go`, `pkg/config/config_test.go`, `openapi.yaml`, `config.example.yaml`, `config.server.example.yaml`	Replaces PageIndex config with TreeWalkBlock, changes default `retrieval.strategy` → `"auto"`, adds TreeWalk env overrides and tests, updates OpenAPI to `POST /v1/answer/treewalk` and TreeWalk schemas.
Core retrieval implementations `pkg/retrieval/treewalk_strategy.go`, `pkg/retrieval/treewalk_strategy_test.go`, `pkg/retrieval/auto.go`, `pkg/retrieval/auto_test.go`, `pkg/retrieval/strategy.go`	Implements `TreeWalkStrategy`, `TreeWalkAction` parser, `TreeWalkEvent`, `computeTreeWalkTraceToken`, and adds `AutoStrategy` to pick small vs large sub-strategies by token/budget; tests updated/renamed accordingly.
API handlers and router `internal/api/treewalk.go`, `internal/handler/answer_treewalk.go`, `internal/handler/router.go`, `internal/api/server.go`, `internal/handler/router_parity_test.go`	Introduces `/v1/answer/treewalk` handler (SSE streaming and non-streaming), per-request strategy cloning/overrides, swaps deps to `TreeWalkStrategy`/`TreeWalk`, and updates router wiring and parity tests.
Server/engine startup and builders `cmd/server/main.go`, `cmd/engine/main.go`	Constructs and injects `TreeWalkStrategy` when enabled & LLM present, updates `buildStrategy`/`buildStrategySet` to support `"treewalk"` and `"auto"` (SinglePass + TreeWalk), wires storage-backed page loader and optional DB TOC provider, and logs configured caps.
Handler tests and end-to-end coverage `internal/api/treewalk_test.go`, `internal/handler/query_strategy_test.go`	Rewrites endpoint tests to exercise `/v1/answer/treewalk`, adds scripted LLM for TreeWalk tests, covers reasoning trace, SSE streaming, citation dedup/cap, per-request overrides, TOC fallback, error mappings (400/404/501), and replay persistence checks.
Best-effort resource-close and misc maintenance many files (`internal/`, `pkg/`, `cmd/*`, `pkg/parser/pdf.go`, `.github/workflows/ci.yml`, `.golangci.yml`)	Converts many deferred `Close()` calls to best-effort patterns that discard `Close()` errors, serializes `pdftable` calls with a mutex, updates CI toolchain pin and indirect deps, and small lint/middleware/test adjustments.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

hallelx2/vectorless-engine#26: Introduced PageIndex strategy and /v1/answer/pageindex wiring that this PR replaces.
hallelx2/vectorless-engine#17: Modifies answer endpoint wiring and DI; intersects with routing/Deps changes.
hallelx2/vectorless-engine#13: Touches retrieval strategy selection and related wiring (agentic/auto interactions).

"I hopped through code with nimble feet,
Replaced old pages with a TreeWalk beat,
Auto picks the path, events light the way,
Citations tidy, tests now play,
A tiny rabbit cheers this tidy feat." 🐇

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch halleluyaholudele/hal-106-treewalk-rename

Copilot

Pull request overview

This PR completes the engine-wide rename from the borrowed “PageIndex” terminology to “TreeWalk”, updates the dedicated answer endpoint to /v1/answer/treewalk, and introduces an auto retrieval strategy as the new default to route small documents to single-pass and larger documents to tree-walking.

Changes:

Renames PageIndex → TreeWalk across retrieval strategy types, identifiers, config blocks, handlers, and tests.
Adds AutoStrategy and switches the default retrieval.strategy to auto.
Rewires both binaries and routers to mount /v1/answer/treewalk and to expose treewalk/auto as selectable strategies.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
pkg/tree/tree.go	Updates TOCNode docs to reference TreeWalk terminology.
pkg/retrieval/treewalk_strategy.go	Renames PageIndexStrategy/Event/Action → TreeWalkStrategy/Event/Action and updates identifiers/prompts/log strings.
pkg/retrieval/treewalk_strategy_test.go	Renames and updates tests to exercise TreeWalkStrategy.
pkg/retrieval/strategy.go	Updates Result.PagesRead comment to cite treewalk as the page-based example.
pkg/retrieval/auto.go	Introduces AutoStrategy routing between “Small” and “Large” strategies.
pkg/retrieval/auto_test.go	Adds unit tests for AutoStrategy routing and naming.
pkg/ingest/toc_builder.go	Updates TOC builder docs/comments to reference TreeWalk pipeline naming.
pkg/ingest/minimal_mode_test.go	Updates minimal-mode test commentary to reference TreeWalk coverage.
pkg/ingest/ingest.go	Updates ingest docs/comments referencing the page-based strategy endpoint and TOC stage naming.
pkg/config/config.go	Renames retrieval.pageindex → retrieval.treewalk, switches default retrieval.strategy to auto, updates env overrides and validation.
pkg/config/config_test.go	Updates defaults/validation/env override tests for TreeWalk + auto default.
internal/handler/router.go	Renames wiring from /answer/pageindex to /answer/treewalk and updates Deps fields accordingly.
internal/handler/router_parity_test.go	Updates deployed router parity assertions for /v1/answer/treewalk.
internal/handler/query.go	Updates strategy override docs/comments and keeps override resolution by map lookup.
internal/handler/query_strategy_test.go	Updates override acceptance test from pageindex → treewalk.
internal/handler/answer_treewalk.go	Renames the multi-tenant deployed handler to AnswerTreeWalkHandler and rewires event tracing/citations.
internal/config/config.go	Forwards TreeWalk env overrides into embedded engine config.
internal/config/config_treewalk_test.go	Adds tests for TreeWalk max-citations env pass-through behavior.
internal/config/config_pageindex_test.go	Removes PageIndex env forwarding tests (superseded by TreeWalk equivalents).
internal/api/treewalk.go	Renames standalone engine handler and supporting helpers to TreeWalk equivalents.
internal/api/treewalk_test.go	Updates standalone API tests for /v1/answer/treewalk behavior.
internal/api/server.go	Mounts /v1/answer/treewalk and renames Deps fields for TreeWalk.
cmd/server/main.go	Switches default strategy to auto; wires selectable strategies set including treewalk and auto; rewires dedicated answer endpoint strategy instance.
cmd/engine/main.go	Switches default strategy to auto; rewires dedicated answer endpoint strategy instance and strategy builder cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+func (a *AutoStrategy) pick(t *tree.Tree, budget ContextBudget) Strategy {
+	threshold := a.SinglePassMaxTokens
+	if threshold <= 0 {
+		threshold = budget.Available()
+	}


+// Select implements Strategy.
+func (a *AutoStrategy) Select(ctx context.Context, t *tree.Tree, query string, budget ContextBudget) ([]tree.SectionID, error) {
+	return a.pick(t, budget).Select(ctx, t, query, budget)
+}


+func (a *AutoStrategy) SelectWithCost(ctx context.Context, t *tree.Tree, query string, budget ContextBudget) (*Result, error) {
+	s := a.pick(t, budget)
+	if cs, ok := s.(CostStrategy); ok {
+		return cs.SelectWithCost(ctx, t, query, budget)
+	}


 	// Strategy optionally overrides the configured retrieval strategy
-	// for THIS request only. One of: chunked-tree, pageindex, agentic,
+	// for THIS request only. One of: chunked-tree, treewalk, agentic,
 	// single-pass. Empty uses the server default. This lets a caller
 	// (e.g. the benchmark harness) A/B strategies against the same
 	// running engine without a redeploy. Unknown values return 400.


+	for _, s := range []string{"single-pass", "chunked-tree", "agentic", "treewalk"} {
 		cfg := Default()


 	switch c.Retrieval.Strategy {
-	case "single-pass", "chunked-tree", "agentic", "pageindex":
+	case "auto", "single-pass", "chunked-tree", "agentic", "treewalk":
 	default:
 		return fmt.Errorf("unknown retrieval.strategy: %q", c.Retrieval.Strategy)
 	}


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

internal/api/treewalk.go (1)
472-508: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove unused helper or add test coverage.

The treeWalkTraceTokenFromCitations function is unused according to static analysis (U1000) and the comment admits it's "unused at the moment." Unused code increases maintenance burden and can drift from the actual implementation.

Either:

Add test coverage that uses this function to verify trace token computation, or

Remove it (can always restore from git history when needed)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/api/treewalk.go` around lines 472 - 508, The function
treeWalkTraceTokenFromCitations is flagged as unused; either remove it or add
test coverage that calls it: if removing, delete the entire function (including
its references to tree.DocumentID, model, ranges, retrieval.SystemPromptVersion)
and run static checks; if keeping, add a unit test that imports the same
package, calls treeWalkTraceTokenFromCitations with representative DocumentID,
model and ranges inputs, and asserts the returned hex string matches a
precomputed expected value (or verifies deterministic properties), ensuring the
test exercises the sha256/hex path so the function is no longer unused.
Source: Linters/SAST tools
internal/handler/answer_treewalk.go (1)
277-299: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist replay data on the SSE path too.

The non-stream flow stores the final payload through writeJSONWithReplay, but the streaming flow only emits the final "answer" event. That leaves streamed trace_token values orphaned in practice, because nothing ever writes the replay entry behind them.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/handler/answer_treewalk.go` around lines 277 - 299, The streaming
path currently only calls emitSSE("answer", final) and never persists the replay
entry, so trace_token values from the stream are never written; update the
streaming end in the handler (where buildCitations produces citations and final
is assembled) to call the same persistence helper used by the non-stream flow
(writeJSONWithReplay) with the final payload (including trace_token) before or
immediately after emitSSE so the replay entry is created; ensure you pass the
same context/body identifiers and any required replay metadata to
writeJSONWithReplay so replay lookups for trace_token succeed.

🧹 Nitpick comments (1)

internal/config/config.go (1)

329-333: ⚡ Quick win

Clarify fallback wording to match actual precedence behavior.

On Line 333, the comment says garbled/negative input “leaves the engine default (3)”. In this function, parse failure/no-op preserves the current configured value (which can come from YAML), not always default 3.

Suggested comment-only fix

-	// VLE_; a garbled or negative value leaves the engine default (3).
+	// VLE_; a garbled or negative value leaves the current configured
+	// value unchanged (default is 3 when unset elsewhere).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/config/config.go` around lines 329 - 333, Update the comment on the
TreeWalk final-citation cap to accurately reflect precedence and fallback: state
that VLS_ overrides VLE_, and that a garbled or negative environment value is a
no-op that preserves the currently configured value (which may come from YAML or
a prior config), rather than always reverting to the engine default of 3; locate
the comment near the TreeWalk cap handling in internal/config/config.go
(references: VLS_ and VLE_) and revise the wording accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/handler/answer_treewalk.go`:
- Around line 260-262: serveStream currently forwards the whole
retrieval.TreeWalkEvent via strat.OnEvent/emitSSE regardless of
body.IncludeReasoning, leaking Reasoning/Answer; change the strat.OnEvent
handler (and the emit path) to check the request flag (body.IncludeReasoning)
and, when false, send a redacted payload that omits or zeroes sensitive fields
(e.g., TreeWalkEvent.Reasoning, TreeWalkEvent.Answer) or build a separate
minimal event type to emit; ensure the decision is done inside the strat.OnEvent
closure (or before calling emitSSE) so raw TreeWalkEvent is never streamed
unless IncludeReasoning is true.

In `@pkg/retrieval/auto.go`:
- Around line 52-68: The pick method on AutoStrategy can return nil if both
a.Small and a.Large are nil, causing Select/SelectWithCost to panic; fix by
validating strategies at construction or adding defensive checks: update NewAuto
to require non-nil Small or Large and return an error (preferred), or change
pick to return a non-nil fallback (e.g., prefer a.Small then a.Large) and update
Select and SelectWithCost to detect a nil return and return an error instead of
dereferencing; reference symbols: AutoStrategy, pick, NewAuto, Select,
SelectWithCost.

---

Outside diff comments:
In `@internal/api/treewalk.go`:
- Around line 472-508: The function treeWalkTraceTokenFromCitations is flagged
as unused; either remove it or add test coverage that calls it: if removing,
delete the entire function (including its references to tree.DocumentID, model,
ranges, retrieval.SystemPromptVersion) and run static checks; if keeping, add a
unit test that imports the same package, calls treeWalkTraceTokenFromCitations
with representative DocumentID, model and ranges inputs, and asserts the
returned hex string matches a precomputed expected value (or verifies
deterministic properties), ensuring the test exercises the sha256/hex path so
the function is no longer unused.

In `@internal/handler/answer_treewalk.go`:
- Around line 277-299: The streaming path currently only calls emitSSE("answer",
final) and never persists the replay entry, so trace_token values from the
stream are never written; update the streaming end in the handler (where
buildCitations produces citations and final is assembled) to call the same
persistence helper used by the non-stream flow (writeJSONWithReplay) with the
final payload (including trace_token) before or immediately after emitSSE so the
replay entry is created; ensure you pass the same context/body identifiers and
any required replay metadata to writeJSONWithReplay so replay lookups for
trace_token succeed.

---

Nitpick comments:
In `@internal/config/config.go`:
- Around line 329-333: Update the comment on the TreeWalk final-citation cap to
accurately reflect precedence and fallback: state that VLS_ overrides VLE_, and
that a garbled or negative environment value is a no-op that preserves the
currently configured value (which may come from YAML or a prior config), rather
than always reverting to the engine default of 3; locate the comment near the
TreeWalk cap handling in internal/config/config.go (references: VLS_ and VLE_)
and revise the wording accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eaee24cb-1189-4825-8bba-e42087fe803f

📥 Commits

Reviewing files that changed from the base of the PR and between f150e5b and 610b265.

📒 Files selected for processing (24)

cmd/engine/main.go
cmd/server/main.go
internal/api/server.go
internal/api/treewalk.go
internal/api/treewalk_test.go
internal/config/config.go
internal/config/config_pageindex_test.go
internal/config/config_treewalk_test.go
internal/handler/answer_treewalk.go
internal/handler/query.go
internal/handler/query_strategy_test.go
internal/handler/router.go
internal/handler/router_parity_test.go
pkg/config/config.go
pkg/config/config_test.go
pkg/ingest/ingest.go
pkg/ingest/minimal_mode_test.go
pkg/ingest/toc_builder.go
pkg/retrieval/auto.go
pkg/retrieval/auto_test.go
pkg/retrieval/strategy.go
pkg/retrieval/treewalk_strategy.go
pkg/retrieval/treewalk_strategy_test.go
pkg/tree/tree.go

💤 Files with no reviewable changes (1)

internal/config/config_pageindex_test.go

coderabbitai · 2026-06-06T12:50:31Z

+	strat.OnEvent = func(ev retrieval.TreeWalkEvent) {
 		emitSSE(ev.Type, ev)
 	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't stream raw TreeWalk events unless reasoning was requested.

serveStream ignores body.IncludeReasoning and forwards the full retrieval.TreeWalkEvent as the SSE payload. That exposes intermediate Reasoning/Answer fields to any streaming caller, which breaks the endpoint's opt-in reasoning contract.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/handler/answer_treewalk.go` around lines 260 - 262, serveStream currently forwards the whole retrieval.TreeWalkEvent via strat.OnEvent/emitSSE regardless of body.IncludeReasoning, leaking Reasoning/Answer; change the strat.OnEvent handler (and the emit path) to check the request flag (body.IncludeReasoning) and, when false, send a redacted payload that omits or zeroes sensitive fields (e.g., TreeWalkEvent.Reasoning, TreeWalkEvent.Answer) or build a separate minimal event type to emit; ensure the decision is done inside the strat.OnEvent closure (or before calling emitSSE) so raw TreeWalkEvent is never streamed unless IncludeReasoning is true.

… rename) Addresses the Copilot + CodeRabbit review on #35: - auto: non-paged docs (no PageStart/PageEnd — Markdown/HTML/DOCX/TXT) now route to single-pass, never treewalk (which navigates by page range) + nil-safe pick/Select/SelectWithCost so a misconfigured Auto can't panic. - complete the rename in non-Go files: openapi.yaml + config.example*.yaml (incl. the uppercase env-var refs) — zero pageindex remains anywhere. - query.go: list 'auto' in the strategy-override doc comment. - config_test: exercise 'auto' as a valid Validate() strategy. - tests: non-paged fallback + nil-safety. Out of scope (filed separately): treewalk SSE streams reasoning regardless of IncludeReasoning (pre-existing behaviour, not introduced here).

hallelx2 · 2026-06-06T13:01:06Z

Thanks @copilot-pull-request-reviewer and @coderabbitai — addressed in dd6697b:

Fixed (in scope):

auto.go — non-paged docs (no PageStart/PageEnd) now route to single-pass, never treewalk (treewalk navigates by page range); pick/Select/SelectWithCost are nil-safe. Added tests for both.
Rename completed in non-Go files — openapi.yaml + config.example*.yaml (incl. uppercase env-var refs). Zero pageindex remains anywhere now.
query.go — strategy-override comment now lists auto.
config_test.go — Validate() test now exercises auto.

Deferred (out of scope for this rename PR, tracked as HAL-115):

treewalk serveStream forwarding reasoning regardless of IncludeReasoning is pre-existing behaviour, not introduced here. Filed as its own issue to fix deliberately with tests.

Build + go vet + package tests all green.

coderabbitai · 2026-06-06T13:01:34Z

Tip

For best results, initiate chat on the files or code changes.

@hallelx2 Thanks for the detailed summary! The in-scope fixes look solid — particularly the auto.go improvements (non-paged docs routing to single-pass, nil-safe pick/Select/SelectWithCost) and completing the rename across the non-Go config/OpenAPI files. Good call deferring the serveStream + IncludeReasoning behaviour to a dedicated issue with proper test coverage.

Would you like me to:

Open a GitHub issue for HAL-115 (treewalk serveStream forwards reasoning events regardless of IncludeReasoning) as a follow-up so it's tracked publicly in this repo?
Trigger a re-review of the latest commit (dd6697b) to verify everything looks good?

Just let me know!

🐇✨

…ump, lint - toc_builder: fix the data race in verifyTitlesConcurrent — each goroutine now accumulates a local Usage and folds it under the mutex, instead of sharing &localUse across errgroup goroutines (concurrent usage.add writes). - deps: golang.org/x/net v0.53.0 -> v0.55.0 (GO-2026-5030); add toolchain go1.25.11 + pin CI go-version (stdlib GO-2026-5039 / GO-2026-5037). - lint: //lint:ignore U1000 the four staged helpers with reasons; drop the genuinely-unused idempotencyCache.mu field. These CI failures were pre-existing (surfaced on PR #35, not caused by the rename). go build + staticcheck green locally; -race verified in CI. Closes HAL-116.

hallelx2 · 2026-06-06T14:03:53Z

Pushed 45edc12 — greens the pre-existing CI baseline (these failures were on main, not introduced by the rename; tracked as HAL-116):

test data race — fixed a real race in toc_builder.verifyTitlesConcurrent (per-goroutine Usage merged under the lock).
govulncheck — x/net→v0.55.0 + toolchain go1.25.11 for the stdlib CVEs.
lint — //lint:ignore U1000 the 4 staged helpers (with reasons); removed the unused idempotencyCache.mu.

go build + staticcheck clean locally; -race verified by CI. Closes HAL-116.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

openapi.yaml (1)

306-310: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Finish the rename in the endpoint description.

Line 310 still says page-index responses uniformly, so the spec does not actually complete the PageIndex → TreeWalk rename.

Suggested diff

-            /v1/replay accepts page-index responses uniformly.
+            /v1/replay accepts TreeWalk responses uniformly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openapi.yaml` around lines 306 - 310, The endpoint description still uses the
old term "page-index responses uniformly" — update the wording in the
Differentiators block for the /v1/replay endpoint to complete the PageIndex →
TreeWalk rename (e.g., replace "page-index responses uniformly" with "TreeWalk
responses uniformly" or another consistent "TreeWalk" phrasing), and scan the
same description for any remaining PageIndex occurrences to ensure all
references (PageIndex, page-index) are renamed consistently to TreeWalk across
that endpoint's documentation.

config.example.yaml (1)

98-113: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The example config still opts users out of the new default.

This block still advertises chunked-tree as the default and leaves strategy: "chunked-tree" in place. Anyone bootstrapping from this file will miss the new auto routing behavior the PR is introducing.

Suggested diff

-  # strategy: single-pass | chunked-tree | agentic | treewalk
+  # strategy: auto | single-pass | chunked-tree | agentic | treewalk
   #
+  #   auto:         choose single-pass for small/non-paged docs and
+  #                 treewalk for larger paged docs.
   #   single-pass:  whole tree in one LLM call; fastest, smallest docs.
   #   chunked-tree: split the tree, reason over slices in parallel, merge.
-  #                 The default. Scales to any tree size by trading
+  #                 Scales to any tree size by trading
   #                 context for parallelism.
   #   agentic:      iterative outline → expand → read → done loop.
   #                 Picks per-section IDs via a tool-using model.
   #   treewalk:    TreeWalk-style page-based agentic loop. Three
   #                 tools (get_document_structure / get_pages / done);
   #                 the model navigates by INCLUSIVE PAGE RANGE
   #                 rather than by section ID. Best for paginated
   #                 documents (SEC filings, academic PDFs) where the
   #                 per-section interface is too noisy.
-  strategy: "chunked-tree"
+  strategy: "auto"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config.example.yaml` around lines 98 - 113, Update the example config to use
the new default routing by replacing the strategy value "chunked-tree" with
"auto" and adjust the surrounding comment to describe the "auto" behavior (and
that it is now the default) so users bootstrapping from config.example.yaml opt
into the new automatic routing; locate the strategy key (strategy:
"chunked-tree") and the explanatory comment block listing strategy options and
update both the value and the description text to reference "auto" as the
default and briefly summarize what "auto" does.

pkg/config/config_test.go (1)

783-807: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Isolate the bad-env test from the VLS_ alias too.

applyEnvOverrides() reads VLE_RETRIEVAL_TREEWALK_MAX_CITATIONS before VLS_RETRIEVAL_TREEWALK_MAX_CITATIONS. If the VLE_ value here is garbage but the ambient VLS_ alias is set, this test stops observing the default 3 and becomes environment-dependent.

Suggested diff

 func TestTreeWalkEnvOverrideRejectsBad(t *testing.T) {
 	prevHops := os.Getenv("VLE_RETRIEVAL_TREEWALK_MAX_HOPS")
 	prevLimit := os.Getenv("VLE_RETRIEVAL_TREEWALK_PAGE_CONTENT_LIMIT")
 	prevCits := os.Getenv("VLE_RETRIEVAL_TREEWALK_MAX_CITATIONS")
+	prevVLSCits := os.Getenv("VLS_RETRIEVAL_TREEWALK_MAX_CITATIONS")
 	defer func() {
 		os.Setenv("VLE_RETRIEVAL_TREEWALK_MAX_HOPS", prevHops)
 		os.Setenv("VLE_RETRIEVAL_TREEWALK_PAGE_CONTENT_LIMIT", prevLimit)
 		os.Setenv("VLE_RETRIEVAL_TREEWALK_MAX_CITATIONS", prevCits)
+		os.Setenv("VLS_RETRIEVAL_TREEWALK_MAX_CITATIONS", prevVLSCits)
 	}()
 
 	os.Setenv("VLE_RETRIEVAL_TREEWALK_MAX_HOPS", "abc")
 	os.Setenv("VLE_RETRIEVAL_TREEWALK_PAGE_CONTENT_LIMIT", "not-a-number")
 	os.Setenv("VLE_RETRIEVAL_TREEWALK_MAX_CITATIONS", "lots")
+	os.Unsetenv("VLS_RETRIEVAL_TREEWALK_MAX_CITATIONS")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/config/config_test.go` around lines 783 - 807,
TestTreeWalkEnvOverrideRejectsBad depends on the ambient VLS_ alias and
applyEnvOverrides reads VLE_ before VLS_, so when VLE_ is garbage but VLS_ is
set the test is flaky; update the test to also capture and restore the VLS_
environment variables (e.g. VLS_RETRIEVAL_TREEWALK_MAX_HOPS,
VLS_RETRIEVAL_TREEWALK_PAGE_CONTENT_LIMIT, VLS_RETRIEVAL_TREEWALK_MAX_CITATIONS)
like the existing VLE_ saves, and explicitly unset them (or set to empty) before
calling Default() and applyEnvOverrides() so the test always exercises
applyEnvOverrides with only the VLE_ garbage values.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@openapi.yaml`:
- Around line 666-668: The OpenAPI enum for QueryResponse.strategy omits the
newly supported "auto" strategy; update the schema so the enum for the
"strategy" property on QueryResponse includes "auto" (i.e., enum: [single-pass,
chunked-tree, agentic, treewalk, auto]) to match server behavior and avoid
client/schema validation errors when the handler emits "auto".

---

Outside diff comments:
In `@config.example.yaml`:
- Around line 98-113: Update the example config to use the new default routing
by replacing the strategy value "chunked-tree" with "auto" and adjust the
surrounding comment to describe the "auto" behavior (and that it is now the
default) so users bootstrapping from config.example.yaml opt into the new
automatic routing; locate the strategy key (strategy: "chunked-tree") and the
explanatory comment block listing strategy options and update both the value and
the description text to reference "auto" as the default and briefly summarize
what "auto" does.

In `@openapi.yaml`:
- Around line 306-310: The endpoint description still uses the old term
"page-index responses uniformly" — update the wording in the Differentiators
block for the /v1/replay endpoint to complete the PageIndex → TreeWalk rename
(e.g., replace "page-index responses uniformly" with "TreeWalk responses
uniformly" or another consistent "TreeWalk" phrasing), and scan the same
description for any remaining PageIndex occurrences to ensure all references
(PageIndex, page-index) are renamed consistently to TreeWalk across that
endpoint's documentation.

In `@pkg/config/config_test.go`:
- Around line 783-807: TestTreeWalkEnvOverrideRejectsBad depends on the ambient
VLS_ alias and applyEnvOverrides reads VLE_ before VLS_, so when VLE_ is garbage
but VLS_ is set the test is flaky; update the test to also capture and restore
the VLS_ environment variables (e.g. VLS_RETRIEVAL_TREEWALK_MAX_HOPS,
VLS_RETRIEVAL_TREEWALK_PAGE_CONTENT_LIMIT, VLS_RETRIEVAL_TREEWALK_MAX_CITATIONS)
like the existing VLE_ saves, and explicitly unset them (or set to empty) before
calling Default() and applyEnvOverrides() so the test always exercises
applyEnvOverrides with only the VLE_ garbage values.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ca0e5052-7ad9-42c3-ab14-a52aac0e4eca

📥 Commits

Reviewing files that changed from the base of the PR and between 610b265 and 45edc12.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (15)

.github/workflows/ci.yml
config.example.yaml
config.server.example.yaml
go.mod
internal/api/treewalk.go
internal/handler/query.go
internal/middleware/cors.go
internal/middleware/idempotency.go
openapi.yaml
pkg/config/config_test.go
pkg/ingest/ingest.go
pkg/ingest/toc_builder.go
pkg/parser/pdf.go
pkg/retrieval/auto.go
pkg/retrieval/auto_test.go

✅ Files skipped from review due to trivial changes (5)

config.server.example.yaml
internal/middleware/cors.go
internal/handler/query.go
pkg/parser/pdf.go
pkg/ingest/ingest.go

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/retrieval/auto.go
internal/api/treewalk.go

coderabbitai · 2026-06-06T14:09:47Z

        strategy:
          type: string
-          enum: [single-pass, chunked-tree, agentic, pageindex]
+          enum: [single-pass, chunked-tree, agentic, treewalk]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add auto to QueryResponse.strategy.

auto is now a supported first-class retrieval strategy, but the response enum still excludes it. That leaves the published contract narrower than the server/config behavior and can break generated clients or schema validation if the handler serializes the selected strategy name directly.

Suggested diff

- enum: [single-pass, chunked-tree, agentic, treewalk] + enum: [auto, single-pass, chunked-tree, agentic, treewalk]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

strategy:

type: string

enum: [single-pass, chunked-tree, agentic, pageindex]

enum: [single-pass, chunked-tree, agentic, treewalk]

strategy:

type: string

enum: [auto, single-pass, chunked-tree, agentic, treewalk]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openapi.yaml` around lines 666 - 668, The OpenAPI enum for QueryResponse.strategy omits the newly supported "auto" strategy; update the schema so the enum for the "strategy" property on QueryResponse includes "auto" (i.e., enum: [single-pass, chunked-tree, agentic, treewalk, auto]) to match server behavior and avoid client/schema validation errors when the handler emits "auto".

- pkg/parser: serialize pdftable.OpenBytes behind a package mutex (openPDFBytes). pdftable mutates package-level state on open and is not concurrency-safe; the -race detector flagged concurrent opens (real in prod — ingest workers parse in parallel). Stopgap; the proper fix belongs in pdftable (Foundational Libs). - convert the unused-helper suppressions from staticcheck //lint:ignore to golangci-lint //nolint:unused,staticcheck (CI lint uses golangci-lint). go build + go vet + parser tests green.

…heck (HAL-117) Completes the errcheck/ineffassign cleanup across the repo, including the renamed files (treewalk.go / answer_treewalk.go) the off-main pass couldn't reach: - errcheck: _, _ = fmt.Fprintf on the SSE writers; unchecked returns handled across cmd/benchmark, handlers, connecthandler, parser, storage, queue, db, ingest. - ineffassign: drop the ineffectual 'written += remaining' before break in treewalk_strategy. - staticcheck: real TLS-disabled assertion (was an empty branch) in config_test; De Morgan's rewrite in retrieval_test; drop embedded-field selector in decompose_test. - misspell: ignore-rule for 'strat' (our 'strategy' abbreviation) in .golangci.yml. - formatting normalized via the golangci formatters. golangci-lint run: 0 issues. go build + unit tests green. Closes HAL-117.

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/benchmark/main.go`:
- Around line 104-110: Benchmark currently records durations for failed requests
(e.g., the REST call via restGET and gRPC via healthClient.Check) which skews
latency stats; change each measured block (places using restGET,
healthClient.Check and similar REST/gRPC calls that append to
r.rest/r.grpc/r.rerpc etc.) to capture and check the error/result and only
append time.Since(start) when the call succeeded, otherwise increment a failure
counter (e.g., r.restFailures/r.grpcFailures or a shared successes/failures map)
so latency arrays contain only successful-sample durations and failures are
tracked separately; update all similar blocks (the other ranges mentioned) to
follow the same pattern using the existing variables like ctx,
connect.NewRequest, r.rest and r.grpc to locate the spots to change.
- Line 80: The json.Unmarshal call in cmd/benchmark/main.go is ignoring errors
which can leave sectionID empty and cause the "Get Section" benchmark to be
silently skipped; change the unmarshalling of treeResp into treeData (the
json.Unmarshal(treeResp, &treeData) call) to check the returned error, and on
failure log an explicit setup error (including the error detail) and mark the
benchmark run as failed or abort further section selection (e.g., return a
non-zero exit or set a setupFailure flag) so the missing sectionID is not
treated as a successful run; ensure downstream code that uses sectionID (section
selection / "Get Section" benchmark) respects this failure signal.

In `@internal/api/server.go`:
- Around line 395-399: The ReadAll result is used even when io.ReadAll returns
an error; update the rc handling (the block after d.Storage.Get returning err ==
nil where variables rc, raw, content are used) to check the error returned by
io.ReadAll, close rc best-effort, and discard the section by setting content =
"" (or skip adding the section) when ReadAll fails so truncated/corrupted bytes
aren’t used; apply the same fix to the other similar blocks referenced (the
occurrences around lines 538-542, 835-839, 1077-1081) to consistently handle
ReadAll errors.

In `@internal/handler/answer.go`:
- Around line 169-173: The code ignores the error returned by io.ReadAll after
h.storage.Get, which can yield truncated content; change the block around
h.storage.Get / io.ReadAll to check the read error (the second return from
io.ReadAll), ensure rc is closed (best-effort) even on read failure, and if
io.ReadAll returns an error do not use the partial raw content — either skip
adding this section to content or return/propagate an error; update the handling
where variables rc, raw, getErr are used so only successfully-read data is
assigned to content.

In `@internal/handler/documents.go`:
- Around line 428-431: The code currently ignores errors from io.ReadAll and may
return truncated section content; change the read to capture the error (raw,
readErr := io.ReadAll(rc)) and if readErr != nil then ensure rc is closed and
propagate a read/storage error instead of serializing partial bytes — e.g.,
log/report readErr and return an appropriate HTTP error (or skip including the
section body) so that h.storage.Get / sec.ContentRef reads never produce a 200
with truncated data; update uses of rc.Close() accordingly to always close in
both success and error paths.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 89368ae7-3f90-426d-899e-b14165d27cc8

📥 Commits

Reviewing files that changed from the base of the PR and between 45edc12 and 7407704.

📒 Files selected for processing (42)

.golangci.yml
cmd/benchmark/main.go
cmd/engine/main.go
cmd/server/main.go
internal/api/replay_test.go
internal/api/server.go
internal/api/treewalk.go
internal/api/treewalk_test.go
internal/connecthandler/documents.go
internal/connecthandler/query.go
internal/handler/answer.go
internal/handler/answer_treewalk.go
internal/handler/documents.go
internal/handler/query.go
internal/handler/query_multi.go
internal/handler/query_strategy_test.go
internal/handler/query_stream.go
internal/handler/query_stream_multi.go
internal/middleware/cors.go
internal/middleware/idempotency.go
pkg/cache/cache_test.go
pkg/config/config_test.go
pkg/db/db.go
pkg/ingest/hyde.go
pkg/ingest/ingest.go
pkg/ingest/minimal_mode_test.go
pkg/ingest/toc_builder.go
pkg/parser/chunk_test.go
pkg/parser/docx.go
pkg/parser/pdf.go
pkg/queue/asynq.go
pkg/queue/asynq_test.go
pkg/queue/qstash.go
pkg/queue/river_test.go
pkg/retrieval/agentic.go
pkg/retrieval/decompose_test.go
pkg/retrieval/retrieval_test.go
pkg/retrieval/single_pass.go
pkg/retrieval/treewalk_strategy.go
pkg/storage/local.go
pkg/storage/s3_test.go
pkg/tree/compact_test.go

💤 Files with no reviewable changes (1)

pkg/retrieval/treewalk_strategy.go

✅ Files skipped from review due to trivial changes (13)

.golangci.yml
pkg/retrieval/agentic.go
internal/handler/query_multi.go
pkg/retrieval/single_pass.go
pkg/queue/qstash.go
pkg/tree/compact_test.go
internal/middleware/cors.go
pkg/queue/asynq.go
internal/handler/query_stream_multi.go
pkg/ingest/hyde.go
pkg/parser/docx.go
internal/connecthandler/query.go
pkg/parser/chunk_test.go

🚧 Files skipped from review as they are similar to previous changes (9)

internal/handler/query.go
pkg/ingest/minimal_mode_test.go
internal/middleware/idempotency.go
pkg/ingest/toc_builder.go
internal/handler/query_strategy_test.go
internal/api/treewalk_test.go
cmd/engine/main.go
internal/api/treewalk.go
pkg/config/config_test.go

coderabbitai · 2026-06-06T16:13:49Z

 		} `json:"sections"`
 	}
-	json.Unmarshal(treeResp, &treeData)
+	_ = json.Unmarshal(treeResp, &treeData) // benchmark best-effort parse


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle tree parse failure explicitly before section benchmark selection.

If json.Unmarshal fails, sectionID remains empty and “Get Section” benchmarking is silently skipped. Log/track this as a failed setup step so results aren’t misread as complete.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/benchmark/main.go` at line 80, The json.Unmarshal call in cmd/benchmark/main.go is ignoring errors which can leave sectionID empty and cause the "Get Section" benchmark to be silently skipped; change the unmarshalling of treeResp into treeData (the json.Unmarshal(treeResp, &treeData) call) to check the returned error, and on failure log an explicit setup error (including the error detail) and mark the benchmark run as failed or abort further section selection (e.g., return a non-zero exit or set a setupFailure flag) so the missing sectionID is not treated as a successful run; ensure downstream code that uses sectionID (section selection / "Get Section" benchmark) respects this failure signal.

coderabbitai · 2026-06-06T16:13:50Z

+		_, _ = restGET(base + "/v1/health")
 		r.rest = append(r.rest, time.Since(start))

 		// gRPC (Connect)
 		start = time.Now()
-		healthClient.Check(ctx, connect.NewRequest(&v1.HealthCheckRequest{}))
+		_, _ = healthClient.Check(ctx, connect.NewRequest(&v1.HealthCheckRequest{}))
 		r.grpc = append(r.grpc, time.Since(start))


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t include failed calls in latency samples.

All benchmark paths discard request errors, so network/protocol failures are still timed and recorded as “successful” latency. This skews median/p95 and can invert REST vs gRPC conclusions. Track failures separately and only append durations for successful calls (or report success-rate alongside latency).

Also applies to: 121-126, 137-144, 156-163, 185-194

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/benchmark/main.go` around lines 104 - 110, Benchmark currently records durations for failed requests (e.g., the REST call via restGET and gRPC via healthClient.Check) which skews latency stats; change each measured block (places using restGET, healthClient.Check and similar REST/gRPC calls that append to r.rest/r.grpc/r.rerpc etc.) to capture and check the error/result and only append time.Since(start) when the call succeeded, otherwise increment a failure counter (e.g., r.restFailures/r.grpcFailures or a shared successes/failures map) so latency arrays contain only successful-sample durations and failures are tracked separately; update all similar blocks (the other ranges mentioned) to follow the same pattern using the existing variables like ctx, connect.NewRequest, r.rest and r.grpc to locate the spots to change.

coderabbitai · 2026-06-06T16:13:50Z

 		rc, _, err := d.Storage.Get(r.Context(), sec.ContentRef)
 		if err == nil {
 			raw, _ := io.ReadAll(rc)
-			rc.Close()
+			_ = rc.Close() // best-effort close
 			content = string(raw)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Discard section content when storage reads fail.

These paths ignore io.ReadAll errors and still use whatever bytes were returned. A short read will quietly surface truncated text in /v1/sections, /v1/query, /v1/answer, and /v1/query/multi, which can ground answers on corrupted evidence instead of failing or dropping the section.

Representative fix

- raw, _ := io.ReadAll(rc) - _ = rc.Close() // best-effort close - content = string(raw) + raw, readErr := io.ReadAll(rc) + _ = rc.Close() // best-effort close + if readErr == nil { + content = string(raw) + }

Also applies to: 538-542, 835-839, 1077-1081

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/api/server.go` around lines 395 - 399, The ReadAll result is used even when io.ReadAll returns an error; update the rc handling (the block after d.Storage.Get returning err == nil where variables rc, raw, content are used) to check the error returned by io.ReadAll, close rc best-effort, and discard the section by setting content = "" (or skip adding the section) when ReadAll fails so truncated/corrupted bytes aren’t used; apply the same fix to the other similar blocks referenced (the occurrences around lines 538-542, 835-839, 1077-1081) to consistently handle ReadAll errors.

coderabbitai · 2026-06-06T16:13:50Z

 			rc, _, getErr := h.storage.Get(r.Context(), sec.ContentRef)
 			if getErr == nil {
 				raw, _ := io.ReadAll(rc)
-				rc.Close()
+				_ = rc.Close() // best-effort close
 				content = string(raw)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't synthesize from partially read section bodies.

io.ReadAll errors are ignored here, so a short storage read still becomes citation / answer input. That can return a grounded answer from truncated evidence instead of skipping the section or failing the request.

Representative fix

- raw, _ := io.ReadAll(rc) - _ = rc.Close() // best-effort close - content = string(raw) + raw, readErr := io.ReadAll(rc) + _ = rc.Close() // best-effort close + if readErr != nil { + h.logger.Warn("answer: section content read failed", "section_id", sec.ID, "err", readErr) + continue + } + content = string(raw)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/handler/answer.go` around lines 169 - 173, The code ignores the error returned by io.ReadAll after h.storage.Get, which can yield truncated content; change the block around h.storage.Get / io.ReadAll to check the read error (the second return from io.ReadAll), ensure rc is closed (best-effort) even on read failure, and if io.ReadAll returns an error do not use the partial raw content — either skip adding this section to content or return/propagate an error; update the handling where variables rc, raw, getErr are used so only successfully-read data is assigned to content.

coderabbitai · 2026-06-06T16:13:50Z

 		rc, _, getErr := h.storage.Get(r.Context(), sec.ContentRef)
 		if getErr == nil {
 			raw, _ := io.ReadAll(rc)
-			rc.Close()
+			_ = rc.Close() // best-effort close


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid returning partial section content on read errors.

This handler ignores io.ReadAll errors and still serializes whatever bytes were fetched. A short storage read becomes a 200 with truncated section text, which is indistinguishable from a real section body to clients.

Representative fix

- raw, _ := io.ReadAll(rc) - _ = rc.Close() // best-effort close - content = string(raw) + raw, readErr := io.ReadAll(rc) + _ = rc.Close() // best-effort close + if readErr == nil { + content = string(raw) + }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/handler/documents.go` around lines 428 - 431, The code currently ignores errors from io.ReadAll and may return truncated section content; change the read to capture the error (raw, readErr := io.ReadAll(rc)) and if readErr != nil then ensure rc is closed and propagate a read/storage error instead of serializing partial bytes — e.g., log/report readErr and return an appropriate HTTP error (or skip including the section body) so that h.storage.Get / sec.ContentRef reads never produce a 200 with truncated data; update uses of rc.Close() accordingly to always close in both success and error paths.

…6/117) - lint: the CI 'lint' job runs BOTH golangci-lint and standalone staticcheck (honnef), and the latter reads //lint:ignore, not golangci's //nolint. Rather than juggle two directive formats, delete the 4 genuinely-unused funcs (treeWalkTraceTokenFromCitations, originMatches, isTimeout, isEncryptedPDFError) — both linters agree they're dead. (Re-add with callers when HAL-73/HAL-112 land.) - race: pdftable.ExtractTables also mutates package-level state — a second pdftable race distinct from OpenBytes; serialize it on the same mutex. Stopgap; the root fix is HAL-118 (make pdftable concurrency-safe). golangci-lint + standalone staticcheck: 0 issues. go build + unit tests green.

The ubuntu-only 'Verify go.mod/go.sum are tidy' CI step requires the canonical format go mod tidy emits: a blank line between the 'go' and 'toolchain' directives. My earlier manual insertion of the toolchain line omitted it.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/parser/pdf.go (1)
1605-1637: ⚠️ Potential issue | 🔴 Critical

Fix: safeExtractTables can deadlock the global pdftableOpenMu on timeout/hang or panic

safeExtractTables locks pdftableOpenMu immediately before page.ExtractTables(settings), but on timeout the caller returns while the goroutine keeps running; if ExtractTables wedges, the global lock stays held and blocks later openPDFBytes/table extraction.

If page.ExtractTables panics, the goroutine’s recover runs but the mutex unlock is not deferred, so pdftableOpenMu.Unlock() is skipped and the mutex remains permanently locked.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/parser/pdf.go` around lines 1605 - 1637, safeExtractTables currently
locks pdftableOpenMu and then calls page.ExtractTables without deferring the
unlock, which can permanently lock the mutex on panic or timeout; fix by moving
pdftableOpenMu.Lock() into the goroutine and immediately calling defer
pdftableOpenMu.Unlock() (before invoking page.ExtractTables) so the mutex is
always released on return/recover, and ensure the goroutine still sends result
on the done channel; keep references to pdftableOpenMu, safeExtractTables,
page.ExtractTables, and the done/result channel when making this change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/parser/pdf.go`:
- Around line 1605-1637: safeExtractTables currently locks pdftableOpenMu and
then calls page.ExtractTables without deferring the unlock, which can
permanently lock the mutex on panic or timeout; fix by moving
pdftableOpenMu.Lock() into the goroutine and immediately calling defer
pdftableOpenMu.Unlock() (before invoking page.ExtractTables) so the mutex is
always released on return/recover, and ensure the goroutine still sends result
on the done channel; keep references to pdftableOpenMu, safeExtractTables,
page.ExtractTables, and the done/result channel when making this change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 26e5083c-3a99-44a3-a0fd-78159420f3dd

📥 Commits

Reviewing files that changed from the base of the PR and between 7407704 and 18b5c1d.

📒 Files selected for processing (5)

go.mod
internal/api/treewalk.go
internal/middleware/cors.go
pkg/ingest/ingest.go
pkg/parser/pdf.go

💤 Files with no reviewable changes (3)

internal/middleware/cors.go
pkg/ingest/ingest.go
internal/api/treewalk.go

🚧 Files skipped from review as they are similar to previous changes (1)

go.mod

hallelx2 added 2 commits June 5, 2026 22:42

Copilot AI review requested due to automatic review settings June 6, 2026 12:39

sourcery-ai Bot reviewed Jun 6, 2026

View reviewed changes

Copilot started reviewing on behalf of hallelx2 June 6, 2026 12:39 View session

Copilot AI reviewed Jun 6, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

Copilot started work on behalf of hallelx2 June 6, 2026 13:01 View session

Copilot finished work on behalf of hallelx2 June 6, 2026 13:02

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

hallelx2 added 2 commits June 6, 2026 15:28

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

hallelx2 added 2 commits June 6, 2026 17:24

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

hallelx2 merged commit 2051693 into main Jun 6, 2026
8 checks passed

hallelx2 deleted the halleluyaholudele/hal-106-treewalk-rename branch June 6, 2026 20:09

		for _, s := range []string{"single-pass", "chunked-tree", "agentic", "treewalk"} {
		cfg := Default()

Conversation

hallelx2 commented Jun 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's in here

Verification

Not in here (tracked separately)

Summary by CodeRabbit

Summary by CodeRabbit

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hallelx2 commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026

Uh oh!

hallelx2 commented Jun 6, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hallelx2 commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading