Skip to content

fix(usn): live journal correctness + near-live search freshness + ext-index/metadata backfill#475

Merged
githubrobbi merged 11 commits into
mainfrom
fix/update-daemon-logging-and-usn-ext-index
Jun 26, 2026
Merged

fix(usn): live journal correctness + near-live search freshness + ext-index/metadata backfill#475
githubrobbi merged 11 commits into
mainfrom
fix/update-daemon-logging-and-usn-ext-index

Conversation

@githubrobbi

Copy link
Copy Markdown
Collaborator

Summary

Fixes a chain of correctness and freshness bugs in the live USN-journal path that surfaced during the v0.6.13 Windows test runs, plus the secondary issues found along the way. Validated end-to-end on a live Windows MFT (7 drives, 26M records): creates, renames, deletes, --ext filtering, and metadata all resolve correctly within seconds.

What was broken

  1. Search-visibility lag (the headline fix). The live USN path buffered changes in accept but only patched the searchable in-memory body inside trigger_save, which fires solely on the 50k-event / 5-min save threshold. A freshly created/renamed/deleted file was invisible to search for up to 5 minutes.
  2. Delete / rename / FRS-reuse were mishandled in the journal apply (net-state resolution + slot overwrite/re-intern).
  3. Journal-created files had no extension index entry (--ext pdf missed them) and no real size/timestamps (size 0, 1601-epoch) until a full reload.
  4. --ext + --limit returned empty when the record-level filter ran after the per-drive limit cutoff.
  5. Self-update left the relaunched daemon attached + leftover diag prints.
  6. RUSTSEC-2026-0186 (memmap2 unsoundness).

Key change: decoupled apply / save cadence

The fix splits the body-apply cadence from the disk-save cadence:

  • New ApplyTrigger + PatchSink::trigger_apply + IndexManager::handle_journal_apply patch + Arc-swap the in-memory body without the compact-cache disk write or cursor persist.
  • trigger_save keeps the full patch-plus-persist on its rare cadence. A save subsumes an apply, so fire_due_cadence fires at most one per poll.
  • Only a real body save advances the on-disk cursor → cold-boot replay window unchanged, idempotent.

Performance: search/read/poll hot paths are untouched. Each apply rebuilds the children CSR + trigram + ext index (~600ms/7M records, independent of batch size), so the default apply interval is 30s (UFFS_USN_APPLY_INTERVAL_MS) — sustained churn stays background noise (~2% of a core/drive) while the first change after any quiet period still applies within a poll or two. The long-documented-but-dead UFFS_USN_POLL_INTERVAL_MS override is now wired too.

The cadence state machines moved to journal_loop/triggers.rs (file-size policy).

Tests

  • 8 apply-cadence tests (mutual-exclusion, idle no-op, end-to-end near-live apply without save, full ApplyTrigger state machine).
  • 3 sink tests (Apply-message drain, empty no-op, cursor-never-persisted-on-apply).
  • USN net-state resolution + apply overwrite/re-intern + metadata backfill tests.
  • Search filter-before-limit regression tests.
  • Existing threshold/wrap tests pin apply disabled so save behaviour is unchanged.

Tooling

scripts/windows/usn-verify.rs — a controlled, repeatable create/search/rename/delete harness that captures each search to CSV and extracts the per-poll USN dispositions, pinned to uffs.exe (so a shared transcript can't hit the C++ uffs.com via PATHEXT).

Validation (live Windows MFT)

Op Result
Create → name + --ext pdf/dll/log ✅ all 4 searchable within ~1–3s
Metadata backfill ✅ size=5000, real timestamps (not 1601)
Rename charlie.log→.pdf ✅ moves into --ext pdf, out of --ext log
Delete bravo.dll ✅ drops from --ext dll

🤖 Generated with Claude Code

githubrobbi and others added 11 commits June 24, 2026 21:49
After a v0.6.12 to v0.6.13 self-update on Windows, the daemon's INFO logs
and a batch of [diag] lines poured into the user's interactive console.

Root cause 1 - non-detached restart: the updater's restore path replays
the daemon's captured argv with a bare Command::spawn(), inheriting the
updater's console. The normal --daemon start path detaches and nulls
stdio via the client spawn helper; the restart bypassed it. Add
detach_stdio() to null all three handles and, on Windows, set
DETACHED_PROCESS plus CREATE_NO_WINDOW. The relaunched service no longer
writes to the console.

Root cause 2 - leftover diagnostics: unconditional eprintln diag lines
("remove after D: drive issue") shipped in release and became visible
once the restart inherited the console. Removed them: the load-failure,
panic and timeout ones were already duplicated by adjacent
tracing::error! calls; the startup-parameter dump is now tracing::debug!
(available under --log-level debug, silent by default). Dropped the
now-unfulfilled clippy print_stderr/use_debug expects. The gated diag
block in daemon_mgmt.rs stays - it already only prints under debug.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A file created while the daemon is live (via the USN journal patch path)
was findable by name but invisible to `--ext`. Field report on v0.6.13:
`uffs PCL5-Assessment-2026-06-25.pdf` returned the file, but
`uffs pcl5 --ext pdf` returned nothing.

Root cause: apply_usn_patch's create branch built the new CompactRecord
with a hardcoded `extension_id: 0` (the reserved "no extension" id). The
function then rebuilds the ExtensionIndex from each record's
extension_id, so the new pdf was filed under "no extension"; the `--ext`
fast path (resolve_ext_ids → ExtensionIndex.get) never returned it. The
re-animate branch (FRS reuse after delete) had the same gap, keeping the
slot's pre-delete extension_id.

Fix: add DriveCompactIndex::intern_extension(), mirroring the build-time
MftIndex::intern_extension + ExtensionTable::intern rules (last dot, not
a dotfile, not trailing; lowercase; find-or-append into ext_names, u16
ceiling → 0). Call it on both the create and re-animate paths so the new
record carries the same extension_id a full rebuild would assign.

Tests: apply_usn_patch_created_record_is_findable_by_extension pins the
whole chain (intern → record tag → inverted index) and fails on the old
code; a companion test pins dotless/dotfile creates to the no-ext id
without polluting ext_names.

(Note: USN-created records still carry size 0 / zero timestamps because
the FileChange only conveys name + parent — a separate, pre-existing
limitation, not addressed here.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cargo-deny started failing (advisories) on a freshly-published unsound
advisory: RUSTSEC-2026-0186, unchecked pointer arithmetic in memmap2's
advise_range / unchecked_advise_range / flush_range / flush_async_range.
Pre-existing on main (memmap2 0.9.10 comes in transitively via
polars-io), surfaced now because the advisory just landed.

Patched in 0.9.11. cargo update -p memmap2 --precise 0.9.11 (polars-io's
^0.9 requirement is satisfied; no other resolutions move). Reviewed the
full source diff and recorded a real cargo-vet audit: it is exactly the
remediation (bounds-checked pointer math, unsafe_op_in_unsafe_fn
hardening with SAFETY comments, MSRV 1.63->1.65, a doc typo) with no new
capabilities or deps. cargo vet prune then retired three now-redundant
local exemptions (rmp, rmp-serde, fallible-streaming-iterator) that
upstream trusted audits now cover.

cargo deny: advisories ok, bans ok, licenses ok, sources ok.
cargo vet check: clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Field report (v0.6.13): `uffs pcl5 --ext pdf --limit 5` returned nothing
even though PCL5-Assessment-*.pdf exists, matches the substring, and has
extension pdf. `uffs <name>` found it; adding `--ext pdf` lost it.

Root cause: every non-match-all per-drive collector (substring, prefix,
regex, tree) truncated its match list to `--limit` using only the NAME
predicate, then the extension/size/date/etc. filters ran afterwards in
apply_search_filters on the already-truncated rows. When the matches
that share the substring outnumber the limit before the wanted
extension appears (eight PCL5*.DLL ahead of one PCL5*.pdf), the limit
fills with .dll rows and `--ext pdf` removes all of them. The match-all
(`*`) path never had this bug because it pushes the filters into the
scan via collect_global_top_n.

Fix (option A — push the filters into the scan, like the match-all
path): thread the SearchFilters into search_compact_drive / _prefix /
_regex / _tree. Each resolves the extension filter for its own drive
once (cloned local to keep the parallel scan free of &mut contention,
matching numeric_top_n) and applies SearchFilters::matches_record per
candidate BEFORE the limit cutoff. matches_record is a no-op for an
empty filter set, so unfiltered searches are unchanged. The tree path
over-fetches (limit -> usize::MAX) only when an extension filter is
active — gated on the single stable `extensions` field so it can't
drift out of sync with matches_record — then re-applies the limit.

Test: search_substring_with_extension_filter_survives_limit builds a
drive with eight pcl5 .dll files ahead of one pcl5 .pdf and asserts both
the unlimited query (proves the filter logic is correct) and the
limit=5 query (the regression) return the pdf. Fails on the old code
with `got []`, matching the field report exactly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l path

Windows field testing of live file create/delete/rename surfaced three
USN-delta correctness bugs the static-fixture tests never exercised:

1. aggregate_changes lost event order. It OR-ed independent
   created/deleted/renamed bools, so a delete→recreate that reuses an
   MFT record number (NTFS does this routinely) collapsed to
   created==true && deleted==true. apply_usn_patch checks `deleted`
   first, tombstoning the brand-new file → it vanished ("delta.pdf"
   dropped; recreate-after-delete lost files). Fix: resolve the net
   state in USN (time) order — the LAST create/delete/rename wins — and
   make change_type() prefer DELETE when a single close record carries
   both bits (transient temp file nets to gone).

2. apply_usn_patch skipped a create onto a *live* reused slot. It only
   re-animated tombstones; a `created` event whose FRS still mapped to a
   live (stale) record was dropped. A create always means the record
   number now holds a new file, so overwrite the slot wholesale
   (resetting per-file metrics). Unifies the tombstone + stale-live
   paths via overwrite_slot().

3. rename never re-interned the extension. charlie.log → charlie.pdf
   kept extension_id=log, so `--ext pdf` missed it and `--ext log` still
   matched. Fix: re-intern + refresh name_first_byte on rename.

Refactor: split the per-change dispatch into apply_delete / apply_create
/ apply_rename helpers (retires the too_many_lines + cognitive_complexity
pressure and the now-wrong "don't split" expectation). Add a per-change
TRACE and a per-batch DEBUG summary so the next live investigation can
see each event's disposition with `--log-level debug|trace`.

Tests (the missing coverage that let these ship): aggregate_changes had
ZERO tests — added net-resolution + change_type-priority pins; added
apply-level pins for rename-reinterns-extension and
create-replaces-live-reused-slot. All fail on the old code.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
USN records carry only FRS + parent + name, so a file created or renamed
while the daemon is live landed in the index with size 0 and zero
(1601-epoch) timestamps — the "size 0 / 1600-12-31" rows seen in the
field test. The cold/cache-load USN path already recovers real metadata
via a targeted MFT read (reader/usn_apply.rs); only the live journal loop
lacked it.

Add the same recovery to the live path, off the hot path:

- FileChange gains an optional RecordMeta (size/allocated/created/
  modified/accessed/flags), representation identical to CompactRecord so
  it copies straight in. None today = byte-for-byte current behaviour.
- apply_usn_patch's create (append + reused-slot overwrite) and rename
  paths apply meta when present, else keep the zeroed metrics.
- WindowsJournalSource::poll backfills it: collect the created/renamed
  FRSes, VolumeHandle::open (auto-adopts the broker handle, non-elevated,
  the same path the USN read uses), one batched read_targeted_frs_records
  into a scratch MftIndex, then attach each record's metadata.

Hot-path safety: the targeted read runs in poll() on the spawn_blocking
thread, BEFORE the registry write-lock is taken in accept(), so it never
lengthens the lock hold or touches the query path. Best-effort — any
failure (volume open / read) leaves meta None and the record keeps its
zeroed metrics (no crash, no regression). Bounded at 4096 targeted reads
per 500 ms poll so a bulk unzip can't stall one poll; the overflow
backfills next poll. Visible on the same ~500 ms cadence as the rest of
the journal refresh.

Tests: apply_usn_patch_applies_backfilled_metadata pins meta application
across the append, overwrite, and no-meta paths. The Windows enrichment
is cfg(windows); verified via cargo xwin clippy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…efore merge)

The field run showed the live USN path never patches the index until the
50k-event / 5-min save threshold: poll() -> sink.accept() only BUFFERS
into `pending`, and only trigger_save() drains + applies. A 7-second test
never hit the threshold, so newly created files were never searchable
(searches returned cold-loaded $Recycle.Bin stale copies). The earlier
diagnostic directive (uffs_core::compact_loader=trace) also didn't take
effect, hiding apply_usn_patch entirely.

Add INFO-level, grep-able `USNFIX` instrumentation so a live run can
prove the cause without a per-target filter:
  * startup: a build marker line (confirm the running daemon is THIS build)
  * accept(): "BUFFERED only (body not patched until save tick)" per poll
  * trigger_save(): "draining buffer -> apply" when the threshold finally fires
  * apply_usn_patch(): "body patched" with created/deleted/renamed counts

Expected on the next run: repeated USNFIX accept lines every ~500 ms, and
NO USNFIX trigger_save / "body patched" line until 5 min elapses — which
is the visibility-lag bug to fix next (decouple body-apply cadence from
the disk-cache-save threshold).

All four sites carry the literal `USNFIX` marker; `grep -rn USNFIX crates/`
locates them for removal once the fix lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…live

The live USN path buffered changes in `accept` but only patched the
searchable in-memory body inside `trigger_save`, which fires solely on
the 50k-event / 5-min save threshold. A freshly created, renamed, or
deleted file therefore stayed invisible to search for up to 5 minutes.

Decouple the body-apply cadence from the disk-save cadence:

- New `ApplyTrigger` (default 2s, `UFFS_USN_APPLY_INTERVAL_MS`) fires a
  new `PatchSink::trigger_apply` when buffered churn exists, draining the
  same pending buffer into `ApplyMsg::Apply`. The applier patches + swaps
  the in-memory body via the new `IndexManager::handle_journal_apply`,
  WITHOUT the compact-cache disk write or cursor persist.
- `trigger_save` keeps the full patch-plus-persist on its rare cadence.
  A save subsumes an apply (it drains + applies the same buffer), so
  `fire_due_cadence` fires at most one of the two per poll and resets the
  apply trigger after a save.
- Only a real body save advances the on-disk cursor, so a cold start
  re-replays the apply-only deltas from the last saved cursor against the
  freshly loaded body — idempotent, same cold-boot window as before.

`handle_journal_save` is refactored over a shared `apply_to_body` core
(no behaviour change on the save path). The long-documented but never
wired `UFFS_USN_POLL_INTERVAL_MS` override is implemented alongside the
new apply override via `JournalLoopConfig::from_env`.

The cadence state machines + their threshold constants move to a new
`journal_loop/triggers.rs` module, keeping `journal_loop.rs` under the
800-LOC file-size policy.

Tests: 8 apply-cadence tests (process_tick mutual-exclusion, idle no-op,
end-to-end near-live apply without save, full ApplyTrigger state machine)
and 3 sink tests (Apply-message drain, empty no-op, cursor-never-persisted
on apply). All existing threshold/wrap tests pin apply disabled so save
behaviour is unchanged.

scripts/windows/usn-verify.rs: harden the binary pin to uffs.exe and make
the cosmetic `$ ...` echoes copy-paste-safe, so a shared transcript can't
accidentally invoke the C++ `uffs.com` (which shadows `uffs.exe` under
Windows PATHEXT when called as bare `uffs`).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nism

With the apply/save cadence split, a created/renamed/deleted file is now
searchable within the apply interval (default 2s) instead of waiting up
to 5 min for a disk-save tick. The verify harness sets
UFFS_USN_APPLY_INTERVAL_MS=500 on the test daemon so the near-live body
patch fires on essentially the first poll that sees the events, making
the existing 3s POLL_SETTLE windows a comfortable margin rather than a
race against the 2s default. The env var is inherited by the detached
uffsd (CreateProcessW lpEnvironment=NULL), so it reaches
JournalLoopConfig::from_env on the daemon side.

No 5-minute break is needed in the run anymore.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The apply-cadence fix is validated on a live Windows MFT (creates,
renames, deletes, ext-index, and metadata backfill all resolve within
seconds), so the scaffolding added to confirm which build was running and
trace the buffer/apply flow is no longer needed.

Removes: the build marker in startup.rs, the BUFFERED/trigger_save/
trigger_apply INFO traces in journal_sink.rs, and the INFO mirror line in
compact_loader::log_batch_summary. The permanent `usn apply:` /
`usn backfill:` DEBUG/TRACE diagnostics stay.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each apply rebuilds the children CSR + trigram + extension index (~600ms
on a 7M-record drive, independent of batch size). At a 2s default that is
up to ~30% of a background core per drive under sustained churn. Raise the
default to 30s so a constantly-changing filesystem keeps the rebuild as
background noise (~600ms / 30s ≈ 2% of a core) rather than a continuous
drag.

This does not blunt the common case: the trigger fires as soon as the
interval has elapsed since the last apply, so the first change after any
quiet period is still applied within a poll or two — only sustained,
back-to-back churn is batched onto the 30s cadence. The dial stays
runtime-overridable via UFFS_USN_APPLY_INTERVAL_MS (the verify harness
pins it to 500ms for its short rounds).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@githubrobbi githubrobbi merged commit 774783a into main Jun 26, 2026
22 checks passed
@githubrobbi githubrobbi deleted the fix/update-daemon-logging-and-usn-ext-index branch June 26, 2026 03:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant