Skip to content

PLT-458: Open-loop scheduler (replace closed-loop dequeue)#48

Open
bdchatham wants to merge 10 commits into
mainfrom
brandon2/plt-458-open-loop-scheduler
Open

PLT-458: Open-loop scheduler (replace closed-loop dequeue)#48
bdchatham wants to merge 10 commits into
mainfrom
brandon2/plt-458-open-loop-scheduler

Conversation

@bdchatham

Copy link
Copy Markdown
Contributor

Implements PLT-458 — the core of the coordinated-omission fix. Issues at t_i = t₀ + i/λ independent of in-flight completion, so latency no longer hides the backlog.

What

  • New sender/scheduler.go (openLoopScheduler): arrival clock at t₀ + i/λ (absolute-instant sleep → drift-free), stamps IntendedSendTime at the true scheduled instant + a per-tx SequenceIndex. Workers become pure async senders (the limiter.Allow() busy-spin is gone).
  • Bounded in-flight + drop-and-count: a semaphore caps true unacked sends; on overflow the overdue tx is dropped and counted — the arrival clock is never throttled by backpressure. The permit is released at real send completion (a LoadTx.OnComplete hook fired by the worker after sendTransaction), so dropped measures genuine load-shed, not buffer geometry.
  • One rate authority: λ comes from the ramper's shared rate.Limiter. Behind a config flag (--arrival-model), with the legacy closed-loop path retained as the regression baseline. arrival_model recorded.

Review (systems + measurement + idiom, two rounds)

The loop caught and fixed two blocking issues before merge:

  • B1: open-loop with the default TPS=0rate.Infgap=0 → a degenerate constant latency anchor. Now rejected at config validation (open-loop requires finite positive λ: TPS>0 or a ramp).
  • B2: the in-flight semaphore originally released at enqueue, so dropped measured buffer geometry and a synchronous test masked it. Fixed to release at real send completion; test now uses an async sender + a direct PermitHeldUntilCompletion guard. Verified: concurrency-correct (sync.Once, happens-before intact, -race -count=20 clean), conservation issued == sent + dropped holds.

schedule_lag (PLT-463) remains the primary CO-detection gate. Forward-note: inclusion-rate denominators must use sent, never issued.

Decision brief: designs/sei-load-workload-modeler/PLT-458-open-loop-scheduler.md.

🤖 Generated with Claude Code

bdchatham and others added 2 commits June 12, 2026 07:58
Make transaction arrival open-loop to fix coordinated omission: tx i is
issued at t₀ + i/λ independent of in-flight completion, so a slow SUT no
longer slows the generator and hides backlog in latency.

- sender/scheduler.go: openLoopScheduler owns t₀ and the monotonic
  sequence index i, derives λ from the shared rate.Limiter as a clock
  source (sampled per tick to honor a ramping λ; telescopes to t₀ + i/λ
  at fixed λ), and stamps IntendedSendTime at the true scheduled instant.
- Overflow is bounded-in-flight + drop-and-count: a non-blocking
  semaphore TryAcquire admits the tx or drops-and-counts it; the arrival
  clock is never blocked on capacity (REL8/REL9 load shedding).
- One rate authority preserved: the ramper still drives λ via
  limiter.SetLimit; the worker's busy-spin Allow() gate is replaced by a
  blocking Wait (closed-loop only) and disabled under open-loop.
- Behind config flag arrival-model (default closed_loop, the regression
  baseline) + max-in-flight; arrival_model and run_txs_dropped_total are
  recorded at run end.
- types.LoadTx gains SequenceIndex (scheduler-owned, single-write per the
  documented concurrency contract) for PLT-463 schedule-lag attribution;
  dropped txs carry zero InclusionTime and are kept out of inclusion-rate
  denominators.

Tests: schedule-accuracy (tracks t₀ + i/λ within tolerance), clock not
throttled by a slow sender (overrun dropped not blocked), ramped-λ gap
shrink, and stamp-before-handoff under -race. go build + go test -race
green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… bound

B1: reject open_loop without a finite positive arrival rate. With TPS=0 and
no ramp, λ=rate.Inf, the inter-arrival gap collapses to 0, IntendedSendTime
never advances past t₀, and the scheduler spins and drops everything. Add
config.Settings.Validate (TPS>0 or --ramp-up required for open_loop) and call
it after ResolveSettings; fail fast with a clear error. minScheduleRate stays
as divide-by-zero/+Inf defense-in-depth only.

B2: tie the in-flight permit to real send completion, not enqueue. Worker.Send
(via ShardedSender) returns at enqueue, so the prior defer release() bounded
enqueue backlog, not unacked sends — and dropped reflected buffer geometry.
Thread a LoadTx.OnComplete hook the worker invokes after sendTransaction; the
scheduler stamps it to release the permit so maxInFlight bounds true in-flight
and dropped measures genuine load-shed. Enqueue-failure path completes inline.
schedule_lag (PLT-463) remains the primary CO-detection gate.

Also: reject unrecognized --arrival-model at config load; drop the
SequenceIndex self-disambiguation claim (gate on run-level arrival_model);
state the dropped-tx inclusion-denominator invariant plainly; fix relase typo.

Tests: replace the synchronous fakeSender with an async enqueue-and-complete
sender so the slow-sender drop test exercises production semantics; add the
permit-held-until-completion guard, a conservation invariant, and config
validation rejection tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 12, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Changes the core dispatch/send pipeline and how latency anchors are stamped; default closed-loop limits blast radius, but open-loop alters measurement semantics and goodput accounting for anyone enabling it.

Overview
Adds a selectable transaction arrival model (--arrival-model, --max-in-flight) so load can follow a fixed schedule independent of sender backlog, addressing coordinated omission in latency measurement. Closed-loop remains the default regression baseline.

Open-loop path: a new scheduler issues arrivals at t₀ + i/λ (absolute sleep, λ from the shared limiter/ramper), stamps IntendedSendTime and SequenceIndex, caps true in-flight work with a non-blocking semaphore, and drops + counts ticks when saturated instead of slowing the clock. Permits release only after the worker finishes the RPC via LoadTx.OnComplete. Admit-before-generate keeps dropped ticks from advancing the seeded generator.

Wiring: Settings.Validate() rejects invalid models and open-loop without finite λ (TPS>0 or ramp). Workers skip limiter gating in open-loop; prewarm self-paces off the shared limiter. Run shutdown exports arrival_model, dropped, and failed gauges on the run summary.

Large scheduler/unit and real-worker test coverage exercises schedule accuracy, conservation, determinism under drops, permit lifecycle, and prewarm pacing.

Reviewed by Cursor Bugbot for commit 5399f90. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread sender/scheduler.go
Omit the redundant time.Duration type from the minGap declaration in
scheduler_test (inferred from time.Hour). Caught by golangci-lint (CI gate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-loop-scheduler

# Conflicts:
#	main.go
#	sender/sharded_sender.go
#	sender/worker.go
Comment thread sender/dispatcher.go
bdchatham and others added 2 commits June 12, 2026 11:47
Every existing scheduler test drove a fake TxSender that fired tx.OnComplete
itself, so the suite stayed green even if the real Worker forgot to invoke it
in runTxSender — leaking the open-loop in-flight semaphore (permits never
released → the maxInFlight bound becomes meaningless).

Add an httptest JSON-RPC harness (answers eth_sendRawTransaction, the only RPC
the ethclient send path issues) behind the real Worker + open-loop scheduler:

- Conservation on the real path: issued == completed + dropped, with completed
  driven by the real worker's OnComplete; handled sends matched against the
  real RPC server's count.
- Permit released by the worker: maxInFlight=1 with a server that blocks one
  send holds exactly one in flight and drops the rest; releasing it resumes
  flow — which is only possible if the worker fires OnComplete.

Both tests fail when the OnComplete invoke is removed (verified). Also clarify
the scheduler doc: enqueue is async but the RPC send is synchronous, so the
permit is held for the full round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The conservation test sampled `succeeded == handled` against an in-flight
window: the httptest server bumps `handled` on receiving
eth_sendRawTransaction, but the worker bumps `succeeded` only after
SendTransaction returns and OnComplete fires. CI's slower scheduling caught
that window (handled=200, succeeded=199).

The dominant cause was teardown ordering, not pure sampling: the scheduler ran
as the scope's main task, so the instant it exhausted the generator and
returned, service.Run canceled the worker's context — aborting the last send
whose 200 OK the server had already counted. That send completed with
context-canceled, so completed++ but not succeeded++.

Fix: run the scheduler and worker as background tasks behind a main gate that
blocks until the test tears down, so the scope stays alive until quiescence.
Anchor the assertion on the fixed total and require exhaustion, conservation,
and equality together in one predicate, evaluated only once they all hold (a
stable fixpoint, since the counters are monotonic and no new work is issued
after exhaustion). Correctness depends on convergence, not the deadline.

Verified: go test -race -count=50 and -count=20 -cpu=1,2,4 green; GOMAXPROCS=1
and =2 green. Falsification holds — commenting out the OnComplete invoke in
runTxSender fails both tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread sender/sharded_sender.go
bdchatham and others added 2 commits June 12, 2026 13:33
Move the dense coordinated-omission / arrival-model narrative out of
scheduler.go into a new sender/doc.go package doc, and lean the inline
comments to terse pointers. No behavior change.

The load-bearing inline notes stay (leaned) at the code they guard: the
worker's OnComplete permit-release and the single-writer stamp-before-
hand-off contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bugbot finding 1: the scheduler called Generate() before TryAcquire, so a
dropped tick still advanced the seeded generator streams and signed a tx that
was then discarded. Under saturation the same seed produced a different admitted
workload depending on SUT speed, violating the per-stream reproducibility
contract (PLT-456). Reorder so the in-flight permit is acquired first; a dropped
tick now consumes zero draws and zero signing CPU, making admitted txs a
deterministic prefix of the seeded sequence. The arrival clock stays
non-blocking/unthrottled. Add Admitted() so conservation anchors on
scheduled-arrival counts (admitted + dropped), not generator-draw counts, since
draws now occur only on admitted ticks. SequenceIndex remains the arrival-tick
index i (monotonic, non-contiguous under drops) so IntendedSendTime = t0 + i/λ
holds. New determinism guard test asserts admitted draws are a gapless prefix
under forced drops.

Bugbot finding 2: an admitted send that failed in the worker was logged but
counted as neither sent nor dropped, breaking issued == sent + dropped under RPC
failures. Add a failed counter; onSent increments succeeded on nil err and
failed on non-nil err (both release the permit). Thread failed into the run
summary (new run_txs_failed_total gauge) for goodput accounting. Conservation
invariant documented and tested: scheduled == dropped + admitted, and
admitted == succeeded + failed; the conservation test injects a failing send and
asserts it is counted as failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit fe8df95. Configure here.

Comment thread sender/scheduler_test.go Outdated
… vocabulary

Trim the dropped/admitted field-comment block and DispatcherStats/RunSummary
comments to at-site facts plus a pointer to doc.go's Conservation section, which
owns the invariant. Unify prose to scheduled = dropped + admitted;
admitted = succeeded + failed. Make Admitted()'s doc honest: test/audit only,
no production consumer. Comment-only; no identifier or metric renames.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Contributor Author

Both Cursor Bugbot findings addressed (commit fe8df95, polish in 30c2578):

1. Drop path still generates transactions — fixed. The scheduler tick now acquires the in-flight permit (TryAcquire) before Generate(). On a drop it counts and continues with no generation, no signing, and no seeded-stream draw — so dropped slots no longer consume PLT-456 sub-stream draws (which would otherwise make the same seed produce a different workload under saturation, breaking replay). The arrival clock stays unthrottled (TryAcquire is non-blocking). New TestOpenLoopSchedule_DroppedSlotsConsumeNoDraws pins it: admitted draws form the gapless prefix 0..N-1 under forced saturation.

2. Failed sends break issued accounting — fixed. Added a failed counter; onSent increments succeeded on nil error and failed on non-nil, both releasing the permit. Threaded into RunSummary.Failed + run_txs_failed_total. Documented invariant: scheduled == dropped + admitted; admitted == succeeded + failed. TestOpenLoopSchedule_Conservation now injects failing sends and asserts they're counted as failed, not lost.

Both verified under go test -race -count=20; systems + idiom re-review clean.

@bdchatham bdchatham requested review from amir-deris and masih June 12, 2026 21:33
In open-loop the workers are constructed RateLimited=false because the
scheduler owns the arrival clock, but the scheduler paces only the main
load. Prewarm runs first over those same ungated workers, so it flooded
the SUT unthrottled. Pace prewarm off the shared limiter in the
dispatcher: closed-loop leaves d.limiter nil and the worker still gates,
so neither phase is double-throttled and the main loop stays
scheduler-paced.

Correct the conservation test to the two-level invariant from doc.go:
scheduled == dropped + admitted (using the scheduler counters, since a
dropped tick no longer draws Generate) and admitted == succeeded +
failed. Add a real-worker prewarm-rate-limit guard.

Trim worker.go/dispatcher.go inline narration that doc.go owns, keeping
the load-bearing notes (OnComplete CRITICAL, single-writer stamp,
per-iteration context cancel, conservation counters).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant