post-PR-#91 follow-ups: SQLite cap warning + bash sim script #95
Merged
…velope The startup warning printed when DB_DRIVER=sqlite claimed "~5 services, ~1k events/sec sustained" — the pre-PR-#91 limit. After PR #91 the SQLite path auto-flips conn-pool, ingest workers/queue, metric cardinality, severity gate, sampling rate, gRPC stream cap, and FTS5 to defaults that handle the 50-120 service band (verified end-to-end with test/run_simulation.sh in a 10-minute, 7-mock-service chaos run — peak RSS 298 MB on a 4 GB host, no OOM, no panics). The wrong warning was actively misleading: it tells operators the SQLite path is dev-only when the rest of the docs (README "Production sizing", CLAUDE.md "SQLite per-driver defaults", the 2026-05-24 design spec) all point them at the 50-120 service band. New text matches the README "Production sizing" table verbatim: SQLite for 50-120 services on auto-tuned defaults, Postgres beyond.
The PowerShell simulator runs only on Windows / pwsh. CI runners and most Linux dev hosts don't have pwsh installed, which made the "validate the binary under chaos load" workflow Windows-only. test/run_simulation.sh is a faithful port: same 7 mock services on ports 9001-9007, same weighted endpoint mix (orders 6x, payments 2x, inventory 2x, auth 1x, notifications 1x), same per-second stats line shape.
Differences:
- Per-worker counter files in $TMP_DIR/stats/*.cnt aggregated by the stats loop (vs ps1's locked Synchronized hashtable). Avoids bash shared-state pain at the cost of <1s stat lag.
- Honours DURATION_SEC env so it can run a fixed-length validation (e.g. DURATION_SEC=600 for the 10-min pre-release smoke test) on top of the original "run until Ctrl+C" mode.
- Trap-driven cleanup kills the 7 service PIDs on EXIT / INT / TERM.
Validated by running DURATION_SEC=600 against the freshly-built otelcontext binary: 11,840 chaos requests, 7-service GraphRAG topology built correctly, anomaly detection caught latency + error spikes, all 7 MCP tools returned valid JSON, no leaks.
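The counter-file and trap-cleanup approach described above might look roughly like the sketch below. It is not the code from test/run_simulation.sh: the function names (record_request, aggregate_requests, cleanup), the one-integer-per-file format, and the `sleep` stand-ins for the mock services are all assumptions made for illustration. Only the $TMP_DIR/stats/*.cnt layout and the EXIT / INT / TERM signals come from the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file format are hypothetical.
set -u

TMP_DIR="$(mktemp -d)"
mkdir -p "$TMP_DIR/stats"

PIDS=()

# Kill every tracked background process and remove the scratch directory.
cleanup() {
  if [ "${#PIDS[@]}" -gt 0 ]; then
    kill "${PIDS[@]}" 2>/dev/null
  fi
  wait 2>/dev/null
  rm -rf "$TMP_DIR"
}
trap cleanup EXIT
# INT and TERM just exit; the EXIT trap then performs the actual cleanup.
trap 'exit 130' INT
trap 'exit 143' TERM

# Each worker owns exactly one counter file, so no locking is needed.
record_request() {
  local file="$TMP_DIR/stats/$1.cnt" n=0
  if [ -f "$file" ]; then
    n="$(cat "$file")"
  fi
  echo $(( ${n:-0} + 1 )) > "$file"
}

# The stats loop sums whatever the workers have written so far.
aggregate_requests() {
  local total=0 f n
  for f in "$TMP_DIR"/stats/*.cnt; do
    [ -f "$f" ] || continue        # the glob matched nothing
    n="$(cat "$f")"
    total=$(( total + ${n:-0} ))   # a file caught mid-write counts as 0
  done
  echo "$total"
}

# Stand-ins for the 7 mock services, so cleanup has PIDs to kill.
for i in 1 2 3 4 5 6 7; do
  sleep 300 >/dev/null 2>&1 &
  PIDS+=("$!")
done

record_request worker1
record_request worker1
record_request worker2
echo "total=$(aggregate_requests)"   # prints total=3
```

Because a reader can catch a file between truncation and write, a sample may briefly under-count; the next pass corrects it, which is consistent with the "<1s stat lag" trade-off noted above.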



Summary
Two small follow-ups that surfaced when running a pre-release validation
against the merged PR #91 work.
Changes
1. Fix stale SQLite startup warning (main.go)
The warning printed when DB_DRIVER=sqlite still claimed "~5 services, ~1k events/sec sustained", which was the pre-PR-#91 limit.
After PR #91 + PR #92 the SQLite path is documented (and verified) to
handle ~50-120 services on a 4 GB host. The warning text was actively
misleading operators into thinking SQLite is dev-only.
New text matches the README "Production sizing" table verbatim.
2. Bash port of run_simulation.ps1 (test/run_simulation.sh)
The PowerShell simulator runs only on Windows / pwsh. CI runners and
most Linux dev hosts don't have pwsh. test/run_simulation.sh is a
faithful port: same 7 mock services on ports 9001-9007, same weighted
endpoint mix, same per-second stats. Adds a DURATION_SEC env knob for
fixed-length validation runs.
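As a rough illustration of the weighted endpoint mix and the DURATION_SEC knob, here is a minimal sketch. The weights (orders 6x, payments 2x, inventory 2x, auth 1x, notifications 1x) are the ones quoted in the commit message; everything else is assumed, including the names ENDPOINTS, pick_endpoint and should_continue and the use of 0 to mean "run until Ctrl+C". The sketch defaults to 2 seconds so that it terminates on its own, whereas the real script is described as running until interrupted when no duration is given.

```shell
#!/usr/bin/env bash
# Sketch only: names below are hypothetical, not taken from run_simulation.sh.
set -u

# Weighted mix: each endpoint appears once per unit of weight (12 slots total).
ENDPOINTS=(
  orders orders orders orders orders orders
  payments payments
  inventory inventory
  auth
  notifications
)

# Store the choice in PICKED rather than echoing it, so no subshell is needed.
pick_endpoint() {
  PICKED="${ENDPOINTS[RANDOM % ${#ENDPOINTS[@]}]}"
}

# Succeeds while the run should go on.
# A duration of 0 is this sketch's encoding of "run until Ctrl+C".
should_continue() {
  local duration="$1" started="$2"
  [ "$duration" -eq 0 ] || [ $(( SECONDS - started )) -lt "$duration" ]
}

# Demo default of 2 seconds; e.g. DURATION_SEC=600 gives a 10-minute run.
DURATION_SEC="${DURATION_SEC:-2}"
started=$SECONDS

while should_continue "$DURATION_SEC" "$started"; do
  pick_endpoint
  echo "GET /$PICKED"   # a real simulator would send a request here
  sleep 1
done
```

Repeating names in an array keeps the selection to one modulo operation and makes the weights readable at a glance, at the cost of a longer array if weights grow large.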
Validation
I just used the bash script to do the 10-minute chaos run that surfaced
the stale warning:
- tools/list
- get_anomaly_timeline
- get_service_map
Test plan
- gofmt -l main.go clean
- go vet ./... clean
- go build ./... clean