
post-PR-#91 follow-ups: SQLite cap warning + bash sim script #95

Merged
aksOps merged 2 commits into main from docs/post-7tool-cleanup
May 25, 2026


@aksOps aksOps commented May 25, 2026

Summary

Two small follow-ups that surfaced when running a pre-release validation
against the merged PR #91 work.

Changes

1. Fix stale SQLite startup warning (main.go)

The warning printed when DB_DRIVER=sqlite still claimed
"~5 services, ~1k events/sec sustained" — the pre-PR-#91 limit.
After PR #91 + PR #92 the SQLite path is documented (and verified) to
handle ~50-120 services on a 4 GB host. The warning text was actively
misleading operators into thinking SQLite is dev-only.

New text matches the README "Production sizing" table verbatim.

2. Bash port of run_simulation.ps1 (test/run_simulation.sh)

The PowerShell simulator runs only on Windows / pwsh. CI runners and
most Linux dev hosts don't have pwsh. test/run_simulation.sh is a
faithful port — same 7 mock services on ports 9001-9007, same weighted
endpoint mix, same per-second stats. Adds DURATION_SEC env knob for
fixed-length validation runs.
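
For reviewers who have not read either simulator, the weighted endpoint mix can be pictured roughly as follows. This is a minimal sketch, not code lifted from test/run_simulation.sh: the weights are the ones listed in the commit message (orders 6x, payments 2x, inventory 2x, auth 1x, notifications 1x), while the function names and the use of bare endpoint labels are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a weighted endpoint pick. The names below are
# labels only, not the real URL paths used by the simulator.
ENDPOINTS=(orders payments inventory auth notifications)
WEIGHTS=(6 2 2 1 1)

# total_weight: sum of all weights (12 for the mix above).
total_weight() {
  local sum=0 w
  for w in "${WEIGHTS[@]}"; do
    sum=$((sum + w))
  done
  printf '%d\n' "$sum"
}

# endpoint_for_roll ROLL: map a roll in [0, total) to an endpoint by
# walking the cumulative weights. Returns 1 if ROLL is out of range.
endpoint_for_roll() {
  local roll=$1 i
  for i in "${!ENDPOINTS[@]}"; do
    if [ "$roll" -lt "${WEIGHTS[$i]}" ]; then
      printf '%s\n' "${ENDPOINTS[$i]}"
      return 0
    fi
    roll=$((roll - WEIGHTS[i]))
  done
  return 1
}

# random_endpoint: draw one endpoint according to the weights.
# $RANDOM is 0..32767, so the modulo bias is negligible for a load test.
random_endpoint() {
  endpoint_for_roll "$((RANDOM % $(total_weight)))"
}
```

With these weights, rolls 0-5 map to orders, 6-7 to payments, 8-9 to inventory, 10 to auth and 11 to notifications, so orders receives half of the traffic.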

Validation

I just used the bash script to do the 10-minute chaos run that surfaced
the stale warning:

| Metric | Result |
| --- | --- |
| Duration | 600s, 10 workers, 10ms delay |
| Chaos requests | 11,840 (27% deliberate fail rate) |
| OtelContext peak RSS | 298 MB (linear growth, no spike) |
| Recovery after load | -18 MB in 60s (GC released) |
| SQLite DB size | 33 MB (spans 12.9, logs 4.3, indices ~10) |
| Panics / OOMs | 0 |
| MCP tools/list | 7 tools ✓ |
| get_anomaly_timeline | Caught shipping-service latency spike + inventory-service error spike |
| get_service_map | 7-node topology built correctly |
| Graceful shutdown | DB flushed cleanly, WAL collapsed |

Test plan

  • gofmt -l main.go clean
  • go vet ./... clean
  • go build ./... clean
  • Bash script chmod +x set
  • Verified end-to-end via the 10-min run referenced above
  • Reviewer: confirm warning text matches README sizing table
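
The reviewer check in the last item could be partly mechanised. The helper below is a hypothetical sketch, not part of this PR: it only confirms that a fixed phrase appears in every file it is given. The function name is invented, and the phrase to search for has to be chosen by the reviewer, since a table cell and a Go string literal may wrap the same wording differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every listed file contains PHRASE
# as a fixed string (no regex interpretation).
# Usage: phrase_in_all PHRASE FILE...
phrase_in_all() {
  local phrase=$1 file missing=0
  shift
  for file in "$@"; do
    if ! grep -qF -- "$phrase" "$file"; then
      # Report each miss on stderr so stdout stays clean for callers.
      printf 'missing in %s: %s\n' "$file" "$phrase" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

For example, `phrase_in_all "50-120 services" main.go README.md` would pass only if both files carry that band (the README.md filename is assumed here). The same helper, negated, can confirm that the stale "~5 services" wording is gone from main.go.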

aksOps added 2 commits May 25, 2026 12:16
…velope

The startup warning printed when DB_DRIVER=sqlite claimed "~5 services,
~1k events/sec sustained" — the pre-PR-#91 limit. After PR #91 the
SQLite path auto-flips conn-pool, ingest workers/queue, metric
cardinality, severity gate, sampling rate, gRPC stream cap, and FTS5 to
defaults that handle the 50-120 service band (verified end-to-end with
test/run_simulation.sh in a 10-minute, 7-mock-service chaos run — peak
RSS 298 MB on a 4 GB host, no OOM, no panics).

The stale warning was actively misleading: it told operators the
SQLite path is dev-only, while the rest of the docs (README "Production
sizing", CLAUDE.md "SQLite per-driver defaults", the 2026-05-24 design
spec) all point them at the 50-120 service band.

New text matches the README "Production sizing" table verbatim:
SQLite for 50-120 services on auto-tuned defaults, Postgres beyond.

The PowerShell simulator runs only on Windows / pwsh. CI runners and
most Linux dev hosts don't have pwsh installed, which made the
"validate the binary under chaos load" workflow Windows-only.

test/run_simulation.sh is a faithful port — same 7 mock services on
ports 9001-9007, same weighted endpoint mix (orders 6x, payments 2x,
inventory 2x, auth 1x, notifications 1x), same per-second stats line
shape. Differences:

- Per-worker counter files in $TMP_DIR/stats/*.cnt aggregated by the
  stats loop (vs ps1's locked Synchronized hashtable). Avoids bash
  shared-state pain at the cost of <1s stat lag.
- Honours DURATION_SEC env so it can run a fixed-length validation
  (e.g. DURATION_SEC=600 for the 10-min pre-release smoke test) on top
  of the original "run until Ctrl+C" mode.
- Trap-driven cleanup kills the 7 service PIDs on EXIT / INT / TERM.
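
The per-worker counter approach from the first item can be sketched as below. This is an illustration under assumptions rather than the code in test/run_simulation.sh: the `worker_<id>.cnt` file naming and both function names are hypothetical, and only the `*.cnt` suffix comes from the commit message.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of lock-free stats via one counter file per worker.

# bump_counter DIR WORKER_ID
# Each worker owns exactly one file, so writers never contend. The new
# value is written to a temp file and renamed into place, so a reader
# never observes a half-written counter.
bump_counter() {
  local file="$1/worker_$2.cnt" n=0
  if [ -f "$file" ]; then
    read -r n < "$file" || n=0
  fi
  printf '%d\n' "$((n + 1))" > "$file.tmp"
  mv "$file.tmp" "$file"
}

# sum_counters DIR
# Print the total across all *.cnt files, or 0 when none exist yet.
sum_counters() {
  local total=0 n f
  for f in "$1"/*.cnt; do
    [ -f "$f" ] || continue   # the glob stays literal when nothing matches
    read -r n < "$f" || n=0
    total=$((total + ${n:-0}))
  done
  printf '%d\n' "$total"
}
```

A stats loop would call `sum_counters` about once per second and print the difference from the previous total. Because the reader samples files that workers update independently, a given tick can trail reality slightly, which is the stat lag mentioned above.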

Validated by running DURATION_SEC=600 against the freshly-built
otelcontext binary: 11,840 chaos requests, 7-service GraphRAG topology
built correctly, anomaly detection caught latency + error spikes, all
7 MCP tools returned valid JSON, no leaks.
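
The DURATION_SEC knob and the trap-driven cleanup described in this commit message fit together roughly as in the sketch below. It is written under assumptions: every function name is hypothetical, `sleep 300` stands in for a mock service, and routing INT / TERM through `exit` so that a single EXIT trap does the cleanup is one common idiom, not necessarily the one the script uses.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: fixed-length run with cleanup of background PIDs.

SERVICE_PIDS=()

# start_fake_service: launch a placeholder background process and record
# its PID so cleanup can find it later.
start_fake_service() {
  sleep 300 &
  SERVICE_PIDS+=("$!")
}

# cleanup: stop and reap every recorded PID. Errors from PIDs that have
# already exited are ignored.
cleanup() {
  local pid
  for pid in "${SERVICE_PIDS[@]}"; do
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
  done
  SERVICE_PIDS=()
}

# deadline_reached START DURATION NOW
# A DURATION of 0 (or empty) means "run until interrupted", so the
# deadline is never reached in that mode.
deadline_reached() {
  local start=$1 duration=${2:-0} now=$3
  [ "$duration" -gt 0 ] && [ $((now - start)) -ge "$duration" ]
}

# main: start the services, then loop until the deadline or a signal.
main() {
  trap cleanup EXIT
  trap 'exit 130' INT    # Ctrl+C: exit, which fires the EXIT trap
  trap 'exit 143' TERM
  local start=$SECONDS i
  for i in 1 2 3; do
    start_fake_service
  done
  until deadline_reached "$start" "${DURATION_SEC:-0}" "$SECONDS"; do
    sleep 1   # a real run would drive load and print stats here
  done
}
```

Calling `DURATION_SEC=600 main` would correspond to the 10-minute validation run, while leaving DURATION_SEC unset keeps the original run-until-Ctrl+C behaviour.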
@aksOps aksOps force-pushed the docs/post-7tool-cleanup branch from f3fcbff to afc0cab May 25, 2026 12:16

@aksOps aksOps merged commit ce19ec7 into main May 25, 2026
17 checks passed
@aksOps aksOps deleted the docs/post-7tool-cleanup branch May 25, 2026 12:26