From 1a27106a56aca98fa6f19aef2f04a3834979a37f Mon Sep 17 00:00:00 2001 From: Amit Kumar Date: Mon, 25 May 2026 11:43:08 +0000 Subject: [PATCH 1/2] docs: reflect post-PR-#91 state across CHANGELOG + historical plans MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bring documentation that wasn't touched by PR #91 itself into alignment with what actually shipped on 2026-05-24: - CHANGELOG.md: add the PR #91 entry under [Unreleased]. Covers all five change classes (Added: SQLite PRAGMA stanza + per-driver defaults + design spec; Changed: 7-tool MCP surface, LOG_FTS_ENABLED default; Removed: vectordb package + 14 MCP tools + graph_snapshots table + central-ops module dep; Fixed: 120-service OOM; Security: 28 OSV advisories on x/crypto, x/net, x/sys, Go stdlib, brace-expansion). - docs/IMPLEMENTATION_PLAN.md: add a status banner at the top noting the document is historical planning intent. Three specific sections are flagged as superseded — vector layer (5.3), MCP surface (5.4), hot/cold storage tiering (goal #4) — with pointers to the authoritative current sources (CLAUDE.md + 2026-05-24 design spec). No line-by-line rewrite; preserve the original record. - docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md: add a superseded banner. The earlier plan was "drop FTS5, persist vectordb"; PR #91 did the opposite (kept FTS5, removed vectordb). The 24-hour search_logs cap from that plan did ship. Pure docs change. No code touched. --- CHANGELOG.md | 54 +++++++++++++++++++ docs/IMPLEMENTATION_PLAN.md | 31 +++++++++++ .../2026-04-30-storage-rebalance-plan.md | 13 ++++- 3 files changed, 97 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 4ec8806..5e2c431 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,6 +15,17 @@ last published pre-release tag (`v0.0.11-beta.15`). 
### Added +- **SQLite survival tuning for 120-service production load** ([#91]): + - Fail-closed PRAGMA stanza in `internal/storage/factory.go` — + `journal_mode=WAL`, `synchronous=NORMAL`, 256 MB page cache, 1 GB + mmap, 64 MB WAL cap, `busy_timeout=5000`. + - Per-driver config defaults — `config.applyDriverDefaults` overrides + 9 tunables (conn pool, ingest workers/queue, metric cardinality, + store-min-severity, sampling rate, gRPC stream cap, `LOG_FTS_ENABLED`) + when `DB_DRIVER=sqlite` and the operator did not set the env var + explicitly. + - Design spec at + [`docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md`](docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md). - **Multi-tenancy across the stack** — tenant context plumbed end-to-end: - GraphRAG: in-memory stores partitioned per tenant + query context propagation. ([#27], RAN-37) @@ -45,14 +56,49 @@ last published pre-release tag (`v0.0.11-beta.15`). ### Changed +- **MCP surface reduced from 21 tools to 7 triage-essential tools** ([#91]). + Kept: `get_anomaly_timeline`, `get_service_map`, `get_service_health`, + `root_cause_analysis`, `impact_analysis`, `trace_graph`, `search_logs`. + Removed clients now receive `unknown tool` RPC errors — see CLAUDE.md + "MCP Server" for the full keep/cut list and rationale. +- **`LOG_FTS_ENABLED` defaults to `true` on SQLite** ([#91]). FTS5 BM25 + ranking became the default log-search backend on the SQLite path, + replacing the vectordb TF-IDF dispatch. Operators who need the ~30% + disk savings can opt out via `LOG_FTS_ENABLED=false` + `POST /api/admin/drop_fts`. - CI: replaced the deleted central-ops reusable workflow with a local `ci.yml` so otelcontext owns its quality gates without relying on an external repo. ([#26]) - Post-robustness follow-ups consolidated as a single chore pass over the 100–200-service work. ([#25]) +### Removed + +- **`internal/vectordb/` package + `find_similar_logs` MCP tool** ([#91]). 
+ The TF-IDF semantic-search index added 163 MB steady-state RAM with a + 290 MB peak during FIFO eviction, plus a snapshot loop and DB + tail-replay goroutine. Log similarity now comes from Drain template + clustering plus FTS5 BM25 ranking. `data/vectordb.snapshot` is left + on disk for operators to delete by hand. +- **`graph_snapshots` table + scheduler** ([#91]). The 15-minute + topology snapshot path (`internal/graphrag/snapshot.go`) wrote ~480 + MB/day to disk for a `get_graph_snapshot` MCP tool nothing in + production ever called. AutoMigrate no longer creates the table on + fresh deploys; existing populated tables are left in place + (`DROP TABLE graph_snapshots; VACUUM;` to reclaim disk). +- **`github.com/RandomCodeSpace/central-ops` Go module dependency** ([#91]). + Replaced two tiny helpers (`pkg/version.Detect`, `pkg/httputil.CORSMiddleware`) + with inline equivalents in `main.go` and `internal/mcp/server.go`. + Build now succeeds against a 100% public module graph. + ### Fixed +- **OOM at ~120 services on SQLite under continuous load** ([#91]). At + default config the binary did not survive an hour: ingest pipeline + queue saturation under SQLite WAL contention pinned 0.5–5 GB of + pending batches, GraphRAG permanent stores grew without TTL, TSDB + ring buffer multi-GB at default cardinality. The 7-tool surface + reduction + SQLite-tuned defaults bring steady-state RSS to ~1.8 GB + with bounded bursts at the 120-service target. - **MCP**: propagate `cfg.DefaultTenant` to the MCP fallback path so tools invoked without an explicit tenant resolve to the configured default rather than failing. ([#33], RAN-22) @@ -65,6 +111,13 @@ last published pre-release tag (`v0.0.11-beta.15`). ### Security +- **OSV-Scanner advisory clean-up** ([#91]): + - `golang.org/x/crypto` v0.50.0 → v0.52.0 (12 advisories: GO-2026-5005..5023, 5033). + - `golang.org/x/net` v0.53.0 → v0.55.0 (6 advisories: GO-2026-5025..5030). 
+ - `golang.org/x/sys` v0.43.0 → v0.45.0 (1 advisory: GO-2026-5024). + - Go stdlib `go.mod` directive 1.25.9 → 1.25.10 (8 stdlib advisories). + - `brace-expansion` 5.0.5 → 5.0.6 via npm overrides + (GHSA-jxxr-4gwj-5jf2, CVSS 6.5). - Adopted the OSS-CLI security stack as the project's continuous supply-chain observability surface (Semgrep + OSV-Scanner + Trivy + Gitleaks + jscpd + anchore SBOM). High/Critical findings are merge @@ -85,3 +138,4 @@ last published pre-release tag (`v0.0.11-beta.15`). [#33]: https://github.com/RandomCodeSpace/otelcontext/pull/33 [#34]: https://github.com/RandomCodeSpace/otelcontext/pull/34 [#47]: https://github.com/RandomCodeSpace/otelcontext/pull/47 +[#91]: https://github.com/RandomCodeSpace/otelcontext/pull/91 diff --git a/docs/IMPLEMENTATION_PLAN.md b/docs/IMPLEMENTATION_PLAN.md index 41723fb..183899d 100644 --- a/docs/IMPLEMENTATION_PLAN.md +++ b/docs/IMPLEMENTATION_PLAN.md @@ -1,5 +1,36 @@ # OtelContext Observability & Performance Improvement Plan +> **Status — historical planning intent (2026-Q1/Q2). The shipped system has diverged.** +> +> This document captures the *original* design exploration that drove the +> early implementation work. Significant pieces of it no longer reflect +> reality after PR #91 (2026-05-24, "7-tool MCP triage surface + SQLite +> survival tuning"): +> +> - **Vector (embedded) layer — removed.** `internal/vectordb/` (TF-IDF +> index, `find_similar_logs` MCP tool, Section 5.3 below) was deleted +> entirely. GraphRAG's `SimilarErrors` was unused and was dropped with +> it. Log similarity now comes from Drain template clustering plus FTS5 +> BM25 ranking — see CLAUDE.md "Storage Architecture". +> - **MCP surface — 11 → 7 tools.** Section 5.4 lists 11 MCP tools as the +> planned surface. The shipped MCP surface is 7: `get_anomaly_timeline`, +> `get_service_map`, `get_service_health`, `root_cause_analysis`, +> `impact_analysis`, `trace_graph`, `search_logs`. 
Cut: `get_system_graph`, +> `tail_logs`, `get_trace`, `search_traces`, `get_metrics`, +> `get_dashboard_stats`, `get_storage_status`, `find_similar_logs`, +> `get_alerts`, `correlated_signals`, `get_error_chains`, +> `get_investigations`, `get_investigation`, `get_graph_snapshot`, +> `search_cold_archive`. +> - **Hot/cold storage tiering** (goal #4 below, "7 days hot, archive +> older") shipped only the hot side. There is no cold-archive subsystem +> today; retention drops by age via `RetentionScheduler`. +> +> The current authoritative documents are: +> - [`CLAUDE.md`](../CLAUDE.md) — operator/agent SSoT for what runs today. +> - [`docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md`](superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md) — design record for the 7-tool reduction + SQLite tuning. +> +> Read the rest of this file as the original intent, not the current spec. + ## Context OtelContext is a self-hosted OTLP observability platform (Go backend + React frontend) that ingests traces, logs, and metrics via gRPC, stores them in a relational DB (SQLite/MySQL/PostgreSQL/MSSQL), and serves dashboards via HTTP/WebSocket. When connected to active services pushing continuous telemetry, the DB will grow unbounded and performance will degrade. 
The goal is to: diff --git a/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md b/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md index 72e0624..4b44a36 100644 --- a/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md +++ b/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md @@ -1,6 +1,17 @@ # Plan — Storage rebalance: drop FTS5, persist vectordb, cap log search -**Status:** Approved scope, ready for implementation +> **⚠️ SUPERSEDED on 2026-05-24 by PR #91 — do not implement this plan as written.** +> +> When 120-service production load surfaced OOMs in May 2026, the diagnosis +> in [`2026-05-24-mcp-7tool-sqlite-survival-design.md`](2026-05-24-mcp-7tool-sqlite-survival-design.md) +> reversed the central trade-off here: vectordb was **removed entirely** +> (along with `find_similar_logs`) and FTS5 was **kept** as the default +> SQLite log-search backend. The 24-hour `search_logs` time-window cap +> from this plan did ship and remains in effect. +> +> Preserved verbatim below for historical reference. + +**Status:** Approved 2026-04-30 — superseded 2026-05-24 by PR #91. **Date:** 2026-04-30 **Reviewers:** codex (vectordb persistence design, 2026-04-30); user (scope sign-off, 2026-04-30) From 15a62af709d8a3ddad0e92741c6d847a82c164dc Mon Sep 17 00:00:00 2001 From: Amit Kumar Date: Mon, 25 May 2026 11:49:23 +0000 Subject: [PATCH 2/2] =?UTF-8?q?docs:=20revise=20PR=20#92=20=E2=80=94=20dro?= =?UTF-8?q?p=20superpowers=20refs,=20update=20README=20instead?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User policy clarification: agent-generated superpowers/* docs should not ship to git. Revising this PR accordingly: - Revert the banner I added to docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md (file restored to origin/main state in this branch). - Drop the link to docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md from CHANGELOG and IMPLEMENTATION_PLAN. 
CLAUDE.md is the authoritative pointer. README.md additions to reflect the post-PR-#91 reality: - New "Production sizing" section between "Switching databases" and "OTLP Integration". Three-row table maps workload size to the recommended DB. Notes the auto-flipped SQLite defaults and the OTELCONTEXT_ALLOW_SQLITE_PROD=false guardrail. - Features list expanded to cover what shipped in PR #91 but wasn't yet surfaced to README readers: hybrid ingest backpressure, MCP per-call deadlines / concurrency semaphore / TTL cache / SSE keep-alives, log search (FTS5 default + pg_trgm + 24h cap on search_logs), per-tenant cardinality, auto-tuned SQLite PRAGMA stanza + per-driver defaults, self-instrumentation loopback guard. --- CHANGELOG.md | 5 ++- README.md | 33 +++++++++++++++---- docs/IMPLEMENTATION_PLAN.md | 6 ++-- .../2026-04-30-storage-rebalance-plan.md | 13 +------- 4 files changed, 33 insertions(+), 24 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5e2c431..ebbb48d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -23,9 +23,8 @@ last published pre-release tag (`v0.0.11-beta.15`). 9 tunables (conn pool, ingest workers/queue, metric cardinality, store-min-severity, sampling rate, gRPC stream cap, `LOG_FTS_ENABLED`) when `DB_DRIVER=sqlite` and the operator did not set the env var - explicitly. - - Design spec at - [`docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md`](docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md). + explicitly. See [`CLAUDE.md`](CLAUDE.md) "SQLite per-driver defaults" + for the full table. - **Multi-tenancy across the stack** — tenant context plumbed end-to-end: - GraphRAG: in-memory stores partitioned per tenant + query context propagation. 
([#27], RAN-37) diff --git a/README.md b/README.md index dc1053e..0e8d9bb 100644 --- a/README.md +++ b/README.md @@ -79,6 +79,25 @@ DB_DSN="root:password@tcp(localhost:3306)/otelcontext?charset=utf8mb4&parseTime= See [`.env.example`](.env.example) for SQL Server and Azure Entra (passwordless Postgres) configurations. +## Production sizing + +OtelContext auto-tunes itself by driver. The numbers below assume the +auto-flipped SQLite defaults (5% sampling baseline, `STORE_MIN_SEVERITY=WARN`, +3k metric cardinality cap, FTS5 enabled, 1 SQLite writer with WAL + 256 MB +page cache + 1 GB mmap). Postgres keeps the looser defaults. + +| Workload | DB | Steady RSS | Notes | +|---|---|---|---| +| Dev / <10 services | SQLite | <500 MB | Default config; no tuning needed. | +| 50–120 services, 7-day retention | **SQLite (auto-tuned)** or Postgres | ~1.8 GB | SQLite survives this band on the auto-flipped defaults. | +| >120 services, or >7-day retention, or sustained 50+ writes/sec | **Postgres** | depends on host | SQLite's single-writer serialization becomes the bottleneck. | + +`OTELCONTEXT_ALLOW_SQLITE_PROD=false` is the guardrail — `APP_ENV=production` with `DB_DRIVER=sqlite` refuses to start unless the operator opts in. + +See [`CLAUDE.md`](CLAUDE.md) "SQLite per-driver defaults" for the full +table of which env vars get auto-overridden on SQLite, and the rationale +per entry. + ## OTLP Integration OtelContext accepts OTLP gRPC on `:4317` and OTLP HTTP on `:8080/v1/{traces,logs,metrics}`. Point any OpenTelemetry Collector (or SDK) at it: @@ -104,14 +123,16 @@ See `docs/otel-collector-example.yaml` for a complete example. ## Features -- **OTLP gRPC + HTTP ingest** — traces, logs, metrics; gzip and protobuf/JSON supported. -- **GraphRAG** — layered in-memory graph with error-chain, impact, and root-cause queries. +- **OTLP gRPC + HTTP ingest** — traces, logs, metrics; gzip and protobuf/JSON supported. Hybrid backpressure (90% soft-drop, 100% reject) prevents queue OOMs. 
+- **GraphRAG** — layered in-memory graph with `error_chain`, `impact_analysis`, `root_cause_analysis`, and anomaly-correlation queries. - **Drain log clustering** — deterministic template mining, persisted across restarts. -- **MCP server** — 7-tool triage surface for AI agents over JSON-RPC 2.0 + SSE (get_anomaly_timeline, get_service_map, get_service_health, root_cause_analysis, impact_analysis, trace_graph, search_logs). -- **Multi-tenancy** — per-row `tenant_id`, `X-Tenant-ID` header / `x-tenant-id` gRPC metadata. -- **Adaptive sampling** — always-on for errors and slow spans, probabilistic otherwise. +- **MCP server** — 7-tool triage surface for AI agents over JSON-RPC 2.0 + SSE: `get_anomaly_timeline`, `get_service_map`, `get_service_health`, `root_cause_analysis`, `impact_analysis`, `trace_graph`, `search_logs`. Per-call deadlines, concurrency semaphore, 5 s TTL cache for cheap in-memory tools, SSE keep-alives every 25 s. +- **Log search** — SQLite FTS5 (BM25-ranked) on by default; `pg_trgm` GIN on Postgres; LIKE fallback. `search_logs` is 24-hour-capped to bound the worst-case scan. +- **Multi-tenancy** — per-row `tenant_id`, `X-Tenant-ID` header / `x-tenant-id` gRPC metadata, per-tenant cardinality caps. +- **Adaptive sampling** — always-on for errors and slow spans, probabilistic otherwise (defaults to 5 % on SQLite, 100 % on Postgres). +- **Auto-tuned SQLite path** — fail-closed PRAGMA stanza (WAL, NORMAL sync, 256 MB cache, 1 GB mmap, 64 MB WAL cap) + 9 per-driver config defaults so single-binary deploys survive 120 services on a 4 GB host. - **DLQ** — durable typed envelopes with disk-bounded replay. -- **Self-instrumentation** — export OtelContext's own spans via `OTEL_EXPORTER_OTLP_ENDPOINT`. +- **Self-instrumentation** — export OtelContext's own spans via `OTEL_EXPORTER_OTLP_ENDPOINT`. Loopback guard prevents recursive feedback. 
## Security diff --git a/docs/IMPLEMENTATION_PLAN.md b/docs/IMPLEMENTATION_PLAN.md index 183899d..a9acde3 100644 --- a/docs/IMPLEMENTATION_PLAN.md +++ b/docs/IMPLEMENTATION_PLAN.md @@ -25,9 +25,9 @@ > older") shipped only the hot side. There is no cold-archive subsystem > today; retention drops by age via `RetentionScheduler`. > -> The current authoritative documents are: -> - [`CLAUDE.md`](../CLAUDE.md) — operator/agent SSoT for what runs today. -> - [`docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md`](superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md) — design record for the 7-tool reduction + SQLite tuning. +> The current authoritative source is [`CLAUDE.md`](../CLAUDE.md) — the +> operator/agent SSoT for what runs today (architecture, GraphRAG, MCP +> surface, storage, SQLite per-driver defaults). > > Read the rest of this file as the original intent, not the current spec. diff --git a/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md b/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md index 4b44a36..72e0624 100644 --- a/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md +++ b/docs/superpowers/specs/2026-04-30-storage-rebalance-plan.md @@ -1,17 +1,6 @@ # Plan — Storage rebalance: drop FTS5, persist vectordb, cap log search -> **⚠️ SUPERSEDED on 2026-05-24 by PR #91 — do not implement this plan as written.** -> -> When 120-service production load surfaced OOMs in May 2026, the diagnosis -> in [`2026-05-24-mcp-7tool-sqlite-survival-design.md`](2026-05-24-mcp-7tool-sqlite-survival-design.md) -> reversed the central trade-off here: vectordb was **removed entirely** -> (along with `find_similar_logs`) and FTS5 was **kept** as the default -> SQLite log-search backend. The 24-hour `search_logs` time-window cap -> from this plan did ship and remains in effect. -> -> Preserved verbatim below for historical reference. - -**Status:** Approved 2026-04-30 — superseded 2026-05-24 by PR #91. 
+**Status:** Approved scope, ready for implementation **Date:** 2026-04-30 **Reviewers:** codex (vectordb persistence design, 2026-04-30); user (scope sign-off, 2026-04-30)
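
Reviewer note on the CHANGELOG hunk in PATCH 1/2: the fail-closed PRAGMA stanza is described only in prose (`journal_mode=WAL`, `synchronous=NORMAL`, 256 MB page cache, 1 GB mmap, 64 MB WAL cap, `busy_timeout=5000`). Below is a minimal sketch of how those sizes could translate into SQLite PRAGMA statements. It is not the code in `internal/storage/factory.go`: the helper name `buildPragmas` is hypothetical, and the mapping of "page cache", "mmap", and "WAL cap" to `cache_size`, `mmap_size`, and `journal_size_limit` is an assumption based on the standard SQLite pragmas.

```go
package main

import "fmt"

// Byte-size units used to spell out the figures quoted in the CHANGELOG entry.
const (
	kib = 1024
	mib = 1024 * kib
	gib = 1024 * mib
)

// buildPragmas is a hypothetical helper (not taken from the patch) that
// renders the tuning values listed in the CHANGELOG as PRAGMA statements.
func buildPragmas() []string {
	return []string{
		"PRAGMA journal_mode=WAL",
		"PRAGMA synchronous=NORMAL",
		// A negative cache_size is interpreted by SQLite as a size in KiB,
		// so 256 MB becomes -262144.
		fmt.Sprintf("PRAGMA cache_size=-%d", 256*mib/kib),
		// mmap_size is given in bytes.
		fmt.Sprintf("PRAGMA mmap_size=%d", 1*gib),
		// Assumption: the "64 MB WAL cap" maps to journal_size_limit (bytes).
		fmt.Sprintf("PRAGMA journal_size_limit=%d", 64*mib),
		// busy_timeout is given in milliseconds.
		"PRAGMA busy_timeout=5000",
	}
}

func main() {
	for _, stmt := range buildPragmas() {
		fmt.Println(stmt)
	}
}
```

"Fail-closed" presumably means startup aborts if any of these statements cannot be applied, rather than continuing with SQLite's defaults; a caller would execute each statement in order and return the first error.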