53 changes: 53 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,16 @@ last published pre-release tag (`v0.0.11-beta.15`).

### Added

- **SQLite survival tuning for 120-service production load** ([#91]):
- Fail-closed PRAGMA stanza in `internal/storage/factory.go` —
`journal_mode=WAL`, `synchronous=NORMAL`, 256 MB page cache, 1 GB
mmap, 64 MB WAL cap, `busy_timeout=5000`.
- Per-driver config defaults — `config.applyDriverDefaults` overrides
9 tunables (conn pool, ingest workers/queue, metric cardinality,
store-min-severity, sampling rate, gRPC stream cap, `LOG_FTS_ENABLED`)
when `DB_DRIVER=sqlite` and the operator did not set the env var
explicitly. See [`CLAUDE.md`](CLAUDE.md) "SQLite per-driver defaults"
for the full table.
- **Multi-tenancy across the stack** — tenant context plumbed end-to-end:
- GraphRAG: in-memory stores partitioned per tenant + query context
propagation. ([#27], RAN-37)
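
The fail-closed PRAGMA stanza described above can be sketched roughly as follows. This is an illustration, not the code in `internal/storage/factory.go`: the `execer` interface, `applyPragmas`, and `fakeDB` are hypothetical names, and the choice of `cache_size`, `mmap_size`, and `journal_size_limit` as the PRAGMAs behind "256 MB page cache", "1 GB mmap", and "64 MB WAL cap" is an assumption based on standard SQLite.

```go
package main

import (
	"errors"
	"fmt"
)

// execer is the single method this sketch needs. A *sql.DB could be
// adapted to it with a thin wrapper (hypothetical interface).
type execer interface {
	Exec(query string) error
}

// sqlitePragmas mirrors the values in the changelog entry. The PRAGMA
// names for cache, mmap, and WAL cap are assumptions.
var sqlitePragmas = []string{
	"PRAGMA journal_mode=WAL",
	"PRAGMA synchronous=NORMAL",
	"PRAGMA cache_size=-262144",          // negative value is KiB: 256 MiB
	"PRAGMA mmap_size=1073741824",        // bytes: 1 GiB
	"PRAGMA journal_size_limit=67108864", // bytes: 64 MiB
	"PRAGMA busy_timeout=5000",           // milliseconds
}

// applyPragmas is fail-closed: the first PRAGMA that errors aborts the
// whole stanza so the caller can refuse to start.
func applyPragmas(db execer) error {
	for _, p := range sqlitePragmas {
		if err := db.Exec(p); err != nil {
			return fmt.Errorf("apply %q: %w", p, err)
		}
	}
	return nil
}

// fakeDB records accepted statements and fails on one chosen statement.
type fakeDB struct {
	failOn string
	seen   []string
}

func (f *fakeDB) Exec(q string) error {
	if q == f.failOn {
		return errors.New("simulated failure")
	}
	f.seen = append(f.seen, q)
	return nil
}

func main() {
	ok := &fakeDB{}
	fmt.Println(applyPragmas(ok), len(ok.seen))

	bad := &fakeDB{failOn: "PRAGMA mmap_size=1073741824"}
	fmt.Println(applyPragmas(bad) != nil, len(bad.seen))
}
```

Returning on the first error, rather than logging and continuing, is what makes the stanza fail-closed: a database that silently ignored the WAL or cache settings would otherwise run with untuned defaults.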
@@ -45,14 +55,49 @@ last published pre-release tag (`v0.0.11-beta.15`).

### Changed

- **MCP surface reduced from 21 tools to 7 triage-essential tools** ([#91]).
Kept: `get_anomaly_timeline`, `get_service_map`, `get_service_health`,
`root_cause_analysis`, `impact_analysis`, `trace_graph`, `search_logs`.
Clients calling a removed tool now receive `unknown tool` RPC errors; see CLAUDE.md
"MCP Server" for the full keep/cut list and rationale.
- **`LOG_FTS_ENABLED` defaults to `true` on SQLite** ([#91]). FTS5 BM25
ranking became the default log-search backend on the SQLite path,
replacing the vectordb TF-IDF dispatch. Operators who need the ~30%
disk savings can opt out via `LOG_FTS_ENABLED=false` + `POST /api/admin/drop_fts`.
- CI: replaced the deleted central-ops reusable workflow with a local
`ci.yml` so otelcontext owns its quality gates without relying on an
external repo. ([#26])
- Post-robustness follow-ups consolidated as a single chore pass over the
100–200-service work. ([#25])
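
As a rough illustration of the reduced MCP surface, an allowlist check over the seven kept tool names could look like the sketch below. `keptTools`, `rpcError`, and `checkTool` are hypothetical names, and the `-32601` code (JSON-RPC 2.0 "Method not found") is an assumption; the changelog only states that the error message is `unknown tool`.

```go
package main

import "fmt"

// keptTools lists the 7 tools named in the changelog entry.
var keptTools = map[string]bool{
	"get_anomaly_timeline": true,
	"get_service_map":      true,
	"get_service_health":   true,
	"root_cause_analysis":  true,
	"impact_analysis":      true,
	"trace_graph":          true,
	"search_logs":          true,
}

// rpcError follows the JSON-RPC 2.0 error object shape (code + message).
type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// checkTool returns nil for a kept tool and an "unknown tool" error for
// anything else. The numeric code is an assumption, not taken from the PR.
func checkTool(name string) *rpcError {
	if keptTools[name] {
		return nil
	}
	return &rpcError{Code: -32601, Message: "unknown tool: " + name}
}

func main() {
	for _, name := range []string{"search_logs", "find_similar_logs"} {
		if err := checkTool(name); err != nil {
			fmt.Println(name, "->", err.Message)
			continue
		}
		fmt.Println(name, "-> ok")
	}
}
```

Clients written against the older surface can use a check like this to detect a removed tool up front instead of discovering it from a failed call.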

### Removed

- **`internal/vectordb/` package + `find_similar_logs` MCP tool** ([#91]).
The TF-IDF semantic-search index added 163 MB steady-state RAM with a
290 MB peak during FIFO eviction, plus a snapshot loop and DB
tail-replay goroutine. Log similarity now comes from Drain template
clustering plus FTS5 BM25 ranking. `data/vectordb.snapshot` is left
on disk for operators to delete by hand.
- **`graph_snapshots` table + scheduler** ([#91]). The 15-minute
topology snapshot path (`internal/graphrag/snapshot.go`) wrote ~480
MB/day to disk for a `get_graph_snapshot` MCP tool nothing in
production ever called. AutoMigrate no longer creates the table on
fresh deploys; existing populated tables are left in place
(`DROP TABLE graph_snapshots; VACUUM;` to reclaim disk).
- **`github.com/RandomCodeSpace/central-ops` Go module dependency** ([#91]).
Replaced two tiny helpers (`pkg/version.Detect`, `pkg/httputil.CORSMiddleware`)
with inline equivalents in `main.go` and `internal/mcp/server.go`.
Build now succeeds against a 100% public module graph.

### Fixed

- **OOM at ~120 services on SQLite under continuous load** ([#91]). At
default config the binary did not survive an hour: ingest pipeline
queue saturation under SQLite WAL contention pinned 0.5–5 GB of
pending batches, GraphRAG permanent stores grew without TTL, and the
TSDB ring buffer grew to multiple GB at default cardinality. The 7-tool
surface reduction plus SQLite-tuned defaults bring steady-state RSS to
~1.8 GB with bounded bursts at the 120-service target.
- **MCP**: propagate `cfg.DefaultTenant` to the MCP fallback path so
tools invoked without an explicit tenant resolve to the configured
default rather than failing. ([#33], RAN-22)
@@ -65,6 +110,13 @@ last published pre-release tag (`v0.0.11-beta.15`).

### Security

- **OSV-Scanner advisory clean-up** ([#91]):
- `golang.org/x/crypto` v0.50.0 → v0.52.0 (12 advisories: GO-2026-5005..5023, 5033).
- `golang.org/x/net` v0.53.0 → v0.55.0 (6 advisories: GO-2026-5025..5030).
- `golang.org/x/sys` v0.43.0 → v0.45.0 (1 advisory: GO-2026-5024).
- Go stdlib `go.mod` directive 1.25.9 → 1.25.10 (8 stdlib advisories).
- `brace-expansion` 5.0.5 → 5.0.6 via npm overrides
(GHSA-jxxr-4gwj-5jf2, CVSS 6.5).
- Adopted the OSS-CLI security stack as the project's continuous
supply-chain observability surface (Semgrep + OSV-Scanner + Trivy +
Gitleaks + jscpd + anchore SBOM). High/Critical findings are merge
@@ -85,3 +137,4 @@ last published pre-release tag (`v0.0.11-beta.15`).
[#33]: https://github.com/RandomCodeSpace/otelcontext/pull/33
[#34]: https://github.com/RandomCodeSpace/otelcontext/pull/34
[#47]: https://github.com/RandomCodeSpace/otelcontext/pull/47
[#91]: https://github.com/RandomCodeSpace/otelcontext/pull/91
33 changes: 27 additions & 6 deletions README.md
@@ -79,6 +79,25 @@ DB_DSN="root:password@tcp(localhost:3306)/otelcontext?charset=utf8mb4&parseTime=

See [`.env.example`](.env.example) for SQL Server and Azure Entra (passwordless Postgres) configurations.

## Production sizing

OtelContext auto-tunes itself by driver. The numbers below assume the
auto-flipped SQLite defaults (5% sampling baseline, `STORE_MIN_SEVERITY=WARN`,
3k metric cardinality cap, FTS5 enabled, 1 SQLite writer with WAL + 256 MB
page cache + 1 GB mmap). Postgres keeps the looser defaults.

| Workload | DB | Steady RSS | Notes |
|---|---|---|---|
| Dev / <10 services | SQLite | <500 MB | Default config; no tuning needed. |
| 50–120 services, 7-day retention | **SQLite (auto-tuned)** or Postgres | ~1.8 GB | SQLite survives this band on the auto-flipped defaults. |
| >120 services, or >7-day retention, or sustained 50+ writes/sec | **Postgres** | depends on host | SQLite's single-writer serialization becomes the bottleneck. |

`OTELCONTEXT_ALLOW_SQLITE_PROD=false` is the guardrail — `APP_ENV=production` with `DB_DRIVER=sqlite` refuses to start unless the operator opts in.
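
The guardrail can be pictured as a startup check along these lines. This is a sketch, not the project's implementation: `checkSQLiteProd` is a hypothetical name, and treating the literal value `true` as the opt-in is an assumption (the README only states that the default is `false`).

```go
package main

import (
	"errors"
	"fmt"
)

// checkSQLiteProd refuses the production + SQLite combination unless the
// operator has opted in. getenv is injected so the check is testable;
// os.Getenv has the same signature.
func checkSQLiteProd(getenv func(string) string) error {
	if getenv("APP_ENV") != "production" || getenv("DB_DRIVER") != "sqlite" {
		return nil
	}
	// Assumption: the opt-in value is the literal string "true".
	if getenv("OTELCONTEXT_ALLOW_SQLITE_PROD") == "true" {
		return nil
	}
	return errors.New("DB_DRIVER=sqlite with APP_ENV=production requires OTELCONTEXT_ALLOW_SQLITE_PROD=true")
}

func main() {
	env := map[string]string{"APP_ENV": "production", "DB_DRIVER": "sqlite"}
	getenv := func(k string) string { return env[k] }

	fmt.Println(checkSQLiteProd(getenv) != nil) // refused without opt-in

	env["OTELCONTEXT_ALLOW_SQLITE_PROD"] = "true"
	fmt.Println(checkSQLiteProd(getenv) == nil) // allowed after opt-in
}
```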

See [`CLAUDE.md`](CLAUDE.md) "SQLite per-driver defaults" for the full
table of which env vars get auto-overridden on SQLite, and the rationale
per entry.
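
The "override only when the operator did not set the env var" rule can be sketched as below. This is not `config.applyDriverDefaults` itself: `driverOverrides` and `sqliteDefaults` are hypothetical names, and only the two tunables whose names and SQLite values appear in this PR (`LOG_FTS_ENABLED=true`, `STORE_MIN_SEVERITY=WARN`) are shown; the remaining seven are listed in CLAUDE.md.

```go
package main

import "fmt"

// sqliteDefaults holds two of the nine per-driver defaults; the other
// seven are omitted because their env var names are not given here.
var sqliteDefaults = map[string]string{
	"LOG_FTS_ENABLED":    "true",
	"STORE_MIN_SEVERITY": "WARN",
}

// driverOverrides returns the defaults to apply for the given driver.
// lookup mirrors os.LookupEnv: a variable that is set, even to an empty
// string, counts as an explicit operator choice and is left alone.
func driverOverrides(driver string, lookup func(string) (string, bool)) map[string]string {
	out := map[string]string{}
	if driver != "sqlite" {
		return out
	}
	for key, val := range sqliteDefaults {
		if _, set := lookup(key); !set {
			out[key] = val
		}
	}
	return out
}

func main() {
	// The operator pinned STORE_MIN_SEVERITY; LOG_FTS_ENABLED is unset.
	env := map[string]string{"STORE_MIN_SEVERITY": "INFO"}
	lookup := func(k string) (string, bool) { v, ok := env[k]; return v, ok }

	fmt.Println(driverOverrides("sqlite", lookup))
	fmt.Println(driverOverrides("postgres", lookup))
}
```

Using a presence check rather than an empty-string check is the design point: it lets an operator deliberately set a variable to an empty or "looser" value without the auto-tuning overwriting it.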

## OTLP Integration

OtelContext accepts OTLP gRPC on `:4317` and OTLP HTTP on `:8080/v1/{traces,logs,metrics}`. Point any OpenTelemetry Collector (or SDK) at it:
@@ -104,14 +123,16 @@ See `docs/otel-collector-example.yaml` for a complete example.

## Features

-- **OTLP gRPC + HTTP ingest** — traces, logs, metrics; gzip and protobuf/JSON supported.
-- **GraphRAG** — layered in-memory graph with error-chain, impact, and root-cause queries.
+- **OTLP gRPC + HTTP ingest** — traces, logs, metrics; gzip and protobuf/JSON supported. Hybrid backpressure (90% soft-drop, 100% reject) prevents queue OOMs.
+- **GraphRAG** — layered in-memory graph with `error_chain`, `impact_analysis`, `root_cause_analysis`, and anomaly-correlation queries.
 - **Drain log clustering** — deterministic template mining, persisted across restarts.
-- **MCP server** — 7-tool triage surface for AI agents over JSON-RPC 2.0 + SSE (get_anomaly_timeline, get_service_map, get_service_health, root_cause_analysis, impact_analysis, trace_graph, search_logs).
-- **Multi-tenancy** — per-row `tenant_id`, `X-Tenant-ID` header / `x-tenant-id` gRPC metadata.
-- **Adaptive sampling** — always-on for errors and slow spans, probabilistic otherwise.
+- **MCP server** — 7-tool triage surface for AI agents over JSON-RPC 2.0 + SSE: `get_anomaly_timeline`, `get_service_map`, `get_service_health`, `root_cause_analysis`, `impact_analysis`, `trace_graph`, `search_logs`. Per-call deadlines, concurrency semaphore, 5 s TTL cache for cheap in-memory tools, SSE keep-alives every 25 s.
+- **Log search** — SQLite FTS5 (BM25-ranked) on by default; `pg_trgm` GIN on Postgres; LIKE fallback. `search_logs` is 24-hour-capped to bound the worst-case scan.
+- **Multi-tenancy** — per-row `tenant_id`, `X-Tenant-ID` header / `x-tenant-id` gRPC metadata, per-tenant cardinality caps.
+- **Adaptive sampling** — always-on for errors and slow spans, probabilistic otherwise (defaults to 5 % on SQLite, 100 % on Postgres).
+- **Auto-tuned SQLite path** — fail-closed PRAGMA stanza (WAL, NORMAL sync, 256 MB cache, 1 GB mmap, 64 MB WAL cap) + 9 per-driver config defaults so single-binary deploys survive 120 services on a 4 GB host.
 - **DLQ** — durable typed envelopes with disk-bounded replay.
-- **Self-instrumentation** — export OtelContext's own spans via `OTEL_EXPORTER_OTLP_ENDPOINT`.
+- **Self-instrumentation** — export OtelContext's own spans via `OTEL_EXPORTER_OTLP_ENDPOINT`. Loopback guard prevents recursive feedback.
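
The hybrid backpressure thresholds in the ingest bullet (90% soft-drop, 100% reject) can be illustrated with a small admission function. This is a sketch under stated assumptions: the README does not say what "soft-drop" sheds, so the `lowPriority` flag (only low-priority items are dropped between 90% and 100% fill) is an assumption, and `admit` and `verdict` are hypothetical names.

```go
package main

import "fmt"

// verdict is the admission decision for one incoming batch.
type verdict int

const (
	accept verdict = iota
	softDrop
	reject
)

func (v verdict) String() string {
	return [...]string{"accept", "soft-drop", "reject"}[v]
}

// admit applies the two thresholds to the current queue depth.
// At 100% fill everything is rejected; from 90% fill, low-priority
// batches are dropped (assumed meaning of "soft-drop").
func admit(depth, capacity int, lowPriority bool) verdict {
	switch {
	case depth >= capacity:
		return reject
	case depth*10 >= capacity*9 && lowPriority:
		return softDrop
	default:
		return accept
	}
}

func main() {
	fmt.Println(admit(50, 100, true))
	fmt.Println(admit(90, 100, true))
	fmt.Println(admit(90, 100, false))
	fmt.Println(admit(100, 100, false))
}
```

Whatever the shedding policy, the hard reject at full capacity is what bounds the memory held by pending batches.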

## Security

31 changes: 31 additions & 0 deletions docs/IMPLEMENTATION_PLAN.md
@@ -1,5 +1,36 @@
# OtelContext Observability & Performance Improvement Plan

> **Status — historical planning intent (2026-Q1/Q2). The shipped system has diverged.**
>
> This document captures the *original* design exploration that drove the
> early implementation work. Significant pieces of it no longer reflect
> reality after PR #91 (2026-05-24, "7-tool MCP triage surface + SQLite
> survival tuning"):
>
> - **Vector (embedded) layer — removed.** `internal/vectordb/` (TF-IDF
> index, `find_similar_logs` MCP tool, Section 5.3 below) was deleted
> entirely. GraphRAG's `SimilarErrors` was unused and was dropped with
> it. Log similarity now comes from Drain template clustering plus FTS5
> BM25 ranking — see CLAUDE.md "Storage Architecture".
> - **MCP surface — 11 → 7 tools.** Section 5.4 lists 11 MCP tools as the
> planned surface. The shipped MCP surface is 7: `get_anomaly_timeline`,
> `get_service_map`, `get_service_health`, `root_cause_analysis`,
> `impact_analysis`, `trace_graph`, `search_logs`. Cut: `get_system_graph`,
> `tail_logs`, `get_trace`, `search_traces`, `get_metrics`,
> `get_dashboard_stats`, `get_storage_status`, `find_similar_logs`,
> `get_alerts`, `correlated_signals`, `get_error_chains`,
> `get_investigations`, `get_investigation`, `get_graph_snapshot`,
> `search_cold_archive`.
> - **Hot/cold storage tiering** (goal #4 below, "7 days hot, archive
> older") shipped only the hot side. There is no cold-archive subsystem
> today; retention drops by age via `RetentionScheduler`.
>
> The current authoritative source is [`CLAUDE.md`](../CLAUDE.md) — the
> operator/agent SSoT for what runs today (architecture, GraphRAG, MCP
> surface, storage, SQLite per-driver defaults).
>
> Read the rest of this file as the original intent, not the current spec.

## Context

OtelContext is a self-hosted OTLP observability platform (Go backend + React frontend) that ingests traces, logs, and metrics via gRPC, stores them in a relational DB (SQLite/MySQL/PostgreSQL/MSSQL), and serves dashboards via HTTP/WebSocket. When connected to active services pushing continuous telemetry, the DB will grow unbounded and performance will degrade. The goal is to: