Skip to content

Latest commit

 

History

History
202 lines (158 loc) · 12.2 KB

File metadata and controls

202 lines (158 loc) · 12.2 KB

TurtleCode — Production Architecture

A CodeRabbit-style automated PR review service. Small scale, production-grade.

Design premise. The only "small" component is the LLM (a weaker local model such as codellama via Ollama). Every other concern — delivery guarantees, idempotency, retries, timeouts, concurrency, observability — is held to production standard and designed to scale. The weak model is compensated for with guardrails, never by lowering infrastructure quality.


1. Goals & constraints

Functional

  • Receive GitHub PR webhook events, produce a review, post it back to the PR.
  • Reviews are grounded in the diff itself, plus structural context (callers, types) when a finding needs it — fetched via the language server / code search, not semantic retrieval.

Non-functional (the actual bar)

  • At-least-once processing with no silent job loss.
  • Idempotent: redelivery or retry never produces duplicate/contradictory comments.
  • Bounded latency per stage; no unbounded hangs.
  • Horizontally scalable workers; the LLM/GPU is the known bottleneck and is explicitly rate-limited.
  • Observable: every event traceable end-to-end by a stable key.

Explicit non-goals (v1)

  • Multi-tenant billing, a web UI, fine-tuning. (Architecture leaves room; not built now.)

2. High-level architecture

                    GitHub
                      │  PR event (webhook, HMAC-signed)
                      ▼
        ┌──────────────────────────┐
        │  Ingress (Express)       │   verify signature → normalize → enqueue
        │  POST /webhook           │   returns 202 immediately (does NOT wait)
        └────────────┬─────────────┘
                     │ enqueue(jobId = delivery-id)
                     ▼
        ┌──────────────────────────┐
        │  Queue (Redis + BullMQ)  │   persistent, reserve→ack, dedup by jobId
        │  + Dead-Letter Queue     │
        └────────────┬─────────────┘
                     │ reserve job (visibility timeout)
                     ▼
        ┌──────────────────────────┐
        │  Worker pool (N procs)   │   bounded concurrency + LLM rate limiter
        │  ┌────────────────────┐  │
        │  │ 1. Fetch diff      │  │  Octokit (real diff, head SHA)
        │  │ 2. Chunk (AST)     │  │  tree-sitter → per-hunk units
        │  │ 3. LLM review      │  │  Ollama, timeout+abort, structured output
        │  │ 4. Validate/guard  │  │  schema + hallucination + confidence gate
        │  │ 5. Post (upsert)   │  │  Octokit, idempotent comment
        │  └────────────────────┘  │
        └────────────┬─────────────┘
                     │ ack on success │ nack→retry/backoff │ exhausted→DLQ
                     ▼
        ┌──────────────────────────┐
        │  State store (Redis/DB)  │   processed-keys, comment refs, metrics
        └──────────────────────────┘

  Cross-cutting: structured logs · metrics · traces · alerts (all keyed by delivery-id)

3. Components

3.1 Ingress (Express)

  • Verify first. Reject any request whose X-Hub-Signature-256 HMAC doesn't match the shared secret — before parsing or enqueuing. Raw body is needed for the HMAC, so capture it in the body parser.
  • Normalize the payload into an internal job: { deliveryId, repo, prNumber, headSha, installationId }. It does not fetch the diff here (keeps ingress fast and cheap).
  • Enqueue and return 202 Accepted immediately. The HTTP response is fully decoupled from review work — GitHub sees millisecond responses regardless of model latency.
  • Filter events early: only pull_request with action in {opened, synchronize, reopened}; ack-and-ignore the rest so they don't occupy the queue.

3.2 Queue (Redis + BullMQ)

Replaces the in-memory SimpleQueue. Provides the properties the demo queue can't:

  • Persistence — survives restarts; nothing lost in memory.
  • Reserve → ack/nack with a visibility timeout — a job is invisible while a worker holds it and returns to the queue automatically if the worker dies mid-job.
  • DedupjobId = deliveryId drops a duplicate already queued.
  • Backoff + attempts — native retry policy.
  • DLQ — failed-past-max jobs retained for inspection/replay, never dropped.

3.3 Worker pool

Stateless processes (scale by adding more). Each pulls jobs and runs the five-step pipeline below. Concurrency is bounded per worker, and a global LLM rate limiter (token-bucket in Redis) caps concurrent model calls across the whole fleet — because the GPU, not CPU, is the scarce resource.

3.4 Context (optional, structural)

The diff is usually enough. When a finding needs facts from outside the hunk (a changed symbol's callers or type), fetch them structurally — language server / grep / code search — because that context is exact. No vector DB: semantic search returns plausible-but-wrong neighbors for code, and at this scale there is no large review-history corpus to make fuzzy recall worthwhile. Any such lookup is degradable — proceed without it and lower confidence rather than failing the job.

3.5 LLM agent core (Ollama)

The weak model, wrapped in guardrails — see §6. Every call is time-boxed and abortable.

3.6 Output (Octokit)

Posts the review. Idempotent by construction: one bot comment per PR, keyed on head SHA, upserted (update-in-place) rather than created — so retries and redeliveries converge to one comment, not a pile.

3.7 State store

Redis (or Postgres if we want durability/history): processed-keys for idempotency, mapping of (repo, pr) → comment id, and counters/metrics.


4. Lifecycle of one review

  1. Webhook arrives → HMAC verified → normalized → enqueue(jobId=deliveryId)202.
  2. Worker reserves the job (visibility timeout starts).
  3. Staleness check: if headSha no longer matches the PR's current head, drop — a newer event supersedes this one.
  4. Fetch diff (Octokit) → chunk with tree-sitter into per-function/per-hunk units.
  5. For each chunk: LLM review (timeboxed) → validate schema → guardrails.
  6. Aggregate valid findings → upsert the single PR comment.
  7. Record processed-key (repo:pr:sha)ack.
  8. On failure: nack → BullMQ backoff-retries; on attempt exhaustion → DLQ + alert.

5. Reliability

Concern Approach
Delivery Webhooks are at-least-once; treat everything as at-least-once end to end.
Job durability Persistent queue + reserve→ack. Crash mid-job → visibility timeout returns it. Never pop-then-forget.
Idempotency Stable key (deliveryId, and repo:pr:sha for the side effect). Comment is upserted. Processed-keys short-circuit re-runs.
Retries Exponential backoff + jitter, capped attempts. Classify errors: transient (network/5xx/timeout/bad-LLM-output) → retry; permanent (malformed payload/4xx) → straight to DLQ.
Timeouts Per-LLM-call timeout via AbortController (actually cancels the request). Per-job timeout at the queue reclaims stuck jobs.
Poison messages Attempt cap → DLQ, so one bad job can't retry forever or block the fleet.
Ordering / supersession Key work by PR; a new headSha supersedes in-flight stale jobs (staleness check in step 3). Latest commit wins.
Partial failure Per-step fatal/degradable policy: structural-context lookup degrades; diff-fetch/post failure retries; validation failure retries the LLM step.
Backpressure Bounded queue depth + concurrency; shed or defer when saturated rather than melting Ollama.

6. Compensating for a weak LLM

A less capable model fails by producing confidently wrong, malformed, or off-format output rather than crashing. The production work is containment:

  • Force structure. Constrain output to a strict JSON schema ({severity, file, line, comment}). Validate it; a schema violation is a retryable failure routed through the same retry machinery as a network error.
  • Shrink the task. Weak models do far better on small, scoped prompts. tree-sitter chunks the diff into single functions/hunks so the model reviews one unit at a time instead of a giant diff.
  • Ground it. Low temperature reduces variance; when needed, exact structural context (not fuzzy retrieval) anchors the model to real symbols.
  • Verify claims. Reject findings that cite a line not present in the diff (cheap anti-hallucination check).
  • Confidence gate. Low-confidence or unverifiable findings are suppressed or flagged for human review rather than posted as authoritative.
  • Determinism for idempotency. Key generation on (model, prompt-hash) so retries/redeliveries don't regenerate divergent reviews.

7. Security

  • HMAC signature verification on every webhook (reject before processing).
  • Secrets (GitHub App key, webhook secret, Redis creds) from a secret manager / env, never in code.
  • Least-privilege GitHub App permissions (read diff, write PR comments only).
  • Treat diff content as untrusted input to the prompt; guard against prompt-injection in code/comments (don't let repo content override system instructions).

8. Observability

  • Structured logs keyed by deliveryId for every stage transition (ingress → queued → reserved → llm → posted → acked / → dlq).
  • Metrics: queue depth, in-flight count, retry rate, DLQ size, LLM latency (p50/p95), tokens/cost per job, end-to-end review latency.
  • Traces across the five pipeline steps.
  • Alerts: DLQ growth, queue-depth ceiling, LLM error/timeout rate, Ollama unavailability.

9. Scaling model

  • Bottleneck is the LLM/GPU, not the web tier. Scale by adding workers up to the rate limiter's ceiling; beyond that, add model capacity.
  • Web tier: stateless, scales trivially behind a load balancer.
  • Queue: Redis handles far more than "small scale" needs; partition by repo/installation later if required.
  • Workers: stateless and horizontally scalable; concurrency bounded per worker, globally capped by the shared token-bucket rate limiter.

10. Technology choices

Layer Choice Why
Ingress Express Already in use; minimal.
Queue Redis + BullMQ Persistence, reserve→ack, backoff, DLQ, dedup out of the box.
LLM Ollama (codellama) Local, cheap, self-hosted; weak → needs guardrails.
Parsing tree-sitter Language-aware chunking for small, scoped prompts.
Context language server / grep / code search Exact structural lookups; no vector DB.
GitHub Octokit (GitHub App) Real diffs + idempotent comment upsert.
State Redis (→ Postgres if history needed) Idempotency keys, comment refs, metrics.

11. Failure-mode summary

Failure Detection Response
Ollama down/slow LLM timeout Retry w/ backoff → DLQ + alert if sustained
Malformed webhook Schema/HMAC check 4xx at ingress; no enqueue
Bad LLM output Schema validation Retry LLM step (capped) → drop finding if persistent
Worker crash mid-job Visibility timeout Job returns to queue, another worker retries
Duplicate delivery jobId + processed-key Deduped / no-op
Stale commit headSha mismatch Drop superseded job
Context lookup fails Query error Degrade: proceed without it, lower confidence
Poison job Attempt cap DLQ + alert

12. Open questions

  • Durable history required (audit/replay)? → promotes state store from Redis to Postgres.
  • Single-repo or multi-installation from day one? → affects rate-limiter partitioning.
  • Human-in-the-loop for low-confidence reviews, or auto-post with a confidence label?