A CodeRabbit-style automated PR review service. Small scale, production-grade.
Design premise. The only "small" component is the LLM (a weaker local model such as
codellamavia Ollama). Every other concern — delivery guarantees, idempotency, retries, timeouts, concurrency, observability — is held to production standard and designed to scale. The weak model is compensated for with guardrails, never by lowering infrastructure quality.
Functional
- Receive GitHub PR webhook events, produce a review, post it back to the PR.
- Reviews are grounded in the diff itself, plus structural context (callers, types) when a finding needs it — fetched via the language server / code search, not semantic retrieval.
Non-functional (the actual bar)
- At-least-once processing with no silent job loss.
- Idempotent: redelivery or retry never produces duplicate/contradictory comments.
- Bounded latency per stage; no unbounded hangs.
- Horizontally scalable workers; the LLM/GPU is the known bottleneck and is explicitly rate-limited.
- Observable: every event traceable end-to-end by a stable key.
Explicit non-goals (v1)
- Multi-tenant billing, a web UI, fine-tuning. (Architecture leaves room; not built now.)
GitHub
│ PR event (webhook, HMAC-signed)
▼
┌──────────────────────────┐
│ Ingress (Express) │ verify signature → normalize → enqueue
│ POST /webhook │ returns 202 immediately (does NOT wait)
└────────────┬─────────────┘
│ enqueue(jobId = delivery-id)
▼
┌──────────────────────────┐
│ Queue (Redis + BullMQ) │ persistent, reserve→ack, dedup by jobId
│ + Dead-Letter Queue │
└────────────┬─────────────┘
│ reserve job (visibility timeout)
▼
┌──────────────────────────┐
│ Worker pool (N procs) │ bounded concurrency + LLM rate limiter
│ ┌────────────────────┐ │
│ │ 1. Fetch diff │ │ Octokit (real diff, head SHA)
│ │ 2. Chunk (AST) │ │ tree-sitter → per-hunk units
│ │ 3. LLM review │ │ Ollama, timeout+abort, structured output
│ │ 4. Validate/guard │ │ schema + hallucination + confidence gate
│ │ 5. Post (upsert) │ │ Octokit, idempotent comment
│ └────────────────────┘ │
└────────────┬─────────────┘
│ ack on success │ nack→retry/backoff │ exhausted→DLQ
▼
┌──────────────────────────┐
│ State store (Redis/DB) │ processed-keys, comment refs, metrics
└──────────────────────────┘
Cross-cutting: structured logs · metrics · traces · alerts (all keyed by delivery-id)
- Verify first. Reject any request whose
X-Hub-Signature-256HMAC doesn't match the shared secret — before parsing or enqueuing. Raw body is needed for the HMAC, so capture it in the body parser. - Normalize the payload into an internal job:
{ deliveryId, repo, prNumber, headSha, installationId }. It does not fetch the diff here (keeps ingress fast and cheap). - Enqueue and return
202 Acceptedimmediately. The HTTP response is fully decoupled from review work — GitHub sees millisecond responses regardless of model latency. - Filter events early: only
pull_requestwith action in{opened, synchronize, reopened}; ack-and-ignore the rest so they don't occupy the queue.
Replaces the in-memory SimpleQueue. Provides the properties the demo queue can't:
- Persistence — survives restarts; nothing lost in memory.
- Reserve → ack/nack with a visibility timeout — a job is invisible while a worker holds it and returns to the queue automatically if the worker dies mid-job.
- Dedup —
jobId = deliveryIddrops a duplicate already queued. - Backoff + attempts — native retry policy.
- DLQ — failed-past-max jobs retained for inspection/replay, never dropped.
Stateless processes (scale by adding more). Each pulls jobs and runs the five-step pipeline below. Concurrency is bounded per worker, and a global LLM rate limiter (token-bucket in Redis) caps concurrent model calls across the whole fleet — because the GPU, not CPU, is the scarce resource.
The diff is usually enough. When a finding needs facts from outside the hunk (a changed symbol's callers or type), fetch them structurally — language server / grep / code search — because that context is exact. No vector DB: semantic search returns plausible-but-wrong neighbors for code, and at this scale there is no large review-history corpus to make fuzzy recall worthwhile. Any such lookup is degradable — proceed without it and lower confidence rather than failing the job.
The weak model, wrapped in guardrails — see §6. Every call is time-boxed and abortable.
Posts the review. Idempotent by construction: one bot comment per PR, keyed on head SHA, upserted (update-in-place) rather than created — so retries and redeliveries converge to one comment, not a pile.
Redis (or Postgres if we want durability/history): processed-keys for idempotency, mapping of (repo, pr) → comment id, and counters/metrics.
- Webhook arrives → HMAC verified → normalized →
enqueue(jobId=deliveryId)→202. - Worker reserves the job (visibility timeout starts).
- Staleness check: if
headShano longer matches the PR's current head, drop — a newer event supersedes this one. - Fetch diff (Octokit) → chunk with tree-sitter into per-function/per-hunk units.
- For each chunk: LLM review (timeboxed) → validate schema → guardrails.
- Aggregate valid findings → upsert the single PR comment.
- Record processed-key
(repo:pr:sha)→ ack. - On failure: nack → BullMQ backoff-retries; on attempt exhaustion → DLQ + alert.
| Concern | Approach |
|---|---|
| Delivery | Webhooks are at-least-once; treat everything as at-least-once end to end. |
| Job durability | Persistent queue + reserve→ack. Crash mid-job → visibility timeout returns it. Never pop-then-forget. |
| Idempotency | Stable key (deliveryId, and repo:pr:sha for the side effect). Comment is upserted. Processed-keys short-circuit re-runs. |
| Retries | Exponential backoff + jitter, capped attempts. Classify errors: transient (network/5xx/timeout/bad-LLM-output) → retry; permanent (malformed payload/4xx) → straight to DLQ. |
| Timeouts | Per-LLM-call timeout via AbortController (actually cancels the request). Per-job timeout at the queue reclaims stuck jobs. |
| Poison messages | Attempt cap → DLQ, so one bad job can't retry forever or block the fleet. |
| Ordering / supersession | Key work by PR; a new headSha supersedes in-flight stale jobs (staleness check in step 3). Latest commit wins. |
| Partial failure | Per-step fatal/degradable policy: structural-context lookup degrades; diff-fetch/post failure retries; validation failure retries the LLM step. |
| Backpressure | Bounded queue depth + concurrency; shed or defer when saturated rather than melting Ollama. |
A less capable model fails by producing confidently wrong, malformed, or off-format output rather than crashing. The production work is containment:
- Force structure. Constrain output to a strict JSON schema (
{severity, file, line, comment}). Validate it; a schema violation is a retryable failure routed through the same retry machinery as a network error. - Shrink the task. Weak models do far better on small, scoped prompts. tree-sitter chunks the diff into single functions/hunks so the model reviews one unit at a time instead of a giant diff.
- Ground it. Low temperature reduces variance; when needed, exact structural context (not fuzzy retrieval) anchors the model to real symbols.
- Verify claims. Reject findings that cite a line not present in the diff (cheap anti-hallucination check).
- Confidence gate. Low-confidence or unverifiable findings are suppressed or flagged for human review rather than posted as authoritative.
- Determinism for idempotency. Key generation on
(model, prompt-hash)so retries/redeliveries don't regenerate divergent reviews.
- HMAC signature verification on every webhook (reject before processing).
- Secrets (GitHub App key, webhook secret, Redis creds) from a secret manager / env, never in code.
- Least-privilege GitHub App permissions (read diff, write PR comments only).
- Treat diff content as untrusted input to the prompt; guard against prompt-injection in code/comments (don't let repo content override system instructions).
- Structured logs keyed by
deliveryIdfor every stage transition (ingress → queued → reserved → llm → posted → acked/→ dlq). - Metrics: queue depth, in-flight count, retry rate, DLQ size, LLM latency (p50/p95), tokens/cost per job, end-to-end review latency.
- Traces across the five pipeline steps.
- Alerts: DLQ growth, queue-depth ceiling, LLM error/timeout rate, Ollama unavailability.
- Bottleneck is the LLM/GPU, not the web tier. Scale by adding workers up to the rate limiter's ceiling; beyond that, add model capacity.
- Web tier: stateless, scales trivially behind a load balancer.
- Queue: Redis handles far more than "small scale" needs; partition by repo/installation later if required.
- Workers: stateless and horizontally scalable; concurrency bounded per worker, globally capped by the shared token-bucket rate limiter.
| Layer | Choice | Why |
|---|---|---|
| Ingress | Express | Already in use; minimal. |
| Queue | Redis + BullMQ | Persistence, reserve→ack, backoff, DLQ, dedup out of the box. |
| LLM | Ollama (codellama) |
Local, cheap, self-hosted; weak → needs guardrails. |
| Parsing | tree-sitter | Language-aware chunking for small, scoped prompts. |
| Context | language server / grep / code search | Exact structural lookups; no vector DB. |
| GitHub | Octokit (GitHub App) | Real diffs + idempotent comment upsert. |
| State | Redis (→ Postgres if history needed) | Idempotency keys, comment refs, metrics. |
| Failure | Detection | Response |
|---|---|---|
| Ollama down/slow | LLM timeout | Retry w/ backoff → DLQ + alert if sustained |
| Malformed webhook | Schema/HMAC check | 4xx at ingress; no enqueue |
| Bad LLM output | Schema validation | Retry LLM step (capped) → drop finding if persistent |
| Worker crash mid-job | Visibility timeout | Job returns to queue, another worker retries |
| Duplicate delivery | jobId + processed-key | Deduped / no-op |
| Stale commit | headSha mismatch | Drop superseded job |
| Context lookup fails | Query error | Degrade: proceed without it, lower confidence |
| Poison job | Attempt cap | DLQ + alert |
- Durable history required (audit/replay)? → promotes state store from Redis to Postgres.
- Single-repo or multi-installation from day one? → affects rate-limiter partitioning.
- Human-in-the-loop for low-confidence reviews, or auto-post with a confidence label?