TurtleCode — Production Architecture

A CodeRabbit-style automated PR review service. Small scale, production-grade.

Design premise. The only "small" component is the LLM (a weaker local model such as codellama via Ollama). Every other concern — delivery guarantees, idempotency, retries, timeouts, concurrency, observability — is held to production standard and designed to scale. The weak model is compensated for with guardrails, never by lowering infrastructure quality.

1. Goals & constraints

Functional

Receive GitHub PR webhook events, produce a review, post it back to the PR.
Reviews are grounded in the diff itself, plus structural context (callers, types) when a finding needs it — fetched via the language server / code search, not semantic retrieval.

Non-functional (the actual bar)

At-least-once processing with no silent job loss.
Idempotent: redelivery or retry never produces duplicate/contradictory comments.
Bounded latency per stage; no unbounded hangs.
Horizontally scalable workers; the LLM/GPU is the known bottleneck and is explicitly rate-limited.
Observable: every event traceable end-to-end by a stable key.

Explicit non-goals (v1)

Multi-tenant billing, a web UI, fine-tuning. (Architecture leaves room; not built now.)

2. High-level architecture

                    GitHub
                      │  PR event (webhook, HMAC-signed)
                      ▼
        ┌──────────────────────────┐
        │  Ingress (Express)       │   verify signature → normalize → enqueue
        │  POST /webhook           │   returns 202 immediately (does NOT wait)
        └────────────┬─────────────┘
                     │ enqueue(jobId = delivery-id)
                     ▼
        ┌──────────────────────────┐
        │  Queue (Redis + BullMQ)  │   persistent, reserve→ack, dedup by jobId
        │  + Dead-Letter Queue     │
        └────────────┬─────────────┘
                     │ reserve job (visibility timeout)
                     ▼
        ┌──────────────────────────┐
        │  Worker pool (N procs)   │   bounded concurrency + LLM rate limiter
        │  ┌────────────────────┐  │
        │  │ 1. Fetch diff      │  │  Octokit (real diff, head SHA)
        │  │ 2. Chunk (AST)     │  │  tree-sitter → per-hunk units
        │  │ 3. LLM review      │  │  Ollama, timeout+abort, structured output
        │  │ 4. Validate/guard  │  │  schema + hallucination + confidence gate
        │  │ 5. Post (upsert)   │  │  Octokit, idempotent comment
        │  └────────────────────┘  │
        └────────────┬─────────────┘
                     │ ack on success │ nack→retry/backoff │ exhausted→DLQ
                     ▼
        ┌──────────────────────────┐
        │  State store (Redis/DB)  │   processed-keys, comment refs, metrics
        └──────────────────────────┘

  Cross-cutting: structured logs · metrics · traces · alerts (all keyed by delivery-id)

3. Components

3.1 Ingress (Express)

Verify first. Reject any request whose X-Hub-Signature-256 HMAC doesn't match the shared secret — before parsing or enqueuing. Raw body is needed for the HMAC, so capture it in the body parser.
Normalize the payload into an internal job: { deliveryId, repo, prNumber, headSha, installationId }. It does not fetch the diff here (keeps ingress fast and cheap).
Enqueue and return 202 Accepted immediately. The HTTP response is fully decoupled from review work — GitHub sees millisecond responses regardless of model latency.
Filter events early: only pull_request with action in {opened, synchronize, reopened}; ack-and-ignore the rest so they don't occupy the queue.

3.2 Queue (Redis + BullMQ)

Replaces the in-memory SimpleQueue. Provides the properties the demo queue can't:

Persistence — survives restarts; nothing lost in memory.
Reserve → ack/nack with a visibility timeout — a job is invisible while a worker holds it and returns to the queue automatically if the worker dies mid-job.
Dedup — jobId = deliveryId drops a duplicate already queued.
Backoff + attempts — native retry policy.
DLQ — failed-past-max jobs retained for inspection/replay, never dropped.

3.3 Worker pool

Stateless processes (scale by adding more). Each pulls jobs and runs the five-step pipeline below. Concurrency is bounded per worker, and a global LLM rate limiter (token-bucket in Redis) caps concurrent model calls across the whole fleet — because the GPU, not CPU, is the scarce resource.

3.4 Context (optional, structural)

The diff is usually enough. When a finding needs facts from outside the hunk (a changed symbol's callers or type), fetch them structurally — language server / grep / code search — because that context is exact. No vector DB: semantic search returns plausible-but-wrong neighbors for code, and at this scale there is no large review-history corpus to make fuzzy recall worthwhile. Any such lookup is degradable — proceed without it and lower confidence rather than failing the job.

3.5 LLM agent core (Ollama)

The weak model, wrapped in guardrails — see §6. Every call is time-boxed and abortable.

3.6 Output (Octokit)

Posts the review. Idempotent by construction: one bot comment per PR, keyed on head SHA, upserted (update-in-place) rather than created — so retries and redeliveries converge to one comment, not a pile.

3.7 State store

Redis (or Postgres if we want durability/history): processed-keys for idempotency, mapping of (repo, pr) → comment id, and counters/metrics.

4. Lifecycle of one review

Webhook arrives → HMAC verified → normalized → enqueue(jobId=deliveryId) → 202.
Worker reserves the job (visibility timeout starts).
Staleness check: if headSha no longer matches the PR's current head, drop — a newer event supersedes this one.
Fetch diff (Octokit) → chunk with tree-sitter into per-function/per-hunk units.
For each chunk: LLM review (timeboxed) → validate schema → guardrails.
Aggregate valid findings → upsert the single PR comment.
Record processed-key (repo:pr:sha) → ack.
On failure: nack → BullMQ backoff-retries; on attempt exhaustion → DLQ + alert.

5. Reliability

Concern	Approach
Delivery	Webhooks are at-least-once; treat everything as at-least-once end to end.
Job durability	Persistent queue + reserve→ack. Crash mid-job → visibility timeout returns it. Never pop-then-forget.
Idempotency	Stable key (`deliveryId`, and `repo:pr:sha` for the side effect). Comment is upserted. Processed-keys short-circuit re-runs.
Retries	Exponential backoff + jitter, capped attempts. Classify errors: transient (network/5xx/timeout/bad-LLM-output) → retry; permanent (malformed payload/4xx) → straight to DLQ.
Timeouts	Per-LLM-call timeout via `AbortController` (actually cancels the request). Per-job timeout at the queue reclaims stuck jobs.
Poison messages	Attempt cap → DLQ, so one bad job can't retry forever or block the fleet.
Ordering / supersession	Key work by PR; a new `headSha` supersedes in-flight stale jobs (staleness check in step 3). Latest commit wins.
Partial failure	Per-step fatal/degradable policy: structural-context lookup degrades; diff-fetch/post failure retries; validation failure retries the LLM step.
Backpressure	Bounded queue depth + concurrency; shed or defer when saturated rather than melting Ollama.

6. Compensating for a weak LLM

A less capable model fails by producing confidently wrong, malformed, or off-format output rather than crashing. The production work is containment:

Force structure. Constrain output to a strict JSON schema ({severity, file, line, comment}). Validate it; a schema violation is a retryable failure routed through the same retry machinery as a network error.
Shrink the task. Weak models do far better on small, scoped prompts. tree-sitter chunks the diff into single functions/hunks so the model reviews one unit at a time instead of a giant diff.
Ground it. Low temperature reduces variance; when needed, exact structural context (not fuzzy retrieval) anchors the model to real symbols.
Verify claims. Reject findings that cite a line not present in the diff (cheap anti-hallucination check).
Confidence gate. Low-confidence or unverifiable findings are suppressed or flagged for human review rather than posted as authoritative.
Determinism for idempotency. Key generation on (model, prompt-hash) so retries/redeliveries don't regenerate divergent reviews.

7. Security

HMAC signature verification on every webhook (reject before processing).
Secrets (GitHub App key, webhook secret, Redis creds) from a secret manager / env, never in code.
Least-privilege GitHub App permissions (read diff, write PR comments only).
Treat diff content as untrusted input to the prompt; guard against prompt-injection in code/comments (don't let repo content override system instructions).

8. Observability

Structured logs keyed by deliveryId for every stage transition (ingress → queued → reserved → llm → posted → acked / → dlq).
Metrics: queue depth, in-flight count, retry rate, DLQ size, LLM latency (p50/p95), tokens/cost per job, end-to-end review latency.
Traces across the five pipeline steps.
Alerts: DLQ growth, queue-depth ceiling, LLM error/timeout rate, Ollama unavailability.

9. Scaling model

Bottleneck is the LLM/GPU, not the web tier. Scale by adding workers up to the rate limiter's ceiling; beyond that, add model capacity.
Web tier: stateless, scales trivially behind a load balancer.
Queue: Redis handles far more than "small scale" needs; partition by repo/installation later if required.
Workers: stateless and horizontally scalable; concurrency bounded per worker, globally capped by the shared token-bucket rate limiter.

10. Technology choices

Layer	Choice	Why
Ingress	Express	Already in use; minimal.
Queue	Redis + BullMQ	Persistence, reserve→ack, backoff, DLQ, dedup out of the box.
LLM	Ollama (`codellama`)	Local, cheap, self-hosted; weak → needs guardrails.
Parsing	tree-sitter	Language-aware chunking for small, scoped prompts.
Context	language server / grep / code search	Exact structural lookups; no vector DB.
GitHub	Octokit (GitHub App)	Real diffs + idempotent comment upsert.
State	Redis (→ Postgres if history needed)	Idempotency keys, comment refs, metrics.

11. Failure-mode summary

Failure	Detection	Response
Ollama down/slow	LLM timeout	Retry w/ backoff → DLQ + alert if sustained
Malformed webhook	Schema/HMAC check	4xx at ingress; no enqueue
Bad LLM output	Schema validation	Retry LLM step (capped) → drop finding if persistent
Worker crash mid-job	Visibility timeout	Job returns to queue, another worker retries
Duplicate delivery	jobId + processed-key	Deduped / no-op
Stale commit	headSha mismatch	Drop superseded job
Context lookup fails	Query error	Degrade: proceed without it, lower confidence
Poison job	Attempt cap	DLQ + alert

12. Open questions

Durable history required (audit/replay)? → promotes state store from Redis to Postgres.
Single-repo or multi-installation from day one? → affects rate-limiter partitioning.
Human-in-the-loop for low-confidence reviews, or auto-post with a confidence label?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TurtleCode — Production Architecture

1. Goals & constraints

2. High-level architecture

3. Components

3.1 Ingress (Express)

3.2 Queue (Redis + BullMQ)

3.3 Worker pool

3.4 Context (optional, structural)

3.5 LLM agent core (Ollama)

3.6 Output (Octokit)

3.7 State store

4. Lifecycle of one review

5. Reliability

6. Compensating for a weak LLM

7. Security

8. Observability

9. Scaling model

10. Technology choices

11. Failure-mode summary

12. Open questions

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History

ARCHITECTURE.md

File metadata and controls

TurtleCode — Production Architecture

1. Goals & constraints

2. High-level architecture

3. Components

3.1 Ingress (Express)

3.2 Queue (Redis + BullMQ)

3.3 Worker pool

3.4 Context (optional, structural)

3.5 LLM agent core (Ollama)

3.6 Output (Octokit)

3.7 State store

4. Lifecycle of one review

5. Reliability

6. Compensating for a weak LLM

7. Security

8. Observability

9. Scaling model

10. Technology choices

11. Failure-mode summary

12. Open questions