Sole Engineer — Architecture, Implementation, EvalsOngoing2026

AI Code Reviewer

An autonomous agent that reviews GitHub pull requests with code-aware retrieval and tool use

The Challenge

Off-the-shelf AI code review tools either operate only on the diff (missing critical context like callers, type definitions, and tests) or paste the entire repository into a single mega-prompt (slow, expensive, and noisy). The harder, more interesting problem is building an agent that retrieves the right context for each finding, justifies its reasoning, and stays cheap and fast enough to run on every PR. Doing this well requires real retrieval engineering — not LangChain glue — plus evals that catch regressions before they ship, and production-grade observability so you can debug an agent that thinks for itself.

The Approach

Designed as a TypeScript + Python monorepo. The agent core lives in packages/agent — a hand-written loop that calls Anthropic Claude with a tool-use interface and explicit termination conditions (stop sequence, max iterations, hard cost cap). Retrieval is a composable hybrid pipeline: tree-sitter produces AST-aware code chunks, Voyage's voyage-code-3 embeddings power semantic search over pgvector, BM25 handles lexical recall, and Cohere rerank-3 fuses the two with cross-encoder scoring. A Python indexer running on Modal handles repository ingestion and eval runs. Evals replay 50+ historical PRs from popular open-source repos through the agent and score outputs with an LLM-as-judge plus deterministic checks (did it flag the regression that the original reviewer flagged?). Production concerns are first-class: prompt caching reduces input tokens by 80%+ on follow-up turns, semantic caching short-circuits near-duplicate queries, a model router picks Haiku for cheap classification and Sonnet for synthesis, and prompt-injection defenses prevent malicious code comments from hijacking the agent. Every run is traced end-to-end in Langfuse.

Key Features

What Makes It Work

Hand-Written Agent Loop with Tool Use

A bespoke loop that calls Claude with a Zod-typed tool registry — search_code, read_file, find_references, run_tests, get_pr_discussion — with explicit termination on stop sequence, iteration cap, or cost ceiling. No framework, no leaky abstractions.

Hybrid Code-Aware Retrieval

Tree-sitter AST chunking preserves function and class boundaries. BM25 + Voyage voyage-code-3 embeddings + Cohere rerank-3 are composed as inspectable steps you can swap, with contextual chunk prefixing for semantic recall on short snippets.

Real Evals on Historical OSS PRs

A golden dataset of 50+ PRs from real open-source repos, replayed through the agent and scored by an LLM-as-judge plus deterministic checks. Eval deltas gate every prompt change, retrieval tweak, and model upgrade.

Production-Grade Observability

Full request-level tracing in Langfuse — tool calls, token counts, retrieval scores, model latencies. Sentry for error tracking. Cost and latency dashboards expose exactly where each dollar and second goes.

Cost & Latency Optimizations

Prompt caching on the system prompt and tool schemas, semantic caching on near-duplicate retrieval queries, and a model router that sends classification to Haiku and synthesis to Sonnet. Average PR review cost stays under $0.20.

Prompt-Injection Defense

Adversarial inputs in code comments, commit messages, and PR descriptions are sandboxed in a separate user-role turn with explicit instructions, never blended into the system prompt. Injection attempts are logged and surfaced.

The Impact

Results That Matter

Phase 1

Project Phase

Foundations — agent loop, retrieval pipeline, and golden eval set in active development across a 6-phase roadmap

80%+

Eval Target

Goal: match or exceed the original reviewer's findings on 80%+ of replayed PRs from a 50+ OSS golden dataset

<$0.50

Cost Target

Per-PR cost budget enforced through prompt caching, semantic caching, and Haiku/Sonnet model routing

Tech Stack

Built With

TypeScriptNext.jsPythonAnthropic ClaudeVoyage AICoherepgvectorDrizzletree-sitterLangfuseModalVercel