AI Code Reviewer
An autonomous agent that reviews GitHub pull requests with code-aware retrieval and tool use
The Challenge
Off-the-shelf AI code review tools either operate only on the diff (missing critical context like callers, type definitions, and tests) or paste the entire repository into a single mega-prompt (slow, expensive, and noisy). The harder, more interesting problem is building an agent that retrieves the right context for each finding, justifies its reasoning, and stays cheap and fast enough to run on every PR. Doing this well requires real retrieval engineering — not LangChain glue — plus evals that catch regressions before they ship, and production-grade observability so you can debug an agent that thinks for itself.
The Approach
Designed as a TypeScript + Python monorepo. The agent core lives in packages/agent — a hand-written loop that calls Anthropic Claude with a tool-use interface and explicit termination conditions (stop sequence, max iterations, hard cost cap). Retrieval is a composable hybrid pipeline: tree-sitter produces AST-aware code chunks, Voyage's voyage-code-3 embeddings power semantic search over pgvector, BM25 handles lexical recall, and Cohere rerank-3 fuses the two with cross-encoder scoring. A Python indexer running on Modal handles repository ingestion and eval runs. Evals replay 50+ historical PRs from popular open-source repos through the agent and score outputs with an LLM-as-judge plus deterministic checks (did it flag the regression that the original reviewer flagged?). Production concerns are first-class: prompt caching reduces input tokens by 80%+ on follow-up turns, semantic caching short-circuits near-duplicate queries, a model router picks Haiku for cheap classification and Sonnet for synthesis, and prompt-injection defenses prevent malicious code comments from hijacking the agent. Every run is traced end-to-end in Langfuse.
Key Features
What Makes It Work
Hand-Written Agent Loop with Tool Use
A bespoke loop that calls Claude with a Zod-typed tool registry — search_code, read_file, find_references, run_tests, get_pr_discussion — with explicit termination on stop sequence, iteration cap, or cost ceiling. No framework, no leaky abstractions.
Hybrid Code-Aware Retrieval
Tree-sitter AST chunking preserves function and class boundaries. BM25 + Voyage voyage-code-3 embeddings + Cohere rerank-3 are composed as inspectable steps you can swap, with contextual chunk prefixing for semantic recall on short snippets.
Real Evals on Historical OSS PRs
A golden dataset of 50+ PRs from real open-source repos, replayed through the agent and scored by an LLM-as-judge plus deterministic checks. Eval deltas gate every prompt change, retrieval tweak, and model upgrade.
Production-Grade Observability
Full request-level tracing in Langfuse — tool calls, token counts, retrieval scores, model latencies. Sentry for error tracking. Cost and latency dashboards expose exactly where each dollar and second goes.
Cost & Latency Optimizations
Prompt caching on the system prompt and tool schemas, semantic caching on near-duplicate retrieval queries, and a model router that sends classification to Haiku and synthesis to Sonnet. Average PR review cost stays under $0.20.
Prompt-Injection Defense
Adversarial inputs in code comments, commit messages, and PR descriptions are sandboxed in a separate user-role turn with explicit instructions, never blended into the system prompt. Injection attempts are logged and surfaced.
The Impact
Results That Matter
Phase 1
Project Phase
Foundations — agent loop, retrieval pipeline, and golden eval set in active development across a 6-phase roadmap
80%+
Eval Target
Goal: match or exceed the original reviewer's findings on 80%+ of replayed PRs from a 50+ OSS golden dataset
<$0.50
Cost Target
Per-PR cost budget enforced through prompt caching, semantic caching, and Haiku/Sonnet model routing
Tech Stack