I Built a Claude Code Agent That Reviews PRs (Here's the Prompt)

By OrionAI Build Editorial · Published 2026-05-10

I run a small consulting practice. Most of my client work is async PR review — five to twenty-five PRs a day, across four codebases, none of which I have full context on at any given moment. Last month I built a Claude Code agent that does the first review pass. Two weeks in, it has reviewed 312 PRs, caught 41 real issues, and shipped exactly one false positive that mattered (a security finding I'd have caught manually too). Here's the build.

The shape of the problem

The job isn't "find every bug." It's "spot the things that consistently bite you in this kind of project." Different codebases have different consistent biters. Mine are: missing error paths in async handlers, secrets sneaking into commits, schema migrations without rollback, and load-bearing comments that say "TODO: remove" two years later.

The system prompt

The whole agent is the prompt. Everything else is plumbing.

You are a PR review agent for {{REPO_NAME}}.
Your job: produce ONE markdown comment summarising the PR.
For each finding, output: { severity: low|medium|high|critical, file:line, summary, evidence }.

Review priorities (in order):
  1) security: secrets in commits, dependency CVEs, missing authn checks
  2) data integrity: irreversible migrations without rollback, schema drift
  3) error handling: async handlers without try, raised exceptions without logging
  4) load-bearing comments and TODOs older than 6 months

Refuse to review:
  - PRs > 2000 lines (request a split)
  - PRs touching files outside the standard tree (request human eyes)

Output schema (strict JSON):
{ "summary": str, "findings": [Finding], "blocker": bool }
Set blocker=true ONLY if a critical finding exists.
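Strict JSON is only useful if you enforce it before posting the comment. Here's a minimal sketch of a validation step, assuming the response was saved to a local file; the file name, the sample payload, and the `check_blocker` helper are all illustrative, not part of the article's pipeline.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the agent's strict-JSON output before posting it.
# Enforces the prompt's rule: blocker=true ONLY if a critical finding exists.
# File names and the sample payload below are assumptions for illustration.
set -euo pipefail

check_blocker() {
  local review_json="$1"
  local has_critical blocker
  # true/false: does any finding have severity == critical?
  has_critical=$(jq '[.findings[] | select(.severity == "critical")] | length > 0' "$review_json")
  blocker=$(jq '.blocker' "$review_json")
  if [ "$blocker" = "true" ] && [ "$has_critical" = "false" ]; then
    echo "schema violation: blocker set without a critical finding" >&2
    return 1
  fi
  echo "output ok"
}

# Sample payload matching the output schema in the prompt (illustrative)
cat > /tmp/review.json <<'EOF'
{"summary": "adds retry logic", "blocker": false,
 "findings": [{"severity": "medium", "file": "api/handlers.py:88",
               "summary": "async handler without try", "evidence": "diff hunk"}]}
EOF
check_blocker /tmp/review.json
```

A malformed response (the model sets `blocker` without a critical finding, or emits non-JSON that makes `jq` fail) then aborts the run instead of posting a bad review comment.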

The wiring

Three pieces:

name: AI PR Review
on: pull_request
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      # GITHUB_PR_NUMBER is not a default env var; take it from the event payload
      - run: ./scripts/ai_review.sh "${{ github.event.pull_request.number }}"
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
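The article doesn't show `ai_review.sh` itself, so here's a hedged sketch of its front half: enforcing the prompt's ">2000 lines" refusal rule before spending any tokens. The line-counting heuristic, file names, and the commented-out `gh`/`claude` calls are my assumptions, not the author's script.

```shell
#!/usr/bin/env bash
# Sketch of what scripts/ai_review.sh might start with (an assumption —
# the article never shows the script). The 2000-line cap mirrors the
# refusal rule in the system prompt.
set -euo pipefail

# Returns 0 if the diff is small enough to review, 1 otherwise.
# Counts lines starting with + or - (crude: also counts +++/--- headers,
# which is fine for a cap this coarse).
within_cap() {
  local diff_file="$1" cap="${2:-2000}"
  local changed
  changed=$(grep -c '^[+-]' "$diff_file" || true)
  [ "$changed" -le "$cap" ]
}

# Demo on a tiny synthetic diff
printf '+added line\n-removed line\n context\n' > /tmp/pr.diff
if within_cap /tmp/pr.diff; then
  echo "diff within cap"
fi

# The rest of the script would plausibly look like this (hypothetical,
# assuming the GitHub CLI and Claude Code's non-interactive print mode):
# pr_number="$1"
# gh pr diff "$pr_number" > /tmp/pr.diff
# within_cap /tmp/pr.diff || { echo "PR too large: request a split"; exit 0; }
# claude -p "$(cat prompts/review.md /tmp/pr.diff)" > /tmp/review.json
```

Note the guard exits 0 on an oversized PR: the run succeeded, the agent just declined, which keeps CI green while the "request a split" comment does its job.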

What broke first

The first version had no diff size cap. It happily started reviewing a 4,800-line generated migration and burned $3.40 on the response. Cap added. The second version reviewed a renamed file as if it were new code. Rename detection added. The third version flagged the same legitimate TODO twice across two PRs because it had no memory between runs. Memory added: a small JSON file in the repo, .ai-review-known-todos.json, listing TODOs the team has explicitly accepted.
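The memory step is just a set-difference over findings. A minimal sketch with `jq`, assuming the accepted-TODOs file is a flat array of strings and findings carry their TODO text in `summary` (the field names and sample data are my assumptions; only the file name comes from the article):

```shell
#!/usr/bin/env bash
# Sketch: drop findings whose TODO text is already in the accepted list
# (.ai-review-known-todos.json in the article; paths below are illustrative).
set -euo pipefail

# Stand-in for .ai-review-known-todos.json: a flat array of accepted TODOs
cat > /tmp/known.json <<'EOF'
["TODO: remove legacy auth shim"]
EOF

# Stand-in for the agent's findings (field names assumed)
cat > /tmp/findings.json <<'EOF'
[{"severity": "low", "summary": "TODO: remove legacy auth shim"},
 {"severity": "low", "summary": "TODO: delete feature flag"}]
EOF

# Keep only findings whose summary is not in the known list
jq --slurpfile known /tmp/known.json \
   '[ .[] | select(.summary as $s | $known[0] | index($s) | not) ]' \
   /tmp/findings.json
```

Because the file lives in the repo, accepting a TODO is itself a reviewable diff, which is the point of keeping state in git rather than a database.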

Cost reality

Per PR, the agent spends about $0.06 to $0.12 in input + output tokens (a small frontier model with prompt caching enabled). At 312 PRs the bill came in around $24. The first PR it caught — a leaked dev secret destined for a public artifact — would have cost more than that to clean up by itself.
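The numbers check out against each other; a quick awk pass makes the bounds explicit:

```shell
# Sanity-check the cost figures from this section (awk for float math)
awk 'BEGIN {
  prs = 312
  printf "low estimate:  $%.2f\n", prs * 0.06   # -> $18.72
  printf "high estimate: $%.2f\n", prs * 0.12   # -> $37.44
  printf "implied per-PR at $24 total: $%.3f\n", 24 / prs
}'
```

The ~$24 bill lands inside the $18.72–$37.44 range, at roughly $0.077 per PR.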

Where I wouldn't use it

Front-end visual reviews. Architectural reviews. Migrations involving cross-service contracts. Anything where the right move is "go ask three people for context" — that's not a job for any agent yet.

What I'd change next

Right now it reviews from the diff and the changed files. The next move is feeding it the test files associated with the changed code, so it can compare claims in the PR description against tests the PR actually changes. That's the next batch of false negatives I'm chasing.

Stack summary

| Component | Choice | Why |
| --- | --- | --- |
| Model | Claude (frontier tier) | Best long-context reasoning on code I've used |
| Runner | Claude Code CLI in GitHub Actions | Simple shell-out, no SDK indirection |
| Caching | Anthropic prompt caching | Cuts repeat-prompt cost by ~85% across PRs in same repo |
| State | JSON file in repo | Versioned, reviewable, no extra DB |