I spent three months integrating an LLM-powered code review step into a production CI/CD pipeline. Here's an honest account of what worked, what didn't, and what I'd do differently.
The Setup
The goal was simple: on every pull request, run an automated review that catches common issues before human reviewers spend time on them. Logic errors, security anti-patterns, performance issues, test coverage gaps. The kinds of things that junior engineers miss and senior engineers get tired of catching.
What Worked Well
LLMs are surprisingly good at catching certain classes of bugs — off-by-one errors, missing null checks, inconsistent error handling, and insecure string formatting. They also excel at identifying code that "works but is wrong" in subtle ways that static analysis misses.
LLMs can catch bugs that compilers can't, because they look at intent, not just syntax and types.
For ML-specific code — which was much of our codebase — the LLM reviewer was particularly sharp. It caught training-serving skew risks, identified places where we were leaking label information into features, and flagged incorrect tensor shape assumptions that would only manifest at runtime.
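To make the label-leakage case concrete, here is a minimal sketch of the kind of pattern such a reviewer might flag. The toy rows and the `target_encode` helper are hypothetical illustrations, not code from the project described here.

```python
from statistics import mean

# Hypothetical toy rows of (category, label); not data from the real pipeline.
train = [("a", 1), ("a", 0), ("b", 1)]
test = [("a", 1), ("b", 0)]


def target_encode(rows_for_stats, rows_to_encode):
    """Replace each category with the mean label observed in rows_for_stats."""
    labels_by_cat = {}
    for cat, label in rows_for_stats:
        labels_by_cat.setdefault(cat, []).append(label)
    means = {cat: mean(labels) for cat, labels in labels_by_cat.items()}
    return [means[cat] for cat, _ in rows_to_encode]


# Leaky: the encoding statistics include the test rows' own labels,
# so the feature carries information about the target it is meant to predict.
leaky = target_encode(train + test, test)

# Safe: statistics are computed from training rows only.
safe = target_encode(train, test)

print("leaky:", leaky)
print("safe: ", safe)
```

Both versions run without error and produce plausible-looking features, which is why this class of bug tends to slip past tests and static analysis.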
What Failed
Two failure modes dominated:
- False positives on domain-specific patterns. The model would flag code as suspicious when it was intentional — optimized loops, unconventional but correct signal processing idioms, deliberate type coercions. This created noise that engineers started ignoring.
- Hallucinated context. When the model didn't have access to the full codebase, it would sometimes suggest changes that were correct in isolation but broke contracts defined elsewhere. Context window limits are real.
What I'd Do Differently
The biggest lesson: treat AI code review as a specialist, not a generalist. Instead of asking the model to review "everything," route specific concern types to it: security, ML-specific patterns, API contracts. Let it be excellent at those, and don't ask it to replace human judgment on architectural decisions.
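One way to sketch that routing, assuming concerns are selected by which files a pull request touches. The `ROUTES` table, its glob patterns, and `concerns_for` are hypothetical names chosen for illustration; a real setup might route on labels, owners, or diff content instead.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: concern name -> file patterns that trigger it.
ROUTES = {
    "security": ["*/auth/*", "*.sql", "*/handlers/*"],
    "ml_pipeline": ["*/training/*", "*/features/*"],
    "api_contract": ["*/api/*", "*.proto"],
}


def concerns_for(changed_files):
    """Return the sorted concern names whose patterns match any changed file."""
    matched = set()
    for concern, patterns in ROUTES.items():
        if any(fnmatchcase(path, pat) for path in changed_files for pat in patterns):
            matched.add(concern)
    return sorted(matched)


selected = concerns_for(["src/training/loop.py", "src/api/routes.py", "README.md"])
print(selected)
```

A pull request that matches no route gets no LLM review at all, which is the point: the model is only asked about the concern types it has been assigned.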
# Structured review prompt (better than "review this code")
# Assumes `diff` already holds the pull request's diff text as a string.
prompt = f"""
You are reviewing ML pipeline code for the following specific concerns only:
1. Training-serving skew risks
2. Data leakage between train/test splits
3. Incorrect use of random seeds
Code diff:
{diff}
For each concern, respond with: [FOUND / NONE] and brief explanation.
"""
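A structured response format also makes the output machine-readable. The sketch below assumes the model answers with one numbered line per concern, such as `2. [FOUND] explanation`. That exact shape is an assumption, not something the prompt guarantees, so the hypothetical `parse_review` helper skips any line that does not match.

```python
import re

# Assumed response shape: "<number>. [FOUND] explanation" or "<number>. NONE".
# Brackets and a trailing ":" or "-" are tolerated; other lines are ignored.
LINE_RE = re.compile(r"^\s*(\d+)\.\s*\[?\s*(FOUND|NONE)\b\s*\]?\s*[:\-]?\s*(.*)$")


def parse_review(response: str) -> dict[int, tuple[bool, str]]:
    """Map concern number -> (found, explanation) for every line that matches."""
    results = {}
    for line in response.splitlines():
        m = LINE_RE.match(line)
        if m:
            results[int(m.group(1))] = (m.group(2) == "FOUND", m.group(3).strip())
    return results


# Hypothetical model output, written by hand for this example.
sample = """1. [NONE] Preprocessing is shared between training and serving.
2. [FOUND] StandardScaler is fit before the train/test split.
3. NONE"""

parsed = parse_review(sample)
flagged = [number for number, (found, _) in parsed.items() if found]
print(flagged)
```

If a concern number is missing from the parsed result, it is safer to treat that as "the reviewer did not answer" than as NONE.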
The Right Mental Model
AI code review is best thought of as a fast, tireless, but fallible first pass. It raises flags that a human reviewer then triages. It should reduce cognitive load on reviewers, not replace their judgment. Treat its output like you'd treat a junior colleague's review — valuable, but always verify.
The teams that get the most value from AI-assisted review are the ones who've been precise about what they're asking the AI to do. Specificity beats generality, every time.