I spent three months integrating an LLM-powered code review step into a production CI/CD pipeline. Here's an honest account of what worked, what didn't, and what I'd do differently.

The Setup

The goal was simple: on every pull request, run an automated review that catches common issues before human reviewers spend time on them. That meant logic errors, security anti-patterns, performance issues, and test coverage gaps: the kind of things that junior engineers miss and senior engineers get tired of catching.

What Worked Well

LLMs are surprisingly good at catching certain classes of bugs: off-by-one errors, missing null checks, inconsistent error handling, and insecure string formatting (such as queries or commands built by interpolating untrusted input). They also excel at identifying code that "works but is wrong" in subtle ways that static analysis misses.

LLMs can catch bugs that compilers can't, because they compare what the code does against what it appears to be meant to do, not just whether it is syntactically valid.

For ML-specific code — which was much of our codebase — the LLM reviewer was particularly sharp. It caught training-serving skew risks, identified places where we were leaking label information into features, and flagged incorrect tensor shape assumptions that would only manifest at runtime.
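To make the label-leakage case concrete, here is a minimal sketch of one common form of it: target-mean encoding computed over rows that include the row being encoded. This is an illustrative pattern, not code lifted from our codebase; the function names, the toy data, the two-fold split, and the 0.5 prior are all assumptions picked for the example.

```python
from collections import defaultdict


def _category_means(categories, labels):
    """Mean label per category over the rows given."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def leaky_target_encoding(categories, labels):
    """Leaky: each row's feature is a mean that includes its own label."""
    means = _category_means(categories, labels)
    return [means[c] for c in categories]


def out_of_fold_target_encoding(categories, labels, n_folds=2, prior=0.5):
    """Each row is encoded using only labels from the other folds."""
    encoded = [prior] * len(labels)
    for fold in range(n_folds):
        held_out = [i for i in range(len(labels)) if i % n_folds == fold]
        rest = [i for i in range(len(labels)) if i % n_folds != fold]
        means = _category_means(
            [categories[i] for i in rest], [labels[i] for i in rest]
        )
        for i in held_out:
            # Categories unseen in the other folds fall back to the prior.
            encoded[i] = means.get(categories[i], prior)
    return encoded


# Toy data (hypothetical): "b" and "c" each appear exactly once.
categories = ["a", "b", "a", "c", "a", "a"]
labels = [1, 0, 0, 1, 1, 0]

leaky = leaky_target_encoding(categories, labels)
safe = out_of_fold_target_encoding(categories, labels)
```

In the leaky version, any category that appears only once gets a feature value equal to its own label. Offline metrics look great, and nothing fails until the model meets data where the label is not available to peek at.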

What Failed

Two failure modes dominated:

What I'd Do Differently

The biggest lesson: treat AI code review as a specialist, not a generalist. Instead of asking the model to review "everything," route specific concern types to it: security, ML-specific patterns, API contracts. Let it be excellent at those, and don't ask it to replace human judgment on architectural decisions.
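One way to picture the routing is a small table from changed paths to concern types. The sketch below is hypothetical: the glob patterns, the concern names, and `concerns_for` are placeholders for illustration, not our actual configuration.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: glob pattern -> concern types to review for.
ROUTES = [
    ("*/auth/*", ["security"]),
    ("*/pipelines/*", ["ml_patterns"]),
    ("*/features/*", ["ml_patterns"]),
    ("*/api/*", ["api_contracts", "security"]),
]


def concerns_for(changed_files):
    """Return the sorted concern types triggered by the changed paths."""
    concerns = set()
    for path in changed_files:
        for pattern, names in ROUTES:
            # fnmatchcase is case-sensitive on every platform, so the
            # routing behaves the same on a laptop and on a CI runner.
            if fnmatchcase(path, pattern):
                concerns.update(names)
    return sorted(concerns)


routed = concerns_for(["src/auth/login.py", "src/api/routes.py"])
unrouted = concerns_for(["docs/README.md"])
```

A pull request that touches nothing in the table gets an empty list, which can simply mean "skip the AI review for this PR" rather than falling back to a generic "review everything" prompt.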

# Structured review prompt (better than "review this code")
def build_review_prompt(diff: str) -> str:
    # `diff` is the unified diff text of the pull request under review.
    return f"""
You are reviewing ML pipeline code for the following specific concerns only:
1. Training-serving skew risks
2. Data leakage between train/test splits
3. Incorrect use of random seeds

Code diff:
{diff}

For each concern, respond with: [FOUND / NONE] and a brief explanation.
"""
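A fixed response format only pays off if something downstream checks it. Here is a minimal sketch of a parser for the [FOUND / NONE] format, assuming the model answers with one numbered line per concern. The regex, `parse_review`, and the sample reply are illustrative assumptions, and real replies will sometimes drift from the format.

```python
import re

# Matches lines such as "1. [FOUND] explanation" or "3. NONE: explanation".
LINE_RE = re.compile(
    r"^\s*(\d+)\.\s*\[?(FOUND|NONE)\]?\s*[:\-]?\s*(.*)$", re.IGNORECASE
)


def parse_review(reply, n_concerns=3):
    """Return (findings, missing).

    findings maps concern number -> (verdict, explanation).
    missing lists concern numbers with no parseable verdict.
    """
    findings = {}
    for line in reply.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # Ignore preamble, sign-offs, and other free text.
        number = int(match.group(1))
        if 1 <= number <= n_concerns:
            findings[number] = (match.group(2).upper(), match.group(3).strip())
    missing = [n for n in range(1, n_concerns + 1) if n not in findings]
    return findings, missing


# Hypothetical model reply, for illustration only.
sample_reply = """1. [FOUND] Scaler is re-fit inside the serving path.
2. [NONE]
3. FOUND: seed is set after the shuffle, not before.
Let me know if you want more detail."""

findings, missing = parse_review(sample_reply)
```

Anything that lands in `missing` is best treated as "unknown" rather than "clean", and passed to a human reviewer along with the FOUND items.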

The Right Mental Model

AI code review is best thought of as a fast, tireless, but fallible first pass. It raises flags that a human reviewer then triages. It should reduce cognitive load on reviewers, not replace their judgment. Treat its output like you'd treat a junior colleague's review — valuable, but always verify.

The teams that get the most value from AI-assisted review are the ones who've been precise about what they're asking the AI to do. Specificity beats generality, every time.