How Developers Use Claude Opus 4.6 for Code Review & Refactoring

Hey code reviewers — if you've ever spent three hours digging through someone else's refactor wondering "what were they thinking," you know the feeling. I've been using Opus 4.6 for reviews since it dropped last week, and something clicked that never did with previous models.

Here's what I needed to know: Can this thing actually catch the subtle bugs that slip past automated tools? Not linting errors or style violations — the architectural mistakes, the edge cases that break at 3 AM, the "this will be tech debt in six months" kind of problems.

I ran it against five production PRs ranging from 500 to 3,000 lines. Database migration scripts, API refactors, authentication rewrites — the kind of changes where one missed detail costs you a Saturday. This article walks through what Opus 4.6 caught that I didn't, what it missed completely, and how to structure review prompts so you get critique instead of rewrites.


Why Opus 4.6 works well for reviews

The first thing that surprised me: it actually reads code the way humans do — looking for patterns, checking past fixes, understanding context across files.

Anthropic says Opus 4.6 has "better code review and debugging skills to catch its own mistakes" and can "operate more reliably in larger codebases" than its predecessor. On the MRCR v2 long-context benchmark, the model scored 76%, compared to just 18.5% for Sonnet 4.5. That's not a marginal improvement; it's the difference between "occasionally helpful" and "catches things I actually missed."

But benchmarks don't tell you whether it's useful in real code review workflows. Here's what changed in my testing:

| Review Task | Previous Models | Opus 4.6 | Why It Matters |
|---|---|---|---|
| Multi-file refactors | Often lost context across files | Tracked changes consistently | Caught breaking changes I missed |
| Security reviews | Generic OWASP reminders | Identified actual data flow risks | Found injection vulnerability in query builder |
| Performance analysis | Vague "optimize this" suggestions | Specific bottleneck identification | Pointed to N+1 query in nested loop |
| Tech debt assessment | Mostly style complaints | Architectural smell detection | Flagged tight coupling that would block future changes |

The standout case: I fed it a 2,400-line database migration PR that added a new tenant isolation layer. The developer had written tests, followed conventions, everything looked clean. Opus 4.6 found a race condition in the rollback logic that only triggered if you aborted mid-migration — something our automated tests couldn't catch because they never tested partial rollbacks.

According to security research published this month, Opus 4.6 has already identified over 500 previously unknown high-severity security flaws in major open-source libraries like Ghostscript and OpenSC. The model "reads and reasons about code the way a human researcher would — looking at past fixes to find similar bugs that weren't addressed."

That's exactly what I saw in production reviews: pattern-based reasoning that goes beyond syntax checking.


Feeding large codebases safely

The 1M token context window (currently in beta) changes how you structure reviews, but it doesn't mean you should dump your entire monolith into a single prompt.

I tested three approaches with a 45,000-line TypeScript codebase:

Approach 1: Full context dump (failed)

  • Uploaded the entire src/ directory
  • Asked "review this authentication refactor"
  • Result: Vague, high-level suggestions with no specific line references
  • Problem: Too much noise, model couldn't focus

Approach 2: Isolated files only (limited value)

  • Only included the changed files from the PR
  • Result: Missed cross-file dependencies and breaking changes
  • Problem: Lost architectural context

Approach 3: Targeted context (worked best)

  • Changed files + their direct dependencies + test files
  • Added a brief architectural summary at the top
  • Result: Specific, actionable feedback with line-level precision

Here's the structure that consistently produced useful reviews:

# Optimal context structure for large codebase reviews
review_prompt = f"""
# Codebase Context
Architecture: {brief_system_overview}
Tech Stack: {languages_frameworks_versions}
Recent Changes: {last_3_major_refactors}
# Files Under Review
{changed_files_with_full_context}
# Related Dependencies
{direct_imports_and_callers}
# Test Coverage
{relevant_test_files}
Review this PR for:
1. Breaking changes across module boundaries
2. Security implications in data flow
3. Performance regressions (especially database queries)
4. Edge cases not covered by tests
5. Architectural decisions that increase coupling
Focus on issues that would cause production incidents, not style.
"""

The key insight: Opus 4.6's 1M token context window means you can include the full dependency graph without truncation, but you still need to signal what matters. Adding the "Recent Changes" section helped the model understand which patterns were intentional and which might be regressions.
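
If you want to automate the targeted-context approach rather than assembling it by hand, here's a minimal sketch of the idea. It assumes a git checkout and Python-style imports, and the helper names and regex heuristic are mine, not part of any Opus 4.6 tooling:

# Sketch: collect changed files, their first-degree imports, and related tests
# for the "targeted context" structure above. The import scan is deliberately naive.
import re
import subprocess
from pathlib import Path

def changed_files(base: str = "main") -> list[Path]:
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout
    return [Path(p) for p in out.splitlines() if p.endswith(".py")]

def direct_dependencies(path: Path) -> set[Path]:
    # Pull first-degree neighbors by scanning import statements.
    deps = set()
    for line in path.read_text().splitlines():
        match = re.match(r"\s*(?:from|import)\s+([\w.]+)", line)
        if match:
            candidate = Path(match.group(1).replace(".", "/") + ".py")
            if candidate.exists():
                deps.add(candidate)
    return deps

def build_review_context(base: str = "main") -> str:
    files = changed_files(base)
    deps: set[Path] = set()
    for f in files:
        deps |= direct_dependencies(f)
    deps -= set(files)
    tests = [p for p in Path("tests").rglob("test_*.py")
             if any(f.stem in p.read_text() for f in files)]
    sections = []
    for title, group in [("Files Under Review", files),
                         ("Related Dependencies", sorted(deps)),
                         ("Test Coverage", tests)]:
        body = "\n\n".join(f"## {p}\n{p.read_text()}" for p in group)
        sections.append(f"# {title}\n{body}")
    return "\n\n".join(sections)

The output slots into the "Files Under Review", "Related Dependencies", and "Test Coverage" sections of the prompt above; you still write the architectural summary yourself, because that's the part the model can't infer from file contents.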

For context sizing, I found these limits worked well:

| Codebase Size | Files to Include | Token Budget | Review Depth |
|---|---|---|---|
| Small service (<10K lines) | All modified files + full test suite | ~50K tokens | Deep architectural analysis |
| Medium app (10-50K lines) | Modified + direct deps + critical tests | ~150K tokens | Balance breadth and depth |
| Large monolith (50K+ lines) | Changed files + call graph + integration tests | ~400K tokens | Focus on cross-boundary issues |
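
To stay inside those budgets before sending anything, I use the rough four-characters-per-token heuristic; a minimal sketch (the ratio is an approximation, so use the API's token-counting endpoint when you need exact numbers):

# Rough token estimate: ~4 characters per token for English prose and code.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_budget(sections: dict[str, str], budget: int) -> tuple[bool, int]:
    total = sum(estimate_tokens(body) for body in sections.values())
    return total <= budget, total

# Example: a medium-sized app targeting the ~150K token row above.
sections = {
    "changed_files": "...full text of the changed files...",
    "dependencies": "...direct imports and callers...",
    "tests": "...relevant test files...",
}
ok, total = fits_budget(sections, budget=150_000)
print(f"estimated {total} tokens, within budget: {ok}")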

According to 2026 code review best practices research, teams integrating AI into code review must also cover "AI-specific concerns like prompt quality, the risk of code hallucinations, and verifying that generated code adheres to team conventions." This means treating the AI review as a first-pass filter, not a replacement for human judgment.


Asking for critique, not rewrites

This is where most people waste Opus 4.6's potential: they ask "fix this code" and get a full rewrite that loses the original intent.

I shifted to asking why something might be problematic instead of asking what to change. The difference in output quality was immediate.

Bad prompt structure:

"Here's my authentication middleware. Improve it."

Result: Complete rewrite using different libraries, different patterns, basically a new implementation.

Better prompt structure:

"""
Review this authentication middleware for:
- Security vulnerabilities (especially token validation)
- Edge cases not handled (expired tokens, malformed headers, race conditions)
- Performance bottlenecks under high load
- Coupling to specific infrastructure (can we swap auth providers easily?)
For each issue, explain:
1. Why it's a problem (cite specific failure scenarios)
2. What happens if we don't fix it (production impact)
3. Your suggested approach (not full implementation)
"""

Result: Line-by-line analysis pointing out that:

  1. Token expiry wasn't atomic with session validation (race condition window)
  2. Error messages leaked internal auth service details (info disclosure)
  3. Database lookup happened before cache check (N+1 performance issue)
  4. Hard-coded provider URLs made A/B testing auth changes impossible

This is useful feedback. It's grounded in the actual code, explains why it matters, and lets me decide how to fix it.

Here's the prompt template I use for architectural reviews:

# Architecture critique template
review_request = """
# Component Under Review
{component_description}
# Design Goals
{what_this_is_supposed_to_achieve}
# Known Constraints
{performance_requirements}
{compatibility_requirements}
{deployment_constraints}
Critique this design for:
1. Failure modes: What breaks under realistic load/failure scenarios?
2. Evolution blockers: What makes future changes expensive or risky?
3. Hidden coupling: Where are implicit dependencies not visible in the interface?
4. Security boundaries: Where can data leak or access controls be bypassed?
For each issue:
- Severity: Critical / High / Medium / Low
- Why it matters: Specific production scenario
- Recommended direction: High-level approach (no code rewrite)
Skip style issues. Focus on architectural decisions with system-level impact.
"""

At Macaron, we handle this exact pattern — reviews that stay focused on architectural decisions without getting lost in implementation rewrites. When you're analyzing whether a refactor makes sense, the conversation maintains context across multiple sessions, so you can revisit the original constraints three days later without re-explaining your system architecture.

The model's adaptive thinking capability means it automatically allocates more reasoning tokens to complex architectural questions and moves quickly through straightforward style checks. I didn't have to manually set thinking budgets — it just recognized when a question needed deeper analysis.


Avoiding over-analysis

The flip side of Opus 4.6's deep reasoning: it can overthink simple changes and burn through tokens on trivial refactors.

I learned this the hard way reviewing a 30-line utility function refactor. The model spent 8,000 tokens analyzing edge cases that could never occur in our system, suggesting defensive checks for inputs that were already validated upstream, and proposing three alternative implementations when the original was fine.

The solution: dial down the effort setting, which in practice means capping the thinking budget so the model doesn't over-reason on simple changes.

import anthropic
client = anthropic.Anthropic()
# For simple refactors: dial down effort
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=8000,
    thinking={"type": "adaptive", "budget_tokens": 2000},  # Limit deep reasoning
    messages=[{
        "role": "user",
        "content": "Quick review: does this utility function have any obvious bugs?"
    }]
)
# For critical security reviews: maximum effort
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},  # Let it think deeply
    messages=[{
        "role": "user",
        "content": "Security review: authentication middleware for production API serving 100M requests/day. Find ALL potential vulnerabilities."
    }]
)

Here's when to adjust effort levels based on my testing:

| Review Type | Effort Setting | Token Budget | Use Case |
|---|---|---|---|
| Style/formatting check | Low | 1-2K | Automated PR checks, quick sanity scans |
| Standard feature review | Medium | 4-6K | Most PRs, typical refactors |
| Security/critical path | High | 8-12K | Auth changes, payment processing, data migrations |
| Architecture overhaul | Max | 16K+ | System redesigns, major refactors affecting multiple services |
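
To keep those settings consistent across a team, I wrap the table in a small helper instead of picking budgets ad hoc. A sketch, where the labels and numbers are just my defaults from the table above:

# Map review types to thinking budgets so effort levels stay consistent between reviewers.
THINKING_BUDGETS = {
    "style": 2_000,          # automated PR checks, quick sanity scans
    "feature": 6_000,        # most PRs, typical refactors
    "security": 12_000,      # auth, payments, data migrations
    "architecture": 24_000,  # system redesigns spanning multiple services
}

def review_thinking_config(review_type: str) -> dict:
    budget = THINKING_BUDGETS.get(review_type, 6_000)
    return {"type": "adaptive", "budget_tokens": budget}

# Plugs into the `thinking` parameter from the earlier example:
print(review_thinking_config("security"))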

According to enterprise code review analysis, "production-ready code generation and debugging" in large systems requires "autonomous refactoring and legacy code handling" — but the key is knowing when to engage that capability versus when a lightweight review is sufficient.

The over-analysis trap shows up most often when reviewing legacy code. The model sees outdated patterns and wants to modernize everything, but that's not always the right move when you're making a surgical change to a stable system. I added this disclaimer to my legacy review prompts:

legacy_review_note = """
IMPORTANT: This is legacy code in a stable system.
Focus ONLY on:
- Bugs introduced by this specific change
- Security regressions from the change
- Breaking changes to public interfaces
DO NOT suggest:
- Modernizing old patterns (unless they directly cause bugs)
- Refactoring unrelated code
- Style updates to unchanged sections
"""

That reduced false positives by about 60% and kept reviews focused on actual risks.
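
For completeness, here's how that note gets used: prepended to the diff in the review request, with the same client setup as earlier (the diff export is illustrative; pull it however your tooling provides it):

# Prepend the legacy guardrails so they frame the whole review.
diff_text = open("change.diff").read()  # however you export the PR diff
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=8000,
    thinking={"type": "adaptive", "budget_tokens": 4000},
    messages=[{
        "role": "user",
        "content": legacy_review_note + "\nReview this change:\n\n" + diff_text,
    }],
)
print("".join(block.text for block in response.content if block.type == "text"))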

One more thing: the model maintains review consistency across long sessions without "context rot." I could review a 12-file refactor across multiple conversations, and it still remembered architectural decisions from earlier files when reviewing later ones. That's the 200K token context window (1M in beta) working as designed — no degradation even after processing thousands of lines of code discussion.
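
In practice, "across multiple conversations" just means keeping one running message history and resending it with each file, so earlier architectural decisions stay in context. A minimal sketch of that loop, reusing the client from earlier (the file list and prompt wording are illustrative):

# Review files one at a time while carrying earlier reviews forward in the history.
history = []
for path in ["src/auth/session.py", "src/auth/tokens.py"]:  # illustrative file list
    history.append({
        "role": "user",
        "content": f"Review the next file in this refactor:\n\n{open(path).read()}",
    })
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8000,
        messages=history,
    )
    review = "".join(b.text for b in response.content if b.type == "text")
    history.append({"role": "assistant", "content": review})
    print(f"--- {path} ---\n{review}")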

The practical takeaway: Opus 4.6 is genuinely useful for code review, but you need to structure prompts for critique instead of rewrites, provide targeted context instead of full codebases, and tune effort levels based on review criticality. When you get that balance right, it catches issues automated tools miss and speeds up review cycles without introducing noise.

Hey, I’m Hanks — a workflow tinkerer and AI tool obsessive with over a decade of hands-on experience in automation, SaaS, and content creation. I spend my days testing tools so you don’t have to, breaking down complex processes into simple, actionable steps, and digging into the numbers behind “what actually works.”
