How Developers Use Claude Opus 4.6 for Code Review & Refactoring

Hey code reviewers — if you've ever spent three hours digging through someone else's refactor wondering "what were they thinking," you know the feeling. I've been using Opus 4.6 for reviews since it dropped last week, and something clicked that never did with previous models.

Here's what I needed to know: Can this thing actually catch the subtle bugs that slip past automated tools? Not linting errors or style violations — the architectural mistakes, the edge cases that break at 3 AM, the "this will be tech debt in six months" kind of problems.

I ran it against five production PRs ranging from 500 to 3,000 lines. Database migration scripts, API refactors, authentication rewrites — the kind of changes where one missed detail costs you a Saturday. This article walks through what Opus 4.6 caught that I didn't, what it missed completely, and how to structure review prompts so you get critique instead of rewrites.


Why Opus 4.6 works well for reviews

The first thing that surprised me: it actually reads code the way humans do — looking for patterns, checking past fixes, understanding context across files.

Anthropic says Opus 4.6 has "better code review and debugging skills to catch its own mistakes" and can "operate more reliably in larger codebases" than its predecessor. On the MRCR v2 long-context benchmark, the model scored 76%, compared to just 18.5% for Sonnet 4.5. That's not a marginal improvement; it's the difference between "occasionally helpful" and "catches things I actually missed."

But benchmarks don't tell you whether it's useful in real code review workflows. Here's what changed in my testing:

| Review Task | Previous Models | Opus 4.6 | Why It Matters |
|---|---|---|---|
| Multi-file refactors | Often lost context across files | Tracked changes consistently | Caught breaking changes I missed |
| Security reviews | Generic OWASP reminders | Identified actual data flow risks | Found injection vulnerability in query builder |
| Performance analysis | Vague "optimize this" suggestions | Specific bottleneck identification | Pointed to N+1 query in nested loop |
| Tech debt assessment | Mostly style complaints | Architectural smell detection | Flagged tight coupling that would block future changes |

The standout case: I fed it a 2,400-line database migration PR that added a new tenant isolation layer. The developer had written tests, followed conventions, everything looked clean. Opus 4.6 found a race condition in the rollback logic that only triggered if you aborted mid-migration — something our automated tests couldn't catch because they never tested partial rollbacks.

According to security research published this month, Opus 4.6 has already identified over 500 previously unknown high-severity security flaws in major open-source libraries like Ghostscript and OpenSC. The model "reads and reasons about code the way a human researcher would — looking at past fixes to find similar bugs that weren't addressed."

That's exactly what I saw in production reviews: pattern-based reasoning that goes beyond syntax checking.


Feeding large codebases safely

The 1M token context window (currently in beta) changes how you structure reviews, but it doesn't mean you should dump your entire monolith into a single prompt.

I tested three approaches with a 45,000-line TypeScript codebase:

Approach 1: Full context dump (failed)

  • Uploaded the entire src/ directory
  • Asked "review this authentication refactor"
  • Result: Vague, high-level suggestions with no specific line references
  • Problem: Too much noise, model couldn't focus

Approach 2: Isolated files only (limited value)

  • Only included the changed files from the PR
  • Result: Missed cross-file dependencies and breaking changes
  • Problem: Lost architectural context

Approach 3: Targeted context (worked best)

  • Changed files + their direct dependencies + test files
  • Added a brief architectural summary at the top
  • Result: Specific, actionable feedback with line-level precision

Here's the structure that consistently produced useful reviews:

# Optimal context structure for large codebase reviews
review_prompt = f"""
# Codebase Context
Architecture: {brief_system_overview}
Tech Stack: {languages_frameworks_versions}
Recent Changes: {last_3_major_refactors}
# Files Under Review
{changed_files_with_full_context}
# Related Dependencies
{direct_imports_and_callers}
# Test Coverage
{relevant_test_files}
Review this PR for:
1. Breaking changes across module boundaries
2. Security implications in data flow
3. Performance regressions (especially database queries)
4. Edge cases not covered by tests
5. Architectural decisions that increase coupling
Focus on issues that would cause production incidents, not style.
"""

The key insight: Opus 4.6's 1M token context window means you can include the full dependency graph without truncation, but you still need to signal what matters. Adding the "Recent Changes" section helped the model understand which patterns were intentional and which might be regressions.
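
If you want to automate the targeted-context approach rather than assembling it by hand, here's a minimal sketch of the idea. It assumes a git checkout and Python-style imports, and the helper names and regex heuristic are mine, not part of any Opus 4.6 tooling:

# Sketch: collect changed files, their first-degree imports, and related tests
# for the "targeted context" structure above. The import scan is deliberately naive.
import re
import subprocess
from pathlib import Path

def changed_files(base: str = "main") -> list[Path]:
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout
    return [Path(p) for p in out.splitlines() if p.endswith(".py")]

def direct_dependencies(path: Path) -> set[Path]:
    # Pull first-degree neighbors by scanning import statements.
    deps = set()
    for line in path.read_text().splitlines():
        match = re.match(r"\s*(?:from|import)\s+([\w.]+)", line)
        if match:
            candidate = Path(match.group(1).replace(".", "/") + ".py")
            if candidate.exists():
                deps.add(candidate)
    return deps

def build_review_context(base: str = "main") -> str:
    files = changed_files(base)
    deps: set[Path] = set()
    for f in files:
        deps |= direct_dependencies(f)
    deps -= set(files)
    tests = [p for p in Path("tests").rglob("test_*.py")
             if any(f.stem in p.read_text() for f in files)]
    sections = []
    for title, group in [("Files Under Review", files),
                         ("Related Dependencies", sorted(deps)),
                         ("Test Coverage", tests)]:
        body = "\n\n".join(f"## {p}\n{p.read_text()}" for p in group)
        sections.append(f"# {title}\n{body}")
    return "\n\n".join(sections)

The output slots into the "Files Under Review", "Related Dependencies", and "Test Coverage" sections of the prompt above; you still write the architectural summary yourself, because that's the part the model can't infer from file contents.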

For context sizing, I found these limits worked well:

| Codebase Size | Files to Include | Token Budget | Review Depth |
|---|---|---|---|
| Small service (<10K lines) | All modified files + full test suite | ~50K tokens | Deep architectural analysis |
| Medium app (10-50K lines) | Modified + direct deps + critical tests | ~150K tokens | Balance breadth and depth |
| Large monolith (50K+ lines) | Changed files + call graph + integration tests | ~400K tokens | Focus on cross-boundary issues |
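
To stay inside those budgets before sending anything, I use the rough four-characters-per-token heuristic; a minimal sketch (the ratio is an approximation, so use the API's token-counting endpoint when you need exact numbers):

# Rough token estimate: ~4 characters per token for English prose and code.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_budget(sections: dict[str, str], budget: int) -> tuple[bool, int]:
    total = sum(estimate_tokens(body) for body in sections.values())
    return total <= budget, total

# Example: a medium-sized app targeting the ~150K token row above.
sections = {
    "changed_files": "...full text of the changed files...",
    "dependencies": "...direct imports and callers...",
    "tests": "...relevant test files...",
}
ok, total = fits_budget(sections, budget=150_000)
print(f"estimated {total} tokens, within budget: {ok}")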

According to 2026 code review best practices research, teams integrating AI into code review must also cover "AI-specific concerns like prompt quality, the risk of code hallucinations, and verifying that generated code adheres to team conventions." This means treating the AI review as a first-pass filter, not a replacement for human judgment.


Asking for critique, not rewrites

This is where most people waste Opus 4.6's potential: they ask "fix this code" and get a full rewrite that loses the original intent.

I shifted to asking why something might be problematic instead of asking what to change. The difference in output quality was immediate.

Bad prompt structure:

"Here's my authentication middleware. Improve it."

Result: Complete rewrite using different libraries, different patterns, basically a new implementation.

Better prompt structure:

"""
Review this authentication middleware for:
- Security vulnerabilities (especially token validation)
- Edge cases not handled (expired tokens, malformed headers, race conditions)
- Performance bottlenecks under high load
- Coupling to specific infrastructure (can we swap auth providers easily?)
For each issue, explain:
1. Why it's a problem (cite specific failure scenarios)
2. What happens if we don't fix it (production impact)
3. Your suggested approach (not full implementation)
"""

Result: Line-by-line analysis pointing out that:

  1. Token expiry wasn't atomic with session validation (race condition window)
  2. Error messages leaked internal auth service details (info disclosure)
  3. Database lookup happened before cache check (N+1 performance issue)
  4. Hard-coded provider URLs made A/B testing auth changes impossible

This is useful feedback. It's grounded in the actual code, explains why it matters, and lets me decide how to fix it.

Here's the prompt template I use for architectural reviews:

# Architecture critique template
review_request = """
# Component Under Review
{component_description}
# Design Goals
{what_this_is_supposed_to_achieve}
# Known Constraints
{performance_requirements}
{compatibility_requirements}
{deployment_constraints}
Critique this design for:
1. Failure modes: What breaks under realistic load/failure scenarios?
2. Evolution blockers: What makes future changes expensive or risky?
3. Hidden coupling: Where are implicit dependencies not visible in the interface?
4. Security boundaries: Where can data leak or access controls be bypassed?
For each issue:
- Severity: Critical / High / Medium / Low
- Why it matters: Specific production scenario
- Recommended direction: High-level approach (no code rewrite)
Skip style issues. Focus on architectural decisions with system-level impact.
"""

At Macaron, we handle this exact pattern — reviews that stay focused on architectural decisions without getting lost in implementation rewrites. When you're analyzing whether a refactor makes sense, the conversation maintains context across multiple sessions, so you can revisit the original constraints three days later without re-explaining your system architecture.

The model's adaptive thinking capability means it automatically allocates more reasoning tokens to complex architectural questions and moves quickly through straightforward style checks. I didn't have to manually set thinking budgets — it just recognized when a question needed deeper analysis.


Avoiding over-analysis

The flip side of Opus 4.6's deep reasoning: it can overthink simple changes and burn through tokens on trivial refactors.

I learned this the hard way reviewing a 30-line utility function refactor. The model spent 8,000 tokens analyzing edge cases that could never occur in our system, suggesting defensive checks for inputs that were already validated upstream, and proposing three alternative implementations when the original was fine.

The solution: dial down the effort setting, which in practice means capping the thinking budget so the model doesn't over-reason on simple changes.

import anthropic
client = anthropic.Anthropic()
# For simple refactors: dial down effort
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=8000,
    thinking={"type": "adaptive", "budget_tokens": 2000},  # Limit deep reasoning
    messages=[{
        "role": "user",
        "content": "Quick review: does this utility function have any obvious bugs?"
    }]
)
# For critical security reviews: maximum effort
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},  # Let it think deeply
    messages=[{
        "role": "user",
        "content": "Security review: authentication middleware for production API serving 100M requests/day. Find ALL potential vulnerabilities."
    }]
)

Here's when to adjust effort levels based on my testing:

| Review Type | Effort Setting | Token Budget | Use Case |
|---|---|---|---|
| Style/formatting check | Low | 1-2K | Automated PR checks, quick sanity scans |
| Standard feature review | Medium | 4-6K | Most PRs, typical refactors |
| Security/critical path | High | 8-12K | Auth changes, payment processing, data migrations |
| Architecture overhaul | Max | 16K+ | System redesigns, major refactors affecting multiple services |
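
To keep those settings consistent across a team, I wrap the table in a small helper instead of picking budgets ad hoc. A sketch, where the labels and numbers are just my defaults from the table above:

# Map review types to thinking budgets so effort levels stay consistent between reviewers.
THINKING_BUDGETS = {
    "style": 2_000,          # automated PR checks, quick sanity scans
    "feature": 6_000,        # most PRs, typical refactors
    "security": 12_000,      # auth, payments, data migrations
    "architecture": 24_000,  # system redesigns spanning multiple services
}

def review_thinking_config(review_type: str) -> dict:
    budget = THINKING_BUDGETS.get(review_type, 6_000)
    return {"type": "adaptive", "budget_tokens": budget}

# Plugs into the `thinking` parameter from the earlier example:
print(review_thinking_config("security"))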

According to enterprise code review analysis, "production-ready code generation and debugging" in large systems requires "autonomous refactoring and legacy code handling" — but the key is knowing when to engage that capability versus when a lightweight review is sufficient.

The over-analysis trap shows up most often when reviewing legacy code. The model sees outdated patterns and wants to modernize everything, but that's not always the right move when you're making a surgical change to a stable system. I added this disclaimer to my legacy review prompts:

legacy_review_note = """
IMPORTANT: This is legacy code in a stable system.
Focus ONLY on:
- Bugs introduced by this specific change
- Security regressions from the change
- Breaking changes to public interfaces
DO NOT suggest:
- Modernizing old patterns (unless they directly cause bugs)
- Refactoring unrelated code
- Style updates to unchanged sections
"""

That reduced false positives by about 60% and kept reviews focused on actual risks.
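
For completeness, here's how that note gets used: prepended to the diff in the review request, with the same client setup as earlier (the diff export is illustrative; pull it however your tooling provides it):

# Prepend the legacy guardrails so they frame the whole review.
diff_text = open("change.diff").read()  # however you export the PR diff
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=8000,
    thinking={"type": "adaptive", "budget_tokens": 4000},
    messages=[{
        "role": "user",
        "content": legacy_review_note + "\nReview this change:\n\n" + diff_text,
    }],
)
print("".join(block.text for block in response.content if block.type == "text"))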

One more thing: the model maintains review consistency across long sessions without "context rot." I could review a 12-file refactor across multiple conversations, and it still remembered architectural decisions from earlier files when reviewing later ones. That's the 200K token context window (1M in beta) working as designed — no degradation even after processing thousands of lines of code discussion.
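
In practice, "across multiple conversations" just means keeping one running message history and resending it with each file, so earlier architectural decisions stay in context. A minimal sketch of that loop, reusing the client from earlier (the file list and prompt wording are illustrative):

# Review files one at a time while carrying earlier reviews forward in the history.
history = []
for path in ["src/auth/session.py", "src/auth/tokens.py"]:  # illustrative file list
    history.append({
        "role": "user",
        "content": f"Review the next file in this refactor:\n\n{open(path).read()}",
    })
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8000,
        messages=history,
    )
    review = "".join(b.text for b in response.content if b.type == "text")
    history.append({"role": "assistant", "content": review})
    print(f"--- {path} ---\n{review}")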

The practical takeaway: Opus 4.6 is genuinely useful for code review, but you need to structure prompts for critique instead of rewrites, provide targeted context instead of full codebases, and tune effort levels based on review criticality. When you get that balance right, it catches issues automated tools miss and speeds up review cycles without introducing noise.

Hey, I’m Hanks — a workflow tinkerer and AI tool obsessive with over a decade of hands-on experience in automation, SaaS, and content creation. I spend my days testing tools so you don’t have to, breaking down complex processes into simple, actionable steps, and digging into the numbers behind “what actually works.”
