
Hey fellow model-evaluators — if you spent the past year watching the frontier move in slow motion and then February 2026 happened all at once: same.
Within two weeks, Anthropic shipped Claude Opus 4.6 (February 5), Claude Sonnet 4.6 (February 17), and Google dropped Gemini 3.1 Pro (February 19). I've been running this comparison for the past few days, and the honest answer is more interesting than any single headline number suggests.
I'm Hanks. I test AI tools in real workflows. Here's what the data actually shows — and where each model earns its spot.

No buried conclusion here. If you want the decision framework upfront:
The headline: Gemini 3.1 Pro wins on raw reasoning benchmarks and costs less than half as much. Claude Opus 4.6 wins on the benchmarks that reflect real human work preferences and agentic task reliability. Neither dominates cleanly.
ARC-AGI-2 is the benchmark that's hardest to game. It tests novel pattern recognition — abstract visual puzzles the model hasn't seen in training, where memorization doesn't help. The score measures genuine generalization ability.
Gemini 3.1 Pro: 77.1%. Claude Opus 4.6: 68.8%. That's a genuine 8.3-point gap — and as VentureBeat noted in their first-look coverage, Gemini's score represents more than double its predecessor's performance in a single generation.
For context, both of these scores are remarkable. Opus 4.6 nearly doubled Opus 4.5's score (37.6% → 68.8%) in its own generation — one of the largest single-version reasoning leaps Anthropic has shipped. Gemini then came out two weeks later and leapfrogged it.
What does the ARC-AGI-2 gap mean for real workflows? If your work involves novel multi-step logic problems — complex debugging, scientific analysis, mathematical reasoning, research synthesis — Gemini 3.1 Pro's reasoning advantage is likely to show up. If your work is more pattern-matching on familiar domains (standard software development, document analysis, business writing), the gap probably won't be visible in daily use.
The rest of the reasoning benchmark picture mostly follows the same pattern, with one reversal worth noting: Humanity's Last Exam with tools flips the result. Claude Opus 4.6 scores 53.1% vs Gemini's 51.4% when both have access to external search. That's a small but real signal about how Opus uses tools in assisted reasoning contexts.

SWE-Bench Verified tests real GitHub issue resolution — writing actual patches across open-source repositories. The two models land within 0.2 points of each other. Statistically, this is a tie. As noted by claude5.com's benchmarking analysis, the entire frontier has hit a soft ceiling around 80% on SWE-Bench — all three top models cluster within one percentage point.
Where they diverge is Terminal-Bench 2.0, which tests CLI-native coding tasks. Claude Opus 4.6 scores 65.4% here. Gemini 3.1 Pro doesn't have an official Terminal-Bench score, but third-party evaluations place it lower than Opus on terminal-specific tasks.
For competitive programming and algorithmic coding, the picture flips. Gemini 3.1 Pro's LiveCodeBench Pro Elo is 2887 — significantly ahead of GPT-5.2's 2393. Opus's LiveCodeBench score wasn't independently published in the same format. Gemini appears to lead on purely algorithmic, competitive-style programming problems.
Bottom line on coding: if your workflow is software engineering on real GitHub-style tasks, they're equivalent. If you work heavily in terminal environments, Opus edges ahead. If you're doing competitive programming or algorithmic research, Gemini pulls ahead.

This is the benchmark I keep coming back to. GDPval-AA measures performance on economically valuable knowledge work — tasks across 44 professional occupations including finance, legal, strategy, and business documentation. It correlates more directly with enterprise usefulness than abstract reasoning benchmarks.
Claude Opus 4.6 leads Gemini 3.1 Pro by 289 Elo points here — substantial by Elo standards. Anthropic's official release notes confirm Opus 4.6 also outperforms GPT-5.2 by 144 Elo points on this metric, making it the clear leader for professional knowledge work.
What drives this gap? Part of it is writing quality and instruction-following. As Interesting Engineering noted in their Arena analysis, Claude Opus 4.6 leads Gemini 3.1 Pro by around 4 points on Arena's text preference leaderboard and also leads on Arena coding preference. Arena rankings rely on human voting — real users comparing real outputs. That signal matters.
MarkTechPost's analysis put it directly: the GDPval gap is "a critical vulnerability in Gemini 3.1 Pro's architecture" for expert real-world task synthesis. Benchmark tests and human preference tests are measuring different things — and Gemini leads one while Claude leads the other.

This is where Gemini 3.1 Pro has a structural, not marginal, advantage.
Gemini 3.1 Pro ships a 1M-token context window as standard for every API user; Claude Opus 4.6 defaults to 200K, with 1M in beta for Tier 4 organizations. For most teams evaluating Claude today, the effective comparison is therefore 1M vs 200K — a 5x difference. Claude's 1M context is on the roadmap and working for specific enterprise accounts, but it's not the default experience you'll get when you start building.
What does 1M tokens actually let you do? Concretely: whole codebases analyzed in a single request, book-length documents processed without chunking, and multi-source data synthesis without bolting on a retrieval pipeline.
Gemini also scores 76% on MRCR v2 (long-context needle-in-a-haystack retrieval across 1M tokens) — strong evidence the model actually uses the context it receives, not just accepts it. This matters. A large context window is only useful if retrieval across it is reliable.
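What an MRCR-style check actually measures is easy to sketch. The toy harness below plants known facts ("needles") at random depths in a large synthetic context and measures recall. Everything here is invented for illustration: `ask_model` is a stub standing in for a real API call, and the filler/needle line format is arbitrary.

```python
# Toy sketch of an MRCR-style long-context retrieval check: plant facts
# at known depths in a large context, then ask for each one back.

import random

def build_haystack(needles, filler_lines=10_000, seed=0):
    """Scatter needle sentences at random depths among filler lines."""
    rng = random.Random(seed)
    lines = [f"Filler line {i}: nothing to see here." for i in range(filler_lines)]
    positions = {}
    for key, fact in needles.items():
        pos = rng.randrange(len(lines))
        lines[pos] = f"FACT[{key}]: {fact}"
        positions[key] = pos
    return "\n".join(lines), positions

def ask_model(context, key):
    # Stub: a real harness would send `context` plus a question about
    # `key` to the model API and parse the answer from the response.
    for line in context.split("\n"):
        if line.startswith(f"FACT[{key}]"):
            return line.split(": ", 1)[1]
    return None

needles = {"launch": "Gemini 3.1 Pro shipped February 19, 2026."}
haystack, _ = build_haystack(needles)
recall = sum(ask_model(haystack, k) == v for k, v in needles.items()) / len(needles)
print(f"Recall: {recall:.0%}")
```

A real run would vary needle depth systematically (MRCR-style scoring is depth-bucketed), but the structure is the same: retrieval is only trustworthy if recall holds across all depths, not just near the ends of the context.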
One thing I tracked in actual testing: Gemini 3.1 Pro tends to generate longer responses by default. Gemini's max output is 65,536 tokens vs Claude Opus 4.6's 128,000 tokens — Claude actually has more output capacity. But in practice, Gemini has historically generated more verbose responses to similar prompts.
Google specifically tuned 3.1 Pro to address this. JetBrains' Director of AI noted that 3.1 Pro is "more efficient" — requiring fewer output tokens while delivering more reliable results. The verbosity reduction is one of the documented improvements in this release.
For cost-sensitive production workloads, this matters: output tokens are billed at $12/M for Gemini vs $25/M for Opus. If Gemini generates 20% more output tokens by default, that partially offsets the per-token advantage. Monitor your actual output token counts during testing rather than assuming the per-token price difference translates directly to cost savings at the same ratio.
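To see how verbosity erodes a per-token price advantage, here is a back-of-envelope model. The $12/M and $25/M output rates are the figures above; the 20% verbosity ratio is a hypothetical placeholder you'd replace with your own measured token counts.

```python
# Back-of-envelope cost model: per-token price alone can mislead when
# default verbosity differs between models. Rates are the article's
# February 2026 figures; the 1.2x verbosity ratio is an assumption.

def monthly_output_cost(base_output_tokens_m, price_per_m, verbosity_ratio=1.0):
    """Dollars per month of output, scaled by how verbose the model is
    relative to a common baseline workload."""
    return base_output_tokens_m * verbosity_ratio * price_per_m

BASELINE_M = 50  # 50M output tokens/month at baseline verbosity

opus = monthly_output_cost(BASELINE_M, price_per_m=25)
gemini = monthly_output_cost(BASELINE_M, price_per_m=12, verbosity_ratio=1.2)

print(f"Opus 4.6:       ${opus:,.0f}/month")
print(f"Gemini 3.1 Pro: ${gemini:,.0f}/month (assumed 20% more verbose)")
print(f"Effective advantage: {opus / gemini:.2f}x")
```

At a 20% verbosity penalty, the effective output-cost advantage drops from about 2.1x to roughly 1.7x — the kind of shift you only catch by logging real token counts rather than multiplying list prices.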
Per the Gemini API pricing page and Anthropic's official pricing, these are the verified February 2026 rates:

| Model | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- |
| Gemini 3.1 Pro | $2.00 | $12.00 |
| Claude Opus 4.6 | $5.00 | $25.00 |
Both models also have batch pricing (50% off for async workloads): Gemini Batch API brings input to $1.00/M; Anthropic's batch brings Opus input to $2.50/M. The relative gap holds.
The pricing difference is real, but the "7x cheaper" figure circulating in some headlines overstates it. The accurate figure is 2.5x cheaper on input and 2.1x cheaper on output. At production scale, that difference still compounds into significant numbers.
At 50 million output tokens per month (a realistic mid-size production workload), output spend comes to roughly $600/month on Gemini 3.1 Pro versus $1,250/month on Claude Opus 4.6: a gap of about $650/month, or $7,800/year, before input costs.
At 200 million output tokens/month, enterprise scale, the same math gives $2,400 versus $5,000: a gap of $2,600/month, over $31,000/year.
For teams where GDPval expert task quality is critical — legal review, financial analysis, complex knowledge work where Claude's 289-Elo lead translates to real output quality differences — that premium may be genuinely worth paying. For teams running general-purpose AI infrastructure where benchmark gaps won't show up in your specific task distribution, Gemini 3.1 Pro's cost advantage is substantial.

Pick Gemini 3.1 Pro when the following describes your workload.
You need the largest production context window as a default. Gemini's 1M tokens is standard for all API users. If you're building full-codebase analysis, long-document processing, or large-scale data synthesis, this isn't close right now.
Raw reasoning and scientific computation are core to your product. 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond are real, independently verified scores. If your users are researchers, engineers running novel analyses, or scientific computing workflows, Gemini 3.1 Pro's reasoning edge will show up.
You're optimizing for cost at production scale. At 2.5x cheaper on input and 2.1x cheaper on output, the math is hard to argue with for workloads where both models would produce acceptable quality outputs.
You want competitive programming or algorithmic coding performance. LiveCodeBench Pro Elo of 2887 significantly outpaces GPT-5.2. For pure algorithm problems, Gemini leads.
You need three-tier thinking level control. The Low/Medium/High thinking system lets you dial reasoning depth per request — useful for cost management in production workloads with mixed task complexity.
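As a sketch of how per-request thinking control enables that cost management: the Low/Medium/High tiers are from the release, but the complexity heuristic and the cost multipliers below are assumptions made up for illustration, not published Gemini figures.

```python
# Illustrative router for three-tier thinking levels. The multipliers
# and the keyword heuristic are assumptions for the sketch; a real
# router would classify tasks with a cheap model or learned policy.

THINKING_COST_MULTIPLIER = {"low": 1.0, "medium": 2.5, "high": 6.0}  # assumed

def pick_thinking_level(task: str) -> str:
    """Crude heuristic: escalate reasoning depth for tasks that look
    multi-step or analytical; keep routine tasks on the cheap tier."""
    heavy = ("prove", "derive", "debug", "optimize")
    medium = ("summarize", "compare", "refactor")
    text = task.lower()
    if any(word in text for word in heavy):
        return "high"
    if any(word in text for word in medium):
        return "medium"
    return "low"

tasks = [
    "Translate this paragraph to French",
    "Compare these two contracts clause by clause",
    "Debug this race condition in the scheduler",
]
for task in tasks:
    level = pick_thinking_level(task)
    print(f"{level:>6} (x{THINKING_COST_MULTIPLIER[level]}): {task}")
```

The point of the sketch: with mixed task complexity, routing even a majority of traffic to the low tier can dominate per-token price differences between providers.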

Pick Claude Opus 4.6 when these are your constraints.
Expert knowledge work quality is non-negotiable. The 289-Elo GDPval-AA gap is the clearest signal in the comparison. If your use case involves professional knowledge synthesis, business documentation, financial analysis, or legal reasoning — tasks where output quality directly affects decisions — Opus has a meaningful lead.
Agentic computer use is core to your workflow. Opus 4.6 scores 72.7% on OSWorld — autonomous GUI navigation, form filling, multi-step desktop workflows. Gemini 3.1 Pro doesn't have a published score on this benchmark, and third-party evaluations suggest it trails significantly on computer-use tasks.
You're building multi-agent systems. Anthropic's Agent Teams feature enables parallel multi-agent Claude Code orchestration — Anthropic demonstrated it building a working C compiler from scratch (100,000 lines, boots Linux on three CPU architectures). This is a first-party capability with no direct Gemini equivalent.
Human preference on writing and document quality matters. Arena rankings reflect real user preferences. Claude Opus 4.6 leads Gemini by ~4 points on text quality preference. If your users are reading model outputs and judging quality subjectively, this signal matters.
Your team uses Claude Code or Anthropic's enterprise integrations. Claude in Excel, Claude in PowerPoint, and Claude Code's Agent Teams are Anthropic-specific integrations with no current Gemini equivalent. If your stack is Anthropic-native, the switching cost isn't justified by benchmark gaps alone.
Two weeks of frontier AI releases and the honest conclusion is: the choice between Gemini 3.1 Pro and Claude Opus 4.6 depends almost entirely on which benchmark category maps to your actual work.
Reasoning, scientific computation, large-context processing, cost efficiency → Gemini 3.1 Pro.
Expert knowledge work, agentic task reliability, human writing preference, computer use → Claude Opus 4.6.
If you're unsure which category your workflow falls into, the answer is probably to run both on your actual production tasks before committing. The benchmark numbers are real — but they're measuring specific things, and your specific thing might not be one of them.
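If you do run that head-to-head, the harness doesn't need to be elaborate. A minimal sketch, where the two model callables and the scoring function are stubs you'd replace with real API clients and a task-specific rubric:

```python
# Minimal side-by-side harness: run both models over the same task set
# and tally wins by your own scoring function.

def run_comparison(tasks, model_a, model_b, score):
    """Return win counts for model_a, model_b, and ties across tasks."""
    wins = {"a": 0, "b": 0, "tie": 0}
    for task in tasks:
        sa, sb = score(task, model_a(task)), score(task, model_b(task))
        wins["a" if sa > sb else "b" if sb > sa else "tie"] += 1
    return wins

# Stubs for illustration only; wire in real clients and a real rubric.
model_a = lambda task: task.upper()
model_b = lambda task: task[::-1]
score = lambda task, output: len(set(output) & set(task))

tasks = ["summarize q3 earnings", "draft release notes"]
print(run_comparison(tasks, model_a, model_b, score))
```

The scoring function is where all the real work lives: for knowledge-work tasks that usually means rubric grading (possibly LLM-assisted), not string overlap, but the comparison loop itself stays this simple.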
At Macaron, we build tools that help you structure AI outputs into trackable, repeatable workflows — so you can actually measure which model delivers better results for your use case rather than guessing from benchmark tables. Try it free at macaron.im.
Does Gemini 3.1 Pro beat Claude Opus 4.6 overall?
On benchmark count alone, yes — Gemini leads 13 of 16 evaluated benchmarks. But the benchmarks Claude leads (GDPval expert tasks, Arena user preference, agentic computer use) are arguably more directly connected to real enterprise workflows. There's no clean overall winner; it depends on which performance dimension matters for your specific use case.
Is the context window comparison fair?
Mostly no. Gemini 3.1 Pro's 1M context is standard for all API users. Claude Opus 4.6's 1M context is in beta, restricted to Tier 4 organizations with custom rate limits. For most developers evaluating today, the effective comparison is 1M vs 200K. That said, Anthropic is actively expanding 1M access, and the situation may change by Q2 2026.
Why does Claude lead on GDPval-AA if Gemini leads on ARC-AGI-2?
They're measuring different things. ARC-AGI-2 tests abstract reasoning on novel visual-logic puzzles — a skill that correlates with scientific and mathematical capability. GDPval-AA tests performance on professional knowledge work tasks across 44 occupations, scored by expert raters. Strong abstract reasoning doesn't automatically translate to strong expert-task performance. Claude appears to have better instruction-following, writing quality, and professional context judgment even where Gemini has stronger raw reasoning.
Should I switch from Claude Opus 4.6 to Gemini 3.1 Pro?
Run the comparison on your actual task distribution, not headline benchmarks. If your core use cases are in Gemini's stronger domains (reasoning, scientific computation, large-context processing) and you're sensitive to cost, the case for switching is real. If your use cases are in Claude's domains (expert knowledge work, agentic computer use, writing quality), the benchmark gap doesn't support the migration.
Are both models available now?
Claude Opus 4.6 is fully available via the Anthropic API (model ID: claude-opus-4-6) as of February 5, 2026. Gemini 3.1 Pro is in preview (model ID: gemini-3.1-pro-preview) as of February 19, 2026. The preview status means Google may adjust pricing and specs before general availability.
What about Claude Sonnet 4.6 in this comparison?
Worth mentioning: Claude Sonnet 4.6 (released February 17, $3/$15 per million tokens) actually beats Opus 4.6 on GDPval-AA (1633 vs 1606 Elo) and Finance Agent benchmarks, while scoring within 1-2 points of Opus on SWE-Bench and OSWorld. For most workloads, Sonnet 4.6 vs Gemini 3.1 Pro is the more relevant comparison — and Sonnet's pricing ($3/$15) puts it much closer to Gemini's cost structure.