Gemini 3.1 Pro Benchmarks Explained: What the Scores Actually Mean

Hey fellow model-watchers — if you've spent any time in the last week trying to decode what "77.1% on ARC-AGI-2" actually means for your workflow, this one's for you.

I'll be honest: my first reaction to Google's February 19 Gemini 3.1 Pro launch was skepticism. Another launch-day benchmark dump. Another set of numbers I'd have to cross-reference against three other sources before I could trust them. But this time, the numbers held up — and a few of them genuinely surprised me.

So here's the plain-English breakdown I wish existed when I started digging.


Why These Benchmarks Matter (And Where They Fall Short)

Here's what I keep reminding myself every time a new model drops: benchmarks are proxies, not ground truth. They're asking "how close can this model get to human-level performance on a specific, controlled task?" That's useful. It's not the whole picture.

What benchmarks can't tell you: how the model feels to use. Whether it follows weird edge-case instructions. How it handles your specific data format. Those are things you have to test yourself.

But here's where they do matter: when you're choosing between frontier models for an expensive production workload, you need some signal beyond vibes. Benchmarks give you that signal — imperfectly, but meaningfully.

One more caveat worth flagging: Google self-reported these scores on launch day. The AI community has learned to be cautious about that. Independent verification is still catching up as of late February 2026. I've cross-checked against Artificial Analysis and third-party reviews — the numbers are consistent — but keep that asterisk in mind.


ARC-AGI-2 — The Reasoning Test That's Hardest to Game

What It Tests and Why 77.1% Is a Big Deal

ARC-AGI-2 is designed to be ungameable. The whole point is to present novel logic puzzles — patterns a model can't have memorized from training data. It's measuring something closer to fluid reasoning than knowledge recall.

Gemini 3.1 Pro hit 77.1% on this benchmark. Gemini 3 Pro, released in November, scored 31.1%. That's not an incremental improvement. That's a 2.5x jump in one point release.

To put the competitive landscape in perspective:

| Model | ARC-AGI-2 Score |
|---|---|
| Gemini 3.1 Pro | 77.1% |
| Claude Opus 4.6 | 68.8% |
| GPT-5.3-Codex | 52.9% |
| Gemini 3 Pro | 31.1% |

(Source: Google DeepMind model card, February 2026)

That gap over Claude Opus 4.6 (8+ percentage points) is significant. This is the benchmark where Gemini 3.1 Pro has its clearest, most defensible lead.

What's Driving the Jump — and What It's Good For

What makes the ARC-AGI-2 result genuinely interesting isn't just the score — it's what caused it. Google attributes this to more efficient thinking: extracting more insight per compute token during the reasoning chain. It's not just "bigger model, better score." The architecture is doing something different.

For tasks like synthesizing conflicting research, working through multi-step logic problems, or generating code for genuinely novel requirements (not boilerplate), this improvement is load-bearing. That said — a score of 77.1% also means the model fails roughly 1 in 4 novel reasoning tasks. Treat it as a strong tool, not a magic one.


Coding Benchmarks — SWE-Bench and LiveCodeBench

SWE-Bench 80.6%: Real GitHub Issues, Real Fixes

SWE-Bench Verified is my favorite coding benchmark to cite because it's grounded in reality. It takes actual GitHub issues — real bugs in real codebases — and asks models to fix them end-to-end. No toy problems.

Gemini 3.1 Pro scored 80.6% on SWE-Bench Verified. Claude Opus 4.6 scored 80.8%. That's a 0.2 percentage point difference — effectively a tie.

| Model | SWE-Bench Verified |
|---|---|
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
| GPT-5.3-Codex (SWE-Bench Pro) | 56.8% |

(Note: GPT-5.3-Codex benchmarks were only reported for a subset of tasks, making direct comparison difficult.)

For most software engineering workflows, the practical difference between these two models here is negligible. Pick based on other factors — cost, context window, ecosystem fit.

LiveCodeBench Elo 2887 — and Where GPT-5.3-Codex Still Leads

LiveCodeBench measures competitive programming: isolated algorithmic problems, not real-world codebases. Gemini 3.1 Pro scores 2887 Elo here, significantly ahead of GPT-5.2 at 2393. That's a strong signal for workloads that demand generating novel algorithms under tight constraints.

But here's the part that often gets buried: for terminal-heavy agentic coding tasks, GPT-5.3-Codex still leads. Terminal-Bench 2.0 shows GPT-5.3-Codex at 77.3% vs Gemini 3.1 Pro at 68.5%. If your workflow is agents running bash loops in a terminal environment, that gap matters.


GPQA Diamond — PhD-Level Science at 94.3%

GPQA Diamond is a benchmark of graduate-level science questions — the kind that require genuine domain expertise, not pattern matching. Gemini 3.1 Pro scores 94.3% here, ahead of Claude Opus 4.6 at 91.3% and GPT-5.2 at 92.4%.

| Model | GPQA Diamond |
|---|---|
| Gemini 3.1 Pro | 94.3% |
| GPT-5.2 | 92.4% |
| Claude Opus 4.6 | 91.3% |

For scientific research workflows — literature synthesis, hypothesis generation, analyzing methodology in papers — this result has real practical weight. Pair it with the 1M token context window and you have a model that can ingest multiple research papers simultaneously and reason across them.
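To make the "multiple papers simultaneously" claim concrete, here's a back-of-the-envelope capacity estimate. The tokens-per-paper figure and prompt overhead are loose assumptions of mine (paper lengths vary widely); only the 1M-token window comes from Google's published spec:

```python
# Rough capacity estimate for a 1M-token context window.
CONTEXT_WINDOW = 1_000_000   # tokens, per Google's published spec
TOKENS_PER_PAPER = 15_000    # assumed: ~10k words at ~1.5 tokens/word
PROMPT_OVERHEAD = 5_000      # assumed: instructions, scaffolding, etc.

papers_that_fit = (CONTEXT_WINDOW - PROMPT_OVERHEAD) // TOKENS_PER_PAPER
print(papers_that_fit)  # 66
```

Even if my per-paper estimate is off by half, that's still dozens of full papers in a single prompt — a qualitatively different workflow than chunked retrieval.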

That said, I'd still spot-check any scientific output against primary sources. A 94.3% score means 5.7% of PhD-level questions get the wrong answer. For anything high-stakes, verify.


The One Benchmark Where Gemini 3.1 Pro Loses

GDPval-AA Expert Preference: Claude Wins by a Lot

This is the result I find most interesting — and the one Google doesn't lead with in their launch post.

GDPval-AA measures performance on real-world expert tasks: financial modeling, business documentation, research synthesis, strategic planning. It's not about solving abstract puzzles. It's about doing the kind of knowledge work that shows up in actual professional environments.

The Elo scores here are striking:

| Model | GDPval-AA Elo |
|---|---|
| Claude Sonnet 4.6 (Thinking Max) | 1633 |
| Claude Opus 4.6 | 1606 |
| Gemini 3.1 Pro | 1317 |

(Source: multiple third-party benchmark analyses, February 2026)

That's not a close race. A 316-point Elo gap is significant. Claude Sonnet 4.6 — not even Opus — leads this benchmark by a wide margin.
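To translate that gap into something tangible: the standard Elo expected-score formula converts a rating difference into a head-to-head preference rate. This is an intuition pump using textbook Elo math — GDPval-AA's exact scoring methodology may differ in its details:

```python
# Convert an Elo rating gap into an expected head-to-head win rate
# using the standard Elo expected-score formula.
def expected_score(rating_a: int, rating_b: int) -> float:
    """Probability that A's output is preferred over B's under standard Elo."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

p = expected_score(1633, 1317)  # Claude Sonnet 4.6 vs Gemini 3.1 Pro
print(f"{p:.0%}")  # roughly 86%
```

Under standard Elo assumptions, a 316-point gap means evaluators would prefer Claude's output roughly 86% of the time. That's a landslide, not a lean.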

What This Means for Writing and Expert Analysis Tasks

Here's my take: Gemini 3.1 Pro appears to have been optimized for breadth and algorithmic performance. Claude was optimized for depth and polish on expert-level output. These are different things, and they show up clearly in GDPval-AA.

If your work involves producing professional deliverables — reports, analysis documents, anything where output quality is the primary constraint — Claude models currently hold a meaningful edge. The benchmark numbers aren't the full explanation, but they're pointing at something real.

At Macaron, we think about this stuff constantly — not because benchmarks are the point, but because the model you're running underneath your workflow determines whether your plans actually land. Macaron lets you delegate real tasks and build personal mini-apps without rewiring your entire stack or managing API costs yourself. If you want to test whether a structured AI workflow actually survives contact with your real work, you can run a trial task inside Macaron — low stakes, you judge the output yourself.


Frequently Asked Questions

Q: Is Gemini 3.1 Pro better than Claude Opus 4.6 overall? A: "Better overall" depends entirely on your use case. Gemini 3.1 Pro leads on ARC-AGI-2, GPQA Diamond, LiveCodeBench, and APEX-Agents. Claude Opus 4.6 leads on GDPval-AA expert tasks and Humanity's Last Exam with tools. They're essentially tied on SWE-Bench Verified. If cost matters, Gemini is 7.5x cheaper per input token ($2 vs $15 per million tokens).
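The cost math is worth seeing with real numbers. The per-token rates are the figures quoted above; the monthly volume is a made-up example workload, not a benchmark figure:

```python
# Input-token cost comparison at the rates quoted above.
GEMINI_RATE = 2.00    # USD per 1M input tokens (Gemini 3.1 Pro)
CLAUDE_RATE = 15.00   # USD per 1M input tokens (Claude Opus 4.6)
MONTHLY_INPUT_TOKENS = 500_000_000  # assumed: 500M tokens/month

gemini_cost = GEMINI_RATE * MONTHLY_INPUT_TOKENS / 1_000_000   # 1000.0
claude_cost = CLAUDE_RATE * MONTHLY_INPUT_TOKENS / 1_000_000   # 7500.0
print(claude_cost / gemini_cost)  # 7.5
```

At any meaningful volume, that 7.5x multiple dominates small benchmark deltas — which is why cost shows up in every "which model" decision below.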

Q: What does ARC-AGI-2 actually test? A: Novel pattern recognition that can't be memorized from training data. It's designed to be a proxy for general reasoning ability rather than knowledge recall. The full methodology is documented by the ARC Prize organization.

Q: Is the 80.6% SWE-Bench score reliable? A: SWE-Bench Verified is generally considered one of the more rigorous coding benchmarks because it uses real GitHub issues. The score is consistent with third-party analyses, though any launch-day self-reported score warrants some caution until independently verified.

Q: What is GDPval-AA and why does it matter? A: It measures economically valuable tasks — the kind of expert work that shows up in offices, not labs. Financial analysis, document drafting, research synthesis. It tends to correlate better with real-world professional usefulness than abstract reasoning benchmarks.

Q: Where can I access Gemini 3.1 Pro right now? A: As of late February 2026, it's available in preview via Google AI Studio, the Gemini API, Vertex AI, Gemini CLI, Gemini app (Pro/Ultra plans), and NotebookLM. General availability timing hasn't been announced yet.

Q: Which model should I use for coding in 2026? A: It depends. For competitive algorithms and high-volume production: Gemini 3.1 Pro (better cost, strong LiveCodeBench). For precision real-world bug-fixing: Claude Opus 4.6 (slightly edges SWE-Bench Verified). For terminal-heavy agentic tasks: GPT-5.3-Codex (leads Terminal-Bench 2.0 at 77.3%). Most serious engineering teams are routing tasks across models rather than committing to one.
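The routing approach in that last answer can be sketched in a few lines. Note the model identifiers here are placeholders of mine, not official API model ids — check each provider's docs for the real strings:

```python
# A minimal task-router sketch. Model identifiers are placeholders,
# not official API model ids.
ROUTES = {
    "algorithm": "gemini-3.1-pro",   # strong LiveCodeBench, low cost
    "bugfix": "claude-opus-4.6",     # slight SWE-Bench Verified edge
    "terminal": "gpt-5.3-codex",     # leads Terminal-Bench 2.0
}

def pick_model(task_type: str, default: str = "gemini-3.1-pro") -> str:
    """Return the preferred model for a task type, falling back to a default."""
    return ROUTES.get(task_type, default)

print(pick_model("terminal"))  # gpt-5.3-codex
```

In practice the routing table is the easy part; the hard part is classifying incoming tasks reliably, which is why most teams start with a manual mapping like this before automating it.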

Hey, I’m Hanks — a workflow tinkerer and AI tool obsessive with over a decade of hands-on experience in automation, SaaS, and content creation. I spend my days testing tools so you don’t have to, breaking down complex processes into simple, actionable steps, and digging into the numbers behind “what actually works.”

Apply to become Macaron's first friends