DeepSeek V4 Benchmarks: MMLU, HumanEval & SWE-bench

Fellow benchmark hunters — if you've been trying to find a clean comparison of DeepSeek V4's numbers without wading through a dozen sites that all cite the same unverified leak, I did that wade for you.

I'm Hanks. I test AI infrastructure in real workflows, and when a model hasn't shipped yet, I make sure to label what's real versus what's a press-release number someone's hoping you'll memorize before you can check.

My core question going into this: which DeepSeek V4 benchmark claims can you actually rely on right now, and what's the verified baseline you should be comparing against?

Here's the honest answer — and it's more useful than you'd expect.


Benchmark Summary Table

Before anything else: a clear status label on every number in this article.

Label key used throughout:

  • ✅ Verified — from an official technical report, Hugging Face model card, or independent community reproduction
  • ⚠️ Claimed — from internal DeepSeek testing only, not independently reproduced
  • 🔲 Unknown — not disclosed
| Benchmark | DeepSeek V3 ✅ | DeepSeek R1 ✅ | DeepSeek V3.2 ✅ | DeepSeek V4 ⚠️ | GPT-4o ✅ | Claude Opus 4.5 ✅ |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU | 88.5 | 90.8 | ~88+ | ⚠️ TBC | ~88 | 🔲 |
| MMLU-Pro | 75.9 | 84 | 85 | ⚠️ TBC | ~72 | 🔲 |
| HumanEval | 65.2* | 90.2 | ~90+ | ⚠️ 90% | ~90 | ⚠️ 88% |
| SWE-bench Verified | 🔲 | 🔲 | 67.8 | ⚠️ 80%+ | 🔲 | ✅ 80.9% |
| MATH-500 | 🔲 | 97.3 | 🔲 | ⚠️ TBC | 🔲 | 🔲 |
| AIME 2025 | 🔲 | 87.5 | 89.3 | ⚠️ TBC | 🔲 | 🔲 |
| LiveCodeBench | 🔲 | 65.9 | 74.1 | ⚠️ TBC | 🔲 | 🔲 |
| GPQA Diamond | 59.1 | 71.5 | 79.9 | ⚠️ TBC | 🔲 | ✅ 85.2% |
| Context Window | 128K → 1M (API) | 164K | 128K | ⚠️ 1M | 128K | 200K |

*V3's HumanEval baseline score reflects the pre-R1-distillation checkpoint. V3.2 and R1 distillation significantly improved coding output quality.

Bottom line on V4: every V4 number in this table is an internal claim from DeepSeek or secondary reporting of those claims. Not one has been independently reproduced. That's not a reason to dismiss them — V3's internal claims proved accurate — but it's a reason not to make infrastructure decisions based on them yet.


Coding — HumanEval & SWE-bench

Coding is V4's explicit design target. This is where the benchmark story matters most — and where the gap between claimed and verified is widest.

V4 vs GPT-4o vs Claude

HumanEval — what it actually tests: HumanEval is the most cited coding benchmark in existence. It grades models on whether they can write Python functions that pass hidden unit tests — no partial credit, no subjective scoring. A 90% score means 9 out of 10 functions work correctly on the first try.
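The pass/fail mechanic is easy to sketch. Below is a minimal illustration of HumanEval-style grading, not the actual harness: the `solution` name, the toy tests, and the unsandboxed `exec` are all simplifying assumptions (the real evaluator runs generated code in an isolated sandbox).

```python
def passes_hidden_tests(candidate_src: str, tests) -> bool:
    """HumanEval-style grading (sketch): exec the generated function,
    then run it against hidden unit tests. Binary outcome: every test
    must pass, or the sample scores zero. No partial credit."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # real harnesses sandbox this step
        for args, expected in tests:
            if namespace["solution"](*args) != expected:
                return False
        return True
    except Exception:
        return False

# Toy "generated" completion and its hidden tests (illustrative names).
sample = "def solution(x):\n    return x * 2"
hidden = [((2,), 4), ((0,), 0), ((-3,), -6)]
print(passes_hidden_tests(sample, hidden))  # True
```

A 90% score means nine out of ten such samples return `True` on the first attempt.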

Here's where things get complicated. The claimed V4 score (90%) is essentially identical to what R1 and GPT-4o already achieve. If that number holds under independent testing, V4 matches the current ceiling rather than breaking through it.

| Model | HumanEval Score | Status | Notes |
| --- | --- | --- | --- |
| DeepSeek R1 | 90.20% | ✅ Verified | Community reproduced |
| GPT-4o | ~90% | ✅ Published | OpenAI model card |
| DeepSeek V4 | 90% | ⚠️ Claimed | Internal testing only |
| Claude (est.) | 88% | ⚠️ Secondary claim | Not from official card |
| Mistral Large | 92.00% | ✅ Leaderboard | Vertu 2026 leaderboard |

The more important benchmark for V4 is SWE-bench.

SWE-bench Verified — what it actually tests: SWE-bench doesn't test toy functions. It presents models with 500 real GitHub issues from popular open-source repositories and asks them to generate patches that actually fix the bugs. This is the benchmark that best reflects real developer workflows — messy, cross-file, context-heavy.
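The resolution criterion can be sketched as a scoring loop. The two callables below stand in for the real machinery (git-applying the model's patch, then running the repository's test suite); names and the mock data are illustrative, not the official harness.

```python
def swe_bench_score(issues, apply_patch, run_tests):
    """SWE-bench-style scoring loop (sketch): an issue counts as resolved
    only if the model's patch applies cleanly AND the repo's tests pass
    afterward. `apply_patch` / `run_tests` are stand-ins for the real
    git-apply + test-suite machinery."""
    resolved = 0
    for issue in issues:
        if apply_patch(issue) and run_tests(issue):
            resolved += 1
    return 100.0 * resolved / len(issues)

# Toy run over 500 mock issues, ~81% of which are "solved".
issues = list(range(500))
score = swe_bench_score(
    issues,
    apply_patch=lambda i: True,   # pretend every patch applies cleanly
    run_tests=lambda i: i < 405,  # tests pass for 405 of the 500 issues
)
print(score)  # 81.0
```

The test-suite gate is what makes the benchmark hard to game: a plausible-looking patch that breaks an unrelated test still scores zero for that issue.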

The current verified leader is Claude Opus 4.5 at 80.9% solve rate. DeepSeek V3.2 sits at 67.8% — verified by the open-source leaderboard as of February 2026. MiniMax M2.5, a less-discussed model, posts 80.2% — the highest verified open-source SWE-bench score currently available.

V4's claimed target: exceeding 80%. If that holds under independent testing, it would be the first open-source model to match or beat Claude Opus 4.5 on real-world software engineering tasks. That's a meaningful milestone — not just a number.

| Model | SWE-bench Verified | Status |
| --- | --- | --- |
| Claude Opus 4.5 | 80.90% | ✅ Verified (current leader) |
| MiniMax M2.5 | 80.20% | ✅ Verified |
| DeepSeek V4 | 80%+ target | ⚠️ Claimed, unverified |
| DeepSeek V3.2 | 67.80% | ✅ Verified |
| DeepSeek V3.1 | 66.00% | ✅ Verified |

The architecture case for V4 hitting this target is stronger than you'd expect from a typical pre-launch claim. Engram's O(1) memory lookup directly addresses why current models fail SWE-bench at the hard end: they lose cross-file context. The 1M token window means V4 can ingest an entire repository before generating its first patch. That's a structural advantage, not a fine-tuning trick.


Reasoning — MMLU & MATH-500

Reasoning benchmarks tell a different story. Here, V3's established lineage already competes at the frontier — and V4's improvements are targeted at coding scale, not pure reasoning benchmarks.

MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects from elementary science to professional law. It's the baseline general intelligence exam for LLMs. Think of it as the SAT of AI benchmarks.

Verified scores as of February 2026:

| Model | MMLU | MMLU-Pro | Status |
| --- | --- | --- | --- |
| DeepSeek R1 | 90.8 | 84 | ✅ Verified |
| DeepSeek V3.2 | ~88+ | 85 | ✅ Verified |
| DeepSeek V3.1 | 88.5 | 75.9 | ✅ Verified |
| Qwen 3 235B | 84.4 | 🔲 | ✅ Leaderboard |
| GPT-4o | ~88 | ~72 | ✅ Published (approx.) |

V4 MMLU claims haven't surfaced in any traceable form. The safe assumption: V4 will perform similarly to or better than V3.2 on general knowledge tasks, since it extends the same base architecture. But reasoning benchmark leadership in the DeepSeek family currently sits with R1, not the V-series.

MATH-500 is where R1 genuinely stands out: 97.3%, verified. This is a near-ceiling score. The R1 training pipeline — multi-stage reinforcement learning without early supervised fine-tuning — produces reasoning chains that self-verify on mathematical problems in a way V-series models don't replicate. V4's architecture is optimized for code, not theorem-proving. Expect V4 to be competitive on MATH-500, but don't expect it to displace R1 as the math leader.


Multilingual Performance

This is the most underreported part of the DeepSeek benchmark story — and one where the V-series has a genuine, verified edge.

DeepSeek V3 was pretrained on a multilingual corpus with English and Chinese as the majority languages, with optimized compression efficiency for multilingual tokenization via its 128K-token Byte-level BPE vocabulary. V3.1 extended this further: over 100 languages supported, including low-resource and Asian languages, with particular strength in Chinese factual knowledge that outperforms closed-source models.

Concrete verified performance anchors from the V3 technical report:

| Benchmark | DeepSeek V3 | Notes |
| --- | --- | --- |
| C-Eval (Chinese) | Top open-source | Surpasses Qwen2.5 72B on most subcategories |
| CMMLU | Near-Qwen2.5 72B | Only benchmark where Qwen2.5 72B is competitive |
| Multilingual reasoning | Outperforms LLaMA 3.1 405B | Despite 91% fewer activated parameters |

V4 multilingual numbers are not yet public. What we know from the architecture: the 128K-token vocabulary and tokenizer from V3 carry forward, and V4's training dataset composition hasn't been disclosed. The reasonable expectation is parity or improvement on V3.1's multilingual baseline. For teams with Chinese or Asian language requirements, V3.1 is already a safer bet than any Western frontier model — and V4 is likely to extend that advantage.


Long-Context Needle-in-Haystack

This benchmark is the clearest differentiator between V4 and everything else currently available — and it's the one with the most verifiable supporting evidence.

Needle-in-Haystack (NIAH) tests whether a model can retrieve a specific piece of information embedded at various positions within a very long context. It's the benchmark that exposes context window degradation — models that claim 128K or 1M tokens but lose coherence at 50K.
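The probe itself is simple to construct. Here's a minimal sketch of how a NIAH prompt is built; the filler sentence and needle are toy placeholders, and a real run sweeps both the depth and the total context length to produce the accuracy grid.

```python
def build_niah_prompt(needle: str, depth_pct: float, n_sentences: int = 200) -> str:
    """Needle-in-a-Haystack probe (sketch): bury one retrievable fact at a
    chosen depth inside filler text, then ask the model to recall it.
    Accuracy is measured across many (depth, context-length) combinations."""
    filler = "The sky was clear and the market opened on time. "
    haystack = [filler] * n_sentences
    position = int(len(haystack) * depth_pct / 100)
    haystack.insert(position, needle + " ")
    context = "".join(haystack)
    return context + "\n\nQuestion: what is the magic number mentioned above?"

prompt = build_niah_prompt("The magic number is 7481.", depth_pct=50)
print("7481" in prompt)  # True
```

A model "passes" a cell in the grid if its answer contains the needle; degradation shows up as failures clustered at particular depths or lengths.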

Verified V3 baseline: full NIAH accuracy maintained across the entire 128K context window. No degradation at the edges. This is documented in the official V3 model card on Hugging Face.

V4 + Engram claimed result: 97% accuracy at 1 million tokens. The comparison point is 84.2% for standard attention architectures at the same scale. That's not a marginal improvement — it's the difference between a model that can actually reason over a full codebase and one that statistically loses the thread.

| Model | Context Window | NIAH Accuracy | Status |
| --- | --- | --- | --- |
| DeepSeek V3 | 128K | ~100% at 128K | ✅ Verified |
| DeepSeek V4 | 1M | 97% at 1M | ⚠️ Engram paper (peer-reviewed), model unverified |
| Standard attention (baseline) | 1M | 84.20% | ✅ Engram paper benchmark |
| Llama 4 Maverick | 1M | TBC | 🔲 Not published |
| Nemotron Nano 30B | 1M | TBC | 🔲 Not published |

The 97% figure comes from the Engram paper (arXiv:2601.07372), which is peer-reviewed and independently reproducible. The paper's methodology and results are verifiable. What isn't verified is that V4 actually implements Engram as described — that requires the model to ship and community testers to confirm. The February 11 silent upgrade expanding context windows to 1M tokens in DeepSeek's production API, confirmed by independent community testing showing >60% accuracy at full 1M length, is the strongest signal that this capability is real and being staged for release.


Independent vs Self-Reported Results

This is the section most benchmark articles skip. Let me be direct about the sourcing problem here.

What's verified (traceable to technical reports, Hugging Face model cards, or community reproduction):

  • All DeepSeek V3, V3.1, and V3.2 scores listed above
  • DeepSeek R1 reasoning scores (MMLU 90.8, MATH-500 97.3, AIME 79.8%)
  • Claude Opus 4.5 SWE-bench 80.9% (Anthropic published)
  • Engram paper NIAH improvement (97% vs 84.2% at 1M tokens)

What's internal-claim only:

  • V4 HumanEval 90% (cited from sources referencing "internal benchmark testing")
  • V4 SWE-bench 80%+ (same sourcing — The Information reporting, not DeepSeek technical report)
  • V4 total parameter count ~1T (inferred from MODEL1 repository analysis, not confirmed)

What's completely unverified and should be ignored:

  • "$0.10/M token pricing" — no source traceable to DeepSeek
  • "98% HumanEval" — appears in some blogs, not in any primary source
  • Specific training compute costs for V4

The pattern from R1's launch is instructive here: DeepSeek maintained operational silence before release, then published a comprehensive technical report with reproducible benchmark methodology on launch day. Within 48 hours of R1's release, community testers had independently confirmed the major benchmark claims. Expect the same pattern with V4.

Until that happens: use V3.2 scores as your planning baseline. V3.2 is live, verified, and $0.27/1M tokens at input.


How to Read These Benchmarks

This confused me for longer than I'd like to admit when I first started tracking model evals. Here's the framework I landed on.

HumanEval measures function-level code generation. One function, one docstring, hidden unit tests. It's clean and reproducible — which is why it became the standard. Its weakness: it doesn't test multi-file reasoning, dependency tracing, or real repository context. A model can score 90% on HumanEval and still struggle with SWE-bench.

SWE-bench measures real software engineering. Real GitHub issues, real repositories, patches that must actually work. It's messy and context-heavy in exactly the way production code is. High SWE-bench scores are harder to fake than HumanEval scores because the test surface is much larger. According to Analytics Vidhya's 2026 benchmark guide, SWE-bench is now considered the industry standard for evaluating real-world software engineering capability — displacing HumanEval as the primary coding benchmark for enterprise evaluation.

MMLU measures breadth of knowledge. 57 subjects, multiple choice. It's useful for comparing general capability floors across models. Its weakness: saturating models (90%+) are hard to differentiate because the ceiling effects are real. MMLU-Pro is more discriminating — harder questions, less guessable.
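The scoring itself is plain exact-match over answer letters. A minimal sketch (the real eval additionally averages per-subject across the 57 subjects, and extraction of the chosen letter from free-form output is its own source of variance):

```python
def mmlu_accuracy(predicted: list[str], answer_key: list[str]) -> float:
    """MMLU-style scoring (sketch): exact-match accuracy over
    multiple-choice letters A-D."""
    correct = sum(p == a for p, a in zip(predicted, answer_key))
    return 100.0 * correct / len(answer_key)

print(mmlu_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))  # 75.0
```

This is also why the ceiling effect bites: at 90%+ the remaining questions are a small, noisy slice, so two models a point apart may not be meaningfully different.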

MATH-500 and AIME measure mathematical reasoning depth. These are where R1's chain-of-thought training shows up most clearly. V4 is not targeting this space.

Needle-in-Haystack measures context faithfulness at scale. It's the only benchmark that directly tests whether a model's claimed context window is actually usable. A 1M token context that degrades to 84% accuracy is effectively a much shorter usable context window.

One more thing worth flagging: benchmark scores depend heavily on evaluation setup. Prompt engineering, output length limits, and temperature settings all affect results. The DeepSeek V3 technical report notes that all their evaluations used an 8K output length limit — changing that limit changes scores. When you see benchmark comparisons between models, check whether the evaluation conditions match.
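To make that concrete, here are the knobs as they'd appear in an OpenAI-compatible chat request (the format DeepSeek's API uses). The field names are the conventional ones; this is an illustrative config, not the settings of any specific published eval.

```python
# Evaluation knobs that move benchmark scores (illustrative sketch).
eval_config = {
    "model": "deepseek-chat",
    "temperature": 0.0,   # near-greedy decoding, standard for pass@1 scoring
    "max_tokens": 8192,   # the 8K output cap the V3 technical report used
    "top_p": 1.0,
}
# Rerunning the same benchmark with max_tokens=2048 can truncate long
# reasoning chains and lower the measured score: same model, different number.
print(eval_config["max_tokens"])  # 8192
```

When two leaderboards disagree on the same model, mismatched settings like these are the usual culprit.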


The Number That Actually Earns Your Attention

Here's where I land after tracking this for weeks: the V4 benchmark story isn't really about HumanEval. That number is already saturated — R1, GPT-4o, and Mistral Large are all clustering around 90%.

The number that matters is SWE-bench. If independent testing confirms V4 at 80%+, it becomes the first open-source model to match Claude Opus 4.5 on real-world software engineering tasks, and at expected DeepSeek pricing (historically 20–50× cheaper than Western alternatives). That's the claim worth waiting for verification on.

Until then, V3.2 at 67.8% SWE-bench and $0.27/1M tokens is the model you should be running today.
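For planning purposes, the cost math is linear and worth writing down. A minimal sketch using V3.2's published $0.27/1M input rate; the 200M-token workload is a hypothetical, and output-token pricing (which differs) is deliberately left out.

```python
def monthly_input_cost_usd(million_tokens: float, price_per_million: float) -> float:
    """Linear token-cost planning: input tokens only; output pricing differs."""
    return million_tokens * price_per_million

# Hypothetical workload: 200M input tokens/month at V3.2's $0.27/1M input rate.
print(monthly_input_cost_usd(200, 0.27))  # 54.0
```

At that scale, even a 30× price gap versus a Western frontier model is the difference between ~$54 and ~$1,600 a month on input alone, which is why the "same score, DeepSeek pricing" scenario matters.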

At Macaron, we turn your model evaluation decisions into structured, executable workflows — so you're testing AI against your actual tasks, not synthetic benchmarks. Try it free at macaron.im and judge the results yourself.


FAQ

Q: What are the confirmed DeepSeek V4 benchmark scores? None. As of February 28, 2026, DeepSeek V4 has not officially launched and no independent evaluations exist. The 90% HumanEval and 80%+ SWE-bench claims are from internal testing only.

Q: How does DeepSeek V3.2 compare to V4 on current benchmarks? V3.2 is the verified frontier model from DeepSeek right now: MMLU-Pro 85.0, SWE-bench 67.8, LiveCodeBench 74.1, AIME 2025 89.3. These are independently confirmed. V4 claims to exceed V3.2 significantly on SWE-bench — but that requires independent confirmation.

Q: Is DeepSeek R1 better than V4 for reasoning tasks? For pure reasoning (MATH-500, AIME, multi-step logic), R1's chain-of-thought training pipeline produces results that V4's architecture isn't specifically targeting. R1 at 97.3% MATH-500 and 90.8 MMLU is the current open-source reasoning benchmark leader. V4 targets coding at repository scale — a different capability dimension.

Q: Does SWE-bench actually matter for real development work? More than HumanEval does. SWE-bench uses real GitHub issues from popular open-source projects — the same kinds of problems developers encounter daily. A model that can generate a working patch for an actual repository bug is directly useful. A model that can write a Python function from a docstring is a narrower capability.

Q: What's the best verified benchmark baseline for planning V4 adoption? Use DeepSeek V3.2: SWE-bench 67.8%, MMLU-Pro 85.0%, LiveCodeBench 74.1%. This is the confirmed current performance tier. If V4 delivers on its SWE-bench claims, the coding workflow improvement would be roughly 18 percentage points — a meaningful jump. Plan conservatively against V3.2 numbers; treat V4 claims as upside if they hold.

Q: When will independent V4 benchmarks be available? Within 48–72 hours of V4's official launch, based on R1's community eval pattern. The open-source LLM leaderboard is updated frequently and is where community-verified scores typically surface first.

Q: Is the Engram NIAH result (97% at 1M tokens) real? The paper result is peer-reviewed and reproducible. The Engram architecture achieving 97% vs 84.2% for standard attention at 1M tokens is a real finding — verifiable from the published paper and code at arXiv:2601.07372. Whether V4 actually implements Engram as described requires the model to ship. The February 11 production API upgrade to 1M tokens suggests implementation is real and in staged rollout.
