DeepSeek V4 Parameters: 671B MoE Architecture Explained

What's up, model architecture nerds — if you've been staring at the "1 trillion parameters" headlines and wondering whether that number actually means anything for your workflow, this one's for you.

I'm Hanks. I test AI infrastructure inside real tasks, not slide decks. And right now, the DeepSeek parameter story is one of the most misread numbers in the space.

Here's the question I kept asking myself: if a model has 671 billion parameters but only uses 37 billion per token, what are you actually paying for — and what does that mean when V4 scales to a trillion?

Turns out, everything. Let me break it down.


Total vs Active Parameters — Why It Matters

Most AI coverage treats "parameter count" as a single number. It isn't. For Mixture-of-Experts models like DeepSeek, there are two numbers that actually matter:

  • Total parameters: every weight in the model, across all experts
  • Active parameters: the subset activated for any single token during inference

These are not the same thing, and confusing them leads to completely wrong conclusions about cost, speed, and hardware requirements.

Here's the clearest way I've found to think about it: total parameters determine what the model knows. Active parameters determine what it costs to think.

A model with 671B total but only 37B active is not a 671B model in any practical inference sense. It's a 37B model with 671B worth of specialist knowledge available on demand. DeepSeek-V3, with 671B total parameters and 37B active per token, achieves top-tier performance across multiple benchmarks while maintaining training efficiency with a cost of only 2.788 million H800 GPU hours.

That cost figure is the important number. It's what makes the MoE architecture economically viable at this scale — and it's the foundation V4 is built on.

| Metric | What It Measures | Why It Matters |
|---|---|---|
| Total parameters | Knowledge capacity | Model quality ceiling |
| Active parameters | Per-token compute cost | Inference speed + hardware cost |
| Activation ratio | Efficiency of the architecture | Cost per useful output |

For V3 (the confirmed baseline), the activation ratio is 37B ÷ 671B = 5.5%. Only 1 in 18 parameters fires per token. That's the engineering feat.
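The arithmetic is trivial to sanity-check in a couple of lines (both figures are the confirmed V3 numbers from above):

```python
# Activation ratio for DeepSeek-V3 (confirmed figures).
total_params = 671e9   # total parameters across all experts
active_params = 37e9   # parameters activated per token

ratio = active_params / total_params
print(f"activation ratio: {ratio:.1%}")                # ~5.5%
print(f"1 in {total_params / active_params:.0f} parameters fires per token")
```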


MoE: How 671B Becomes 37B Active

Expert Count & Routing Logic

The Mixture-of-Experts architecture routes each token through a small subset of specialist networks — called "experts" — instead of passing it through the full model every time. Dense models activate all of their weights on every token. MoE models like DeepSeek, by contrast, only activate a small subset of parameters per token — typically well under 10%.

For V4, the architecture extends this with a "Top-16" routing strategy. DeepSeek V4 is a technical marvel of sparse architecture, utilizing a massive 1-trillion parameter total count while only activating approximately 32 billion parameters for any given token. This "Top-16" routed MoE strategy allows the model to maintain the specialized knowledge of a titan-class system without the crippling latency or hardware requirements usually associated with models of this scale.

A few things worth unpacking in that:

Why 16 experts per token? Routing through more experts increases answer quality but costs more compute. Routing through fewer is faster but narrower. 16 is DeepSeek's current sweet spot — wide enough for complex reasoning, efficient enough for production throughput.

What are "shared generalists"? Not all experts are specialists. V4's MoE stack includes shared expert layers that activate for every token, functioning like a base reasoning layer on top of which routed specialists operate. This reflects the current frontier in expert-based design: wide models with many small experts, rich per-token mixtures, shared generalists, and robust routing that scales.

Load balancing without auxiliary loss: V3 and V4 both use auxiliary-free load balancing — which means the router learns to distribute tokens evenly across experts without a separate loss term penalizing imbalance. This is a meaningful training stability improvement over earlier MoE designs.
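To make the routing mechanics concrete, here's a minimal sketch of top-k selection with bias-based (auxiliary-loss-free) load balancing. Everything here is illustrative: the function name, the 256-expert count, and the 128-dim hidden state are assumptions for the toy, not DeepSeek's actual implementation.

```python
import numpy as np

def route_token(hidden, expert_keys, bias, k=16):
    """Illustrative top-k MoE router with bias-based (auxiliary-loss-free)
    load balancing: a per-expert bias nudges *selection* toward underused
    experts, but gate weights come from the unbiased affinity scores."""
    scores = expert_keys @ hidden                   # affinity with each expert
    topk = np.argsort(scores + bias)[-k:]           # bias affects selection only
    gates = np.exp(scores[topk] - scores[topk].max())  # stable softmax
    gates /= gates.sum()
    return topk, gates                              # experts to run + mixing weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal(128)                   # token's hidden state (toy size)
expert_keys = rng.standard_normal((256, 128))       # 256 routed experts (toy count)
bias = np.zeros(256)                                # updated outside the gradient path

experts, gates = route_token(hidden, expert_keys, bias)
print(len(experts), round(float(gates.sum()), 6))   # 16 experts, weights sum to 1
```

The key detail is that the bias term only changes which experts get picked, never how their outputs are weighted, which is what lets the router balance load without a separate loss term.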

Training Token Count

DeepSeek-V3 (671B total / 37B active) was trained on 14.8T tokens. This is the verified V3 baseline. V4's training token count has not been officially confirmed, but the architectural improvements — particularly mHC stability and Engram memory — are specifically designed to make trillion-parameter training viable on the same hardware budget.

The mHC paper (published January 1, 2026, co-authored by DeepSeek founder Liang Wenfeng) addresses a specific failure mode at scale: Traditional hyper-connections can expand residual stream width and improve connectivity patterns, but simultaneously undermine the identity mapping principle that makes residual networks trainable — leading to numerical instability that crashes large-scale training runs. The mHC solution projects connection matrices onto a mathematical manifold using the Sinkhorn-Knopp algorithm, controlling signal amplification.

The practical result: V4 can scale to 1 trillion parameters without the training instability that would otherwise require restarts and burn budget.


Parameter Count vs GPT-4o, Claude, Llama 3

Here's the honest comparison table as of February 2026. Note that GPT-4o and Claude's parameter counts are not officially disclosed — the numbers below are widely cited community estimates:

| Model | Total Parameters | Active per Token | Architecture | Context Window | Open Source |
|---|---|---|---|---|---|
| DeepSeek V4 | ~1T (pre-release) | ~32B | MoE + Engram | 1M tokens | Expected Apache 2.0 |
| DeepSeek V3 | 671B | 37B | MoE | 128K → 1M (API) | MIT |
| DeepSeek R1 | 671B | 37B | MoE + RL CoT | 164K | MIT |
| GPT-4o | ~200B est. | ~200B est. | Dense (est.) | 128K | No |
| Claude Sonnet 4.5 | Not disclosed | Not disclosed | Unknown | 200K | No |
| Llama 3.3 70B | 70B | 70B | Dense | 128K | Llama license |
| Llama 4 (est.) | 400B+ | ~17B est. | MoE | 128K+ | Partial |

A few things stand out here that most comparisons miss:

GPT-4o's architecture is not public. OpenAI has never confirmed parameter counts. The ~200B dense estimate comes from community reverse-engineering and is frequently wrong. What we know for certain: GPT-4o features a context window of 128K tokens. Input costs $2.50 per million tokens and output costs $10 per million tokens.

Claude's architecture is similarly undisclosed. The SWE-bench performance is documented — Claude Opus 4.5 leads enterprise coding with an 80.9% SWE-bench score and 54% market share among enterprise developers. The parameter count behind that score is not public.

Llama 3.3 70B is a dense model. All 70B parameters activate per token. That's why it's faster on consumer hardware but costs more per token at scale than comparably-performing MoE alternatives.

The real comparison isn't parameters; it's performance-per-dollar. At current list prices, GPT-4o runs roughly 9x DeepSeek-V3's cost on both input ($2.50 vs $0.27 per million tokens) and output ($10.00 vs $1.10). At matched benchmark performance on HumanEval (~90% for both), that cost delta is the actual competitive story.
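Using the list prices from the API pricing table later in this article, the per-request delta is easy to compute. The 3:1 input:output token mix below is an assumed workload, not a published figure:

```python
# Cost per request at list prices ($ per 1M tokens), assuming a
# hypothetical workload of 750K input + 250K output tokens.
PRICES = {  # (input, output), from the published API pricing
    "deepseek-v3": (0.27, 1.10),
    "gpt-4o": (2.50, 10.00),
}

def cost(model, in_tok=750_000, out_tok=250_000):
    price_in, price_out = PRICES[model]
    return price_in * in_tok / 1e6 + price_out * out_tok / 1e6

v3, gpt = cost("deepseek-v3"), cost("gpt-4o")
print(f"V3: ${v3:.2f}  GPT-4o: ${gpt:.2f}  ratio: {gpt / v3:.1f}x")
```

Change the token mix and the ratio barely moves, because the input and output price gaps are nearly identical.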


Impact on Inference Speed and Cost

This is where the MoE parameter structure has direct practical consequences.

Because only ~37B parameters activate per token (V3 baseline), inference throughput scales differently than a comparably-sized dense model. A 671B dense model would require roughly 18x the compute per token compared to V3's 37B active slice, the same 1-in-18 ratio from the activation math above.

Hardware requirements (V3, confirmed):

  • Full precision (BF16): ~1.3TB VRAM — requires multi-node clusters
  • Q8 quantization: ~670GB — achievable with 8× H100s
  • Q4 quantization: ~336GB — achievable with 4× H100s or dual A100 nodes
  • Distilled variants (32B): single 24GB GPU, near full reasoning quality
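Those figures are straight back-of-envelope weight math (parameters × bits per parameter), which you can verify yourself. This covers weights only and ignores KV cache, activations, and runtime overhead:

```python
# Back-of-envelope weight memory for V3's 671B parameters at
# different precisions. Weights only; no KV cache or activations.
def weight_gb(params, bits):
    return params * bits / 8 / 1e9   # gigabytes (decimal)

PARAMS = 671e9
for name, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name}: {weight_gb(PARAMS, bits):,.0f} GB")
```

BF16 lands at ~1,342 GB (the ~1.3TB figure above), Q8 at 671 GB, and Q4 at ~336 GB, matching the cluster sizes listed.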

V4 projection (pre-release, based on architectural papers):

  • Active parameter slice stays near 32B per token — inference cost stays similar to V3
  • 1T total parameters mean larger VRAM for the full weight set: ~350–400GB at Q4
  • Dual RTX 4090 or single RTX 5090 for quantized inference (consumer tier)
  • Engram's host DRAM offloading changes the calculus: static knowledge goes to system RAM, GPU VRAM focuses on active reasoning

The Engram contribution here is significant. A 100-billion-parameter embedding table is entirely offloaded to host DRAM with throughput penalties below 3%. That means the effective VRAM pressure for trillion-parameter inference is lower than the total weight count implies.

API pricing (current, verified):

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| DeepSeek V3 | $0.27 | $1.10 |
| DeepSeek R1 | $0.55 | $2.19 |
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| Claude Opus 4.5 | $15.00 | $75.00 |

V4 pricing is unconfirmed, but the historical pattern is clear: DeepSeek has priced aggressively, 20–50× cheaper than OpenAI on comparable models.


What Changed V3 → V4

V3 is the confirmed baseline. V4 extends it with three peer-reviewed architectural innovations:

  1. Manifold-Constrained Hyper-Connections (mHC) — published January 1, 2026

Addresses training instability at trillion-parameter scale. Traditional hyperconnections suffer from broken identity mapping and catastrophic signal amplification that reaches gains of 10³ to 10⁵ in deep networks. mHC solves these stability issues, enabling DeepSeek to train larger models more reliably on the same hardware that would otherwise limit capacity.

The practical output: a 4× wider residual stream adds only 6.7% training time overhead — which is how you scale from 671B to 1T without doubling your training budget.
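The Sinkhorn-Knopp step the paper names is worth seeing in miniature. A doubly stochastic matrix (rows and columns each summing to 1) cannot amplify the residual stream, which is the intuition behind the projection. Below is a generic sketch of the algorithm itself, not the paper's actual mHC projection or manifold details:

```python
import numpy as np

def sinkhorn_knopp(m, iters=50):
    """Project a non-negative matrix toward doubly stochastic form by
    alternately normalizing rows and columns (Sinkhorn-Knopp). Once
    every row and column sums to 1, the matrix can't blow up signal
    magnitudes the way an unconstrained connection matrix can."""
    m = m.copy()
    for _ in range(iters):
        m /= m.sum(axis=1, keepdims=True)   # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)   # columns sum to 1
    return m

rng = np.random.default_rng(0)
raw = rng.uniform(0.1, 10.0, size=(4, 4))   # wildly scaled connection weights
ds = sinkhorn_knopp(raw)
print(ds.sum(axis=0).round(3), ds.sum(axis=1).round(3))  # all ~1.0
```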

  2. Engram Conditional Memory — published January 13, 2026 (arXiv:2601.07372)

Separates static knowledge retrieval from dynamic reasoning. Central to V4's breakthrough is the Engram Conditional Memory module, an O(1) lookup system that separates static factual recall from active reasoning. This allows the model to offload syntax and library knowledge to system RAM, preserving precious GPU VRAM for the complex logic required to solve multi-file software engineering tasks.

The benchmark result: Needle-in-a-Haystack accuracy improves from 84.2% to 97% at million-token contexts. That's not a marginal gain — it's the difference between a model that loses the thread in long code reviews and one that doesn't.
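Conceptually, the lookup side of this behaves like a hash table of precomputed embeddings living in host RAM. The sketch below is purely illustrative; the keys, dimensions, and helper names are invented for the example, and DeepSeek's Engram internals differ:

```python
import numpy as np

# Illustrative sketch of conditional-memory lookup: static knowledge
# lives in a hash table in host RAM and is fetched in O(1) average time
# per query, while GPU compute is reserved for dynamic reasoning.
# All names and shapes here are hypothetical.
rng = np.random.default_rng(0)
memory = {  # n-gram -> precomputed embedding, resident in system RAM
    ("import", "numpy"): rng.standard_normal(64),
    ("def", "main"): rng.standard_normal(64),
}

def recall(tokens, dim=64):
    """O(1) average-case dict lookup; zeros mean 'nothing memorized'."""
    return memory.get(tuple(tokens), np.zeros(dim))

hit = recall(["import", "numpy"])    # known pattern: real embedding
miss = recall(["foo", "bar"])        # unknown pattern: zero vector
print(hit.shape, float(miss.sum()))
```

The point of the sketch: a lookup costs a hash, not a matrix multiply, which is why the table can live in slower host DRAM with a throughput penalty reported below 3%.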

  3. Dynamic Sparse Attention (DSA)

Enables the 1M token context window at roughly 50% lower compute than standard attention. DSA achieves this through intelligent sparsity patterns that focus computational resources on the most relevant portions of the context, rather than treating all tokens equally.
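One common sparsity pattern is a sliding window, where each query attends only to its most recent neighbors, cutting score computation from O(n²) toward O(n·window). This toy version illustrates the general idea of sparse attention only; DeepSeek's DSA selects relevant tokens dynamically rather than by a fixed window:

```python
import numpy as np

def sparse_attention(q, k, v, window=64):
    """Toy sliding-window sparse attention: each query attends only to
    the `window` most recent keys instead of all n of them, so score
    computation scales O(n * window) rather than O(n^2)."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)               # start of local window
        scores = k[lo:i + 1] @ q[i] / np.sqrt(d)  # scaled dot-product scores
        w = np.exp(scores - scores.max())         # stable softmax
        w /= w.sum()
        out[i] = w @ v[lo:i + 1]                  # weighted sum of local values
    return out

rng = np.random.default_rng(0)
n, d = 256, 32
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sparse_attention(q, k, v)
print(out.shape)
```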

The combined effect of these three innovations: V4 moves from V3's 128K context ceiling to 1M tokens while keeping per-token inference costs roughly flat, because the active parameter count stays near the V3 baseline (~32–37B per token).

Here's a clean delta view:

| Dimension | DeepSeek V3 | DeepSeek V4 |
|---|---|---|
| Total parameters | 671B | ~1T |
| Active per token | 37B | ~32B |
| Context window | 128K | 1M |
| Memory architecture | Standard KV cache | Engram (O(1) lookup) |
| Training stability | Standard MoE | mHC-stabilized |
| Attention mechanism | Multi-Head Latent Attention | Dynamic Sparse Attention |
| Training tokens | 14.8T | Not confirmed |
| Status | Live, MIT licensed | Pre-release, Q1–Q2 2026 |

One thing worth flagging: the jump from 671B to 1T total parameters does not mean V4 is proportionally more expensive to run. The active slice stays near 32B. What you're getting from the extra ~330B parameters is more specialist coverage — broader knowledge depth across domains — not proportionally higher inference cost.


The Number That Actually Matters

Here's where I land after going through all of this: "1 trillion parameters" is a marketing number. "32 billion active per token" is an engineering number. The second one is what determines your inference bill, your latency, and your hardware requirements.

What makes V4 genuinely interesting isn't the parameter count — it's that the team managed to get from 671B to 1T total parameters while keeping the active compute footprint roughly flat, and added 1M token context on top of it. That's the actual feat.

Whether V4 delivers on those benchmark claims is a question for independent evals that haven't happened yet. The architecture papers are real. The performance numbers are not yet verified. Plan accordingly.

At Macaron, we turn your model decisions into structured, executable workflows — so you're running AI against your actual tasks, not waiting for the next benchmark cycle. Try it free at macaron.im and judge the results yourself.


FAQ

Q: What are DeepSeek V4's total parameters? Based on leaked architecture code and peer-reviewed papers, V4 is expected to have approximately 1 trillion total parameters. This has not been officially confirmed by DeepSeek as of February 28, 2026.

Q: How many parameters are active per token in V4? Approximately 32 billion, based on the "Top-16 routing" MoE structure described in community architecture analysis. V3's verified active count is 37B; V4's architectural changes (particularly finer-grained expert segmentation) bring this slightly lower while increasing total coverage.

Q: Is 1 trillion parameters actually bigger than GPT-4o? In total parameter count, very likely yes — but total parameters don't directly translate to inference quality or cost. GPT-4o's architecture is undisclosed, so direct comparisons are uncertain. What matters more is active parameters per token (V4: ~32B) and benchmark performance, which won't be verifiable until independent evals drop post-launch.

Q: What's the VRAM requirement for V4 locally? Full precision (BF16) for ~1T parameters would be on the order of 2TB; quantized (Q4/INT4), the estimate drops to roughly 350–400GB, and Engram's DRAM offloading reduces GPU VRAM pressure further, which is what makes the consumer-tier claims (dual RTX 4090s or a single RTX 5090) even theoretically plausible. Expect distilled variants (14B–33B) shortly after release for single-GPU deployment.

Q: Does V4 use the same MoE architecture as V3? Same foundational approach, meaningfully extended. Both use sparse routing through expert subsets. V4 adds finer-grained expert segmentation (more, smaller experts), the mHC training stabilizer, Engram memory as a complementary sparsity axis, and Dynamic Sparse Attention for the 1M token context. Think of it as V3's MoE with three additional architectural layers on top.

Q: Why does Engram matter for the parameter story? Engram offloads static knowledge (syntax, library patterns, factual recall) to a hash-based lookup system that doesn't require GPU computation. This means a larger fraction of V4's 1T parameters serve as a retrievable knowledge store rather than active reasoning weights — which is part of why inference costs stay near V3 levels despite the total parameter jump.

Q: When is V4 officially releasing? DeepSeek's mid-February 2026 target window passed without a release. The February 11 silent upgrade to 1M context windows in the production API is the most concrete V4 signal to date. Community consensus now points to Q1–Q2 2026, with no confirmed new date from DeepSeek.
