
What's up, model architecture nerds — if you've been staring at the "1 trillion parameters" headlines and wondering whether that number actually means anything for your workflow, this one's for you.
I'm Hanks. I test AI infrastructure inside real tasks, not slide decks. And right now, the DeepSeek parameter story is one of the most misread numbers in the space.
Here's the question I kept asking myself: if a model has 671 billion parameters but only uses 37 billion per token, what are you actually paying for — and what does that mean when V4 scales to a trillion?
Turns out, everything. Let me break it down.

Most AI coverage treats "parameter count" as a single number. It isn't. For Mixture-of-Experts models like DeepSeek, there are two numbers that actually matter: total parameters (every weight the model stores) and active parameters (the subset actually used to process each token).
These are not the same thing, and confusing them leads to completely wrong conclusions about cost, speed, and hardware requirements.
Here's the clearest way I've found to think about it: total parameters determine what the model knows. Active parameters determine what it costs to think.
A model with 671B total but only 37B active is not a 671B model in any practical inference sense. It's a 37B model with 671B worth of specialist knowledge available on demand. DeepSeek-V3, with 671B total parameters and 37B active per token, achieves top-tier performance across multiple benchmarks at a total training cost of only 2.788 million H800 GPU hours.
That cost figure is the important number. It's what makes the MoE architecture economically viable at this scale — and it's the foundation V4 is built on.
For V3 (the confirmed baseline), the activation ratio is 37B ÷ 671B = 5.5%. Only 1 in 18 parameters fires per token. That's the engineering feat.
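The arithmetic is trivial, but it's worth writing down because every cost estimate below hangs on it:

```python
# Activation ratio for DeepSeek-V3 (confirmed baseline figures).
TOTAL_PARAMS_B = 671   # total parameters, in billions
ACTIVE_PARAMS_B = 37   # parameters activated per token, in billions

ratio = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Activation ratio: {ratio:.1%}")                      # Activation ratio: 5.5%
print(f"Roughly 1 in {round(1 / ratio)} parameters fires")   # Roughly 1 in 18 parameters fires
```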

The Mixture-of-Experts architecture routes each token through a small subset of specialist networks — called "experts" — instead of passing it through the full model every time. Dense models activate all of their weights on every token. MoE models like DeepSeek, by contrast, only activate a small subset of parameters per token — typically well under 10%.
For V4, the architecture extends this with a "Top-16" routing strategy. DeepSeek V4 is a technical marvel of sparse architecture, utilizing a massive 1-trillion parameter total count while only activating approximately 32 billion parameters for any given token. This "Top-16" routed MoE strategy allows the model to maintain the specialized knowledge of a titan-class system without the crippling latency or hardware requirements usually associated with models of this scale.
A few things worth unpacking in that:
Why 16 experts per token? Routing through more experts increases answer quality but costs more compute. Routing through fewer is faster but narrower. 16 is DeepSeek's current sweet spot — wide enough for complex reasoning, efficient enough for production throughput.
What are "shared generalists"? Not all experts are specialists. V4's MoE stack includes shared expert layers that activate for every token, functioning like a base reasoning layer on top of which routed specialists operate. The stack reflects the current frontier in expert-based design: wide models with many small experts, rich per-token mixtures, shared generalists, and robust routing that scales.
Load balancing without auxiliary loss: V3 and V4 both use auxiliary-free load balancing — which means the router learns to distribute tokens evenly across experts without a separate loss term penalizing imbalance. This is a meaningful training stability improvement over earlier MoE designs.
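Putting these three ideas together, here's a minimal routing sketch in plain Python: score the experts, shift the scores with a per-expert bias for selection only (the auxiliary-loss-free balancing trick), and mix the top-k using gates computed from the unshifted scores. This is an illustration of the mechanism, not DeepSeek's implementation; in the real system the bias values are nudged up or down during training based on each expert's load, and the gating math differs in detail.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route_token(scores, bias, k=16):
    # Bias shifts *selection* only; the mixing gates come from the raw
    # scores, so load balancing never distorts the model's output.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([scores[i] for i in chosen])
    return list(zip(chosen, gates))

random.seed(0)
n_experts = 64                            # "many small experts" design
scores = [random.gauss(0.0, 1.0) for _ in range(n_experts)]
bias = [0.0] * n_experts                  # adjusted per expert during training
picks = route_token(scores, bias, k=16)   # shared generalists would bypass this
print(len(picks), "experts routed; gates sum to", round(sum(g for _, g in picks), 6))
```

Shared generalist experts sit outside this router entirely: they run for every token, and the routed specialists add on top of them.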
DeepSeek-V3 (671B total / 37B active) was trained on 14.8T tokens. This is the verified V3 baseline. V4's training token count has not been officially confirmed, but the architectural improvements — particularly mHC stability and Engram memory — are specifically designed to make trillion-parameter training viable on the same hardware budget.
The mHC paper (published January 1, 2026, co-authored by DeepSeek founder Liang Wenfeng) addresses a specific failure mode at scale: traditional hyper-connections can widen the residual stream and improve connectivity patterns, but they simultaneously undermine the identity-mapping principle that makes residual networks trainable, leading to numerical instability that crashes large-scale training runs. The mHC solution projects connection matrices onto a mathematical manifold using the Sinkhorn-Knopp algorithm, keeping signal amplification under control.
The practical result: V4 can scale to 1 trillion parameters without the training instability that would otherwise require restarts and burn budget.
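The Sinkhorn-Knopp step is the heart of that fix: alternately normalizing rows and columns drives a positive matrix toward doubly-stochastic form, which bounds how much any path through the connection matrix can amplify a signal. Here's a minimal sketch of the algorithm itself, not mHC's actual projection code:

```python
def sinkhorn_knopp(m, iters=50):
    """Alternately normalize rows and columns of a positive matrix so each
    sums to 1, converging toward a doubly-stochastic matrix (bounded gain)."""
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]  # rows -> 1
        cols = [sum(m[r][c] for r in range(len(m))) for c in range(len(m[0]))]
        m = [[m[r][c] / cols[c] for c in range(len(m[0]))] for r in range(len(m))]
    return m

m = [[4.0, 1.0], [2.0, 3.0]]
ds = sinkhorn_knopp(m)
print([round(sum(row), 6) for row in ds])  # row sums converge to [1.0, 1.0]
```

Because every row and column of the projected matrix sums to 1, repeated application across layers can't compound into the 10³–10⁵ amplification that kills deep training runs.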

Here's the honest comparison table as of February 2026. Note that GPT-4o and Claude's parameter counts are not officially disclosed — the numbers below are widely cited community estimates:
A few things stand out here that most comparisons miss:
GPT-4o's architecture is not public. OpenAI has never confirmed parameter counts. The ~200B dense estimate comes from community reverse-engineering and should be treated as unreliable. What we know for certain: GPT-4o offers a 128K-token context window, with input at $2.50 per million tokens and output at $10 per million tokens.
Claude's architecture is similarly undisclosed. The SWE-bench performance is documented — Claude Opus 4.5 leads enterprise coding with an 80.9% SWE-bench score and 54% market share among enterprise developers. The parameter count behind that score is not public.
Llama 3.3 70B is a dense model. All 70B parameters activate per token. That's why it's faster on consumer hardware but costs more per token at scale than comparably-performing MoE alternatives.
The real comparison isn't parameters — it's performance-per-dollar. GPT-4o is roughly 29.8× more expensive than DeepSeek-V3 across input and output tokens. At matched benchmark performance on HumanEval (~90% for both), that cost delta is the actual competitive story.
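To make the delta concrete, here's a back-of-envelope cost sketch using only the figures quoted above: GPT-4o's published $2.50/$10 per-million-token rates and the cited ~29.8× multiple. The workload size is a made-up example:

```python
# Figures from the comparison above; the workload volumes are illustrative.
GPT4O_IN, GPT4O_OUT = 2.50, 10.00   # $ per million tokens
MULTIPLE = 29.8                      # cited GPT-4o cost multiple vs DeepSeek-V3

def workload_cost(in_millions, out_millions, rate_in, rate_out):
    """Dollar cost for a workload measured in millions of tokens."""
    return in_millions * rate_in + out_millions * rate_out

gpt4o = workload_cost(100, 20, GPT4O_IN, GPT4O_OUT)  # 100M in, 20M out
deepseek = gpt4o / MULTIPLE                          # implied by the cited multiple
print(f"GPT-4o: ${gpt4o:,.2f} vs implied DeepSeek-V3: ${deepseek:,.2f}")
```

At matched benchmark scores, a bill of roughly $450 versus roughly $15 for the same token volume is the story the parameter headlines bury.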
This is where the MoE parameter structure has direct practical consequences.
Because only ~37B parameters activate per token (V3 baseline), inference throughput scales differently than a comparably sized dense model. A 671B dense model would require roughly 18× the compute per token of V3's 37B active slice (671 ÷ 37 ≈ 18).
Hardware requirements (V3, confirmed):
V4 projection (pre-release, based on architectural papers):
The Engram contribution here is significant. A 100-billion-parameter embedding table is entirely offloaded to host DRAM with throughput penalties below 3%. That means the effective VRAM pressure for trillion-parameter inference is lower than the total weight count implies.
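To make the offloading idea concrete, here's a toy sketch: keep the big embedding table in ordinary host memory and stage only the rows a batch actually touches. The sizes and function names are illustrative, not Engram's real interface:

```python
import random

random.seed(0)
VOCAB, DIM = 100_000, 8   # toy sizes; Engram's real table is ~100B parameters

# The full table lives in host DRAM, never on the accelerator.
table = [[random.random() for _ in range(DIM)] for _ in range(VOCAB)]

def gather_for_device(token_ids):
    """O(1) lookup per token; returns only the slice worth shipping to the GPU."""
    return [table[t] for t in token_ids]

batch = [3, 17, 99_999]
device_slice = gather_for_device(batch)
print(len(device_slice), "rows staged for the device, out of", VOCAB)
```

VRAM pressure scales with the handful of rows a batch touches, not the full table, which is why the sub-3% throughput penalty is plausible despite the table's size.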
API pricing (current, verified):
V4 pricing is unconfirmed, but DeepSeek has historically priced aggressively: 20–50× cheaper than OpenAI on comparable models.

V3 is the confirmed baseline. V4 extends it with three peer-reviewed architectural innovations:
mHC addresses training instability at trillion-parameter scale. Traditional hyper-connections suffer from broken identity mapping and catastrophic signal amplification, with gains reaching 10³ to 10⁵ in deep networks. mHC resolves these stability issues, letting DeepSeek train larger models reliably on hardware that would otherwise cap capacity.
The practical output: a 4× wider residual stream adds only 6.7% training time overhead — which is how you scale from 671B to 1T without doubling your training budget.
Engram separates static knowledge retrieval from dynamic reasoning. Central to V4's breakthrough, the Engram Conditional Memory module is an O(1) lookup system that offloads static factual recall (syntax, library patterns) to system RAM, preserving GPU VRAM for the complex logic required to solve multi-file software engineering tasks.
The benchmark result: Needle-in-a-Haystack accuracy improves from 84.2% to 97% at million-token contexts. That's not a marginal gain — it's the difference between a model that loses the thread in long code reviews and one that doesn't.
Dynamic Sparse Attention (DSA) enables the 1M-token context window at roughly 50% lower compute than standard attention. DSA achieves this through intelligent sparsity patterns that concentrate computational resources on the most relevant portions of the context rather than treating all tokens equally.
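A heavily simplified sketch of the sparsity idea: score every key cheaply, then run the expensive softmax-and-mix step over only the top-k most relevant ones. DSA's real block-level selection and kernels are far more sophisticated, and in the real design the selection stage is itself much cheaper than full attention; this toy only shows where the mixing compute gets skipped:

```python
import math

def sparse_attention(q, keys, values, k=4):
    """Attend over only the top-k highest-scoring keys instead of all of them."""
    d = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / d for key in keys]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    m = max(scores[i] for i in top)
    w = [math.exp(scores[i] - m) for i in top]
    z = sum(w)
    out = [0.0] * len(values[0])
    for weight, i in zip(w, top):           # softmax-weighted mix, top-k only
        for dim, v in enumerate(values[i]):
            out[dim] += (weight / z) * v
    return out, top

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
vals = [[float(i), 0.0] for i in range(5)]
out, selected = sparse_attention(q, keys, vals, k=2)
print("attended over", len(selected), "of", len(keys), "keys")
```

At a 1M-token context, skipping the mixing step for the vast majority of positions is where the claimed ~50% compute saving comes from.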
The combined effect of these three innovations: V4 moves from V3's 128K context ceiling to 1M tokens while keeping per-token inference costs roughly flat, because the active parameter count stays near the V3 baseline (~32–37B per token).
Here's a clean delta view:
One thing worth flagging: the jump from 671B to 1T total parameters does not mean V4 is proportionally more expensive to run. The active slice stays near 32B. What you're getting from the extra ~330B parameters is more specialist coverage — broader knowledge depth across domains — not proportionally higher inference cost.
Here's where I land after going through all of this: "1 trillion parameters" is a marketing number. "32 billion active per token" is an engineering number. The second one is what determines your inference bill, your latency, and your hardware requirements.
What makes V4 genuinely interesting isn't the parameter count — it's that the team managed to get from 671B to 1T total parameters while keeping the active compute footprint roughly flat, and added 1M token context on top of it. That's the actual feat.
Whether V4 delivers on those benchmark claims is a question for independent evals that haven't happened yet. The architecture papers are real. The performance numbers are not yet verified. Plan accordingly.
At Macaron, we turn your model decisions into structured, executable workflows — so you're running AI against your actual tasks, not waiting for the next benchmark cycle. Try it free at macaron.im and judge the results yourself.
Q: What are DeepSeek V4's total parameters? Based on leaked architecture code and peer-reviewed papers, V4 is expected to have approximately 1 trillion total parameters. This has not been officially confirmed by DeepSeek as of February 28, 2026.
Q: How many parameters are active per token in V4? Approximately 32 billion, based on the "Top-16 routing" MoE structure described in community architecture analysis. V3's verified active count is 37B; V4's architectural changes (particularly finer-grained expert segmentation) bring this slightly lower while increasing total coverage.
Q: Is 1 trillion parameters actually bigger than GPT-4o? In total parameter count, very likely yes — but total parameters don't directly translate to inference quality or cost. GPT-4o's architecture is undisclosed, so direct comparisons are uncertain. What matters more is active parameters per token (V4: ~32B) and benchmark performance, which won't be verifiable until independent evals drop post-launch.
Q: What's the VRAM requirement for V4 locally? Full precision requires estimated 350–400GB VRAM. Quantized (Q4/INT4) drops significantly and is theoretically runnable on dual RTX 4090s or a single RTX 5090. Engram's DRAM offloading reduces GPU VRAM pressure further. Expect distilled variants (14B–33B) shortly after release for single-GPU deployment.
Q: Does V4 use the same MoE architecture as V3? Same foundational approach, meaningfully extended. Both use sparse routing through expert subsets. V4 adds finer-grained expert segmentation (more, smaller experts), the mHC training stabilizer, Engram memory as a complementary sparsity axis, and Dynamic Sparse Attention for the 1M token context. Think of it as V3's MoE with three additional architectural layers on top.
Q: Why does Engram matter for the parameter story? Engram offloads static knowledge (syntax, library patterns, factual recall) to a hash-based lookup system that doesn't require GPU computation. This means a larger fraction of V4's 1T parameters serve as a retrievable knowledge store rather than active reasoning weights — which is part of why inference costs stay near V3 levels despite the total parameter jump.
Q: When is V4 officially releasing? DeepSeek's mid-February 2026 target window passed without a release. The February 11 silent upgrade to 1M context windows in the production API is the most concrete V4 signal to date. Community consensus now points to Q1–Q2 2026, with no confirmed new date from DeepSeek.