
What's up, architecture nerds — if your first move when a new model drops is to read the technical report before you touch the demo, this is your article.
I'm Hanks. I've been pulling apart LLM architectures for a few years now — not to write about specs, but because understanding the internals tells you where the model will break before you hit it in production. With DeepSeek, that habit has been especially rewarding. The gap between what the benchmarks claim and what the architecture can actually sustain is where you find the real limits.
This is a deep-dive into how DeepSeek V3 actually works — and what the V4 architecture papers tell us is changing. We'll cover MoE routing, Multi-Head Latent Attention, training scale, the training cost debate, and what all of this means if you're planning to fine-tune.

Before going deep, here's the skeleton. DeepSeek V3 (the current production model, base of the V4 lineage) is a decoder-only Transformer with two non-standard components replacing the usual dense attention and feedforward blocks:
```
Input Tokens
      │
      ▼
Embedding Layer
      │
      ▼
┌─────────────────────────────────┐
│  Transformer Block × N layers   │
│                                 │
│  ┌──────────────────────────┐   │
│  │ Multi-Head Latent        │   │ ← replaces standard MHA;
│  │ Attention (MLA)          │   │   compresses the KV cache
│  └──────────────────────────┘   │
│               │                 │
│  ┌──────────────────────────┐   │
│  │ DeepSeekMoE              │   │ ← replaces the dense FFN;
│  │ (shared + routed         │   │   activates 37B of 671B
│  │  experts, top-K routing) │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
      │
      ▼
Multi-Token Prediction Head (MTP)   ← auxiliary training objective
      │
      ▼
Output Logits
```
The two critical innovations — MLA and DeepSeekMoE — were first validated in DeepSeek V2, then scaled up in V3. V4 retains both and adds three new components on top: Manifold-Constrained Hyper-Connections (mHC), Engram Conditional Memory, and an upgraded DeepSeek Sparse Attention (DSA) with a Lightning Indexer.
MoE is the architecture choice that makes DeepSeek's scale/cost ratio possible. Here's the core idea: instead of every token passing through every parameter in the network, a routing mechanism selects a subset of "experts" — specialized feedforward subnetworks — to activate for each token.
DeepSeek V3 has 671B total parameters, but only 37B are activated per token. That ratio — roughly 5.5% of parameters active at any given moment — is the reason inference is fast and cheap relative to a dense model of comparable capability.
V3's DeepSeekMoE architecture includes 256 routed experts plus shared experts, with a sigmoid gating mechanism and learnable expert bias for routing.
The routing logic per token works as follows:
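Here's a minimal NumPy sketch of that per-token routing step, assuming V3's reported configuration of 256 routed experts with sigmoid gating, a learnable selection bias, and top-8 selection. The dimensions and initialization here are toy values for illustration, not the production implementation:

```python
import numpy as np

def route_token(h, W_gate, expert_bias, top_k=8):
    """Sketch of DeepSeek-V3-style expert routing for one token.

    h: (d_model,) hidden state; W_gate: (n_experts, d_model) expert
    centroids; expert_bias: (n_experts,) learnable load-balancing bias.
    """
    # 1. Sigmoid affinity score between the token and each expert.
    scores = 1.0 / (1.0 + np.exp(-(W_gate @ h)))        # (n_experts,)
    # 2. The bias influences WHICH experts are selected...
    top = np.argsort(scores + expert_bias)[-top_k:]
    # 3. ...but the gating weights use the unbiased scores,
    #    renormalized over the selected experts only.
    gates = scores[top] / scores[top].sum()
    return top, gates

rng = np.random.default_rng(0)
n_experts, d_model = 256, 64
h = rng.standard_normal(d_model)
W_gate = rng.standard_normal((n_experts, d_model)) * 0.1
top, gates = route_token(h, W_gate, np.zeros(n_experts))
print(len(top), round(gates.sum(), 6))   # 8 experts, gates sum to 1
```

The token's output is then the gate-weighted sum of the selected experts' outputs, plus the always-active shared expert(s).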
One problem in MoE routing is load imbalance — if all tokens keep picking the same few popular experts, you get routing collapse and wasted capacity. DeepSeek's solution is a technique called auxiliary-loss-free load balancing.
Rather than adding an auxiliary loss to penalize imbalanced routing (which degrades model quality), V3 adds a learnable bias term to each expert's affinity score. During training, the bias for overloaded experts is decreased and for underloaded experts is increased, dynamically steering routing toward balance without touching the loss function.
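The update rule itself is simple. Here is an illustrative sketch of one balancing step; the update speed `gamma` is a hyperparameter, and the exact bookkeeping in DeepSeek's training code may differ:

```python
import numpy as np

def update_expert_bias(expert_bias, tokens_per_expert, gamma=0.001):
    """One auxiliary-loss-free balancing step (illustrative).

    Overloaded experts get their bias nudged down, underloaded
    experts up, steering future top-K selection toward balance
    without adding any term to the training loss."""
    target = tokens_per_expert.mean()
    # sign(): subtract gamma if overloaded, add gamma if underloaded
    return expert_bias - gamma * np.sign(tokens_per_expert - target)

bias = np.zeros(4)
load = np.array([900, 100, 450, 550])  # tokens routed per expert this batch
bias = update_expert_bias(bias, load)
print(bias)  # hot expert's bias goes negative, cold expert's positive
```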
In deployment, DeepSeek also runs redundant experts — during inference, high-load experts are detected and duplicated across GPUs. For V3's prefilling stage, 32 redundant experts are deployed, allowing each GPU to host one additional copy of a hot expert beyond its normal 8, balancing throughput without increasing cross-node communication overhead.
The math is straightforward:
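As a back-of-the-envelope check, using the parameter counts from above (per-token inference FLOPs scale with active parameters, roughly 2 FLOPs per parameter, so the dense-equivalent comparison is a simplification):

```python
total_params = 671e9    # V3 total parameters
active_params = 37e9    # parameters activated per token
active_fraction = active_params / total_params

print(f"{active_fraction:.1%} of parameters active per token")
# roughly 5.5%

# Relative to a hypothetical dense model of the same total size:
print(f"~{total_params / active_params:.0f}x less compute per token")
# roughly 18x
```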
You get a model that behaves like a massive dense network — because it's seen all 671B parameters during training — but pays the compute cost of a ~37B model at inference time. That's why DeepSeek can price its API at $0.28/M input tokens (cache miss) while comparable dense models cost 10× more.
V4 reportedly goes further: despite V4's total parameter count growing ~50% over V3 to ~1T, active parameters per token are estimated at ~32B — lower than V3's 37B. This means V4 should be cheaper to run per token than V3 while being significantly more capable.

Standard multi-head attention has a KV cache problem. For every token generated, the model stores key and value vectors for every previous token in every attention head. At 128K tokens with many layers, that's tens of gigabytes of GPU memory — and it scales linearly with sequence length.
MLA addresses this through joint low-rank compression of keys and values. Rather than caching full-size K and V tensors, MLA compresses them into a lower-dimensional latent space before storage. At inference time, the compressed tensors are projected back to their original dimensions before use. Only the small compressed vectors need to be cached — not the full-size K/V matrices.
The result, as noted in the DeepSeek-V3 technical report: MLA reduces KV cache memory by 93.3% compared to standard multi-head attention. For long-context inference, this is the difference between a model that can sustain 128K context in production versus one that runs out of GPU memory before getting there.
The precise formulation:

```
For token t with hidden state h_t:

Standard MHA:
    K_t = W_K · h_t         (cache full K_t per layer)
    V_t = W_V · h_t         (cache full V_t per layer)

MLA:
    c_t = W_down · h_t      (compress to low-rank latent c_t;
                             cache only c_t — much smaller)
    At attention time:
    K_t = W_K_up · c_t      (project back up)
    V_t = W_V_up · c_t      (project back up)
```
One wrinkle: RoPE (Rotary Position Embedding) needs to be applied to keys for positional information. MLA handles this with a separate decoupled key component (W_KR) that carries the rotational position encoding, leaving the main compressed vector position-agnostic and thus fully cacheable.
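To make the compression concrete, here's a NumPy sketch with toy dimensions. These are illustrative sizes, not V3's real ones (the report uses a much larger model dimension and a 512-dim KV latent), and this omits the decoupled RoPE key path described above:

```python
import numpy as np

# Toy dimensions for illustration only
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 96

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_latent, d_model)) * 0.02
W_K_up = rng.standard_normal((n_heads * d_head, d_latent)) * 0.02
W_V_up = rng.standard_normal((n_heads * d_head, d_latent)) * 0.02

h_t = rng.standard_normal(d_model)     # hidden state for one token

# Cache step: store only the compressed latent.
c_t = W_down @ h_t                     # (d_latent,)

# Attention step: reconstruct per-head K and V from the latent.
K_t = (W_K_up @ c_t).reshape(n_heads, d_head)
V_t = (W_V_up @ c_t).reshape(n_heads, d_head)

standard_cache = 2 * n_heads * d_head  # K + V dims per token per layer
mla_cache = d_latent
print(f"cache per token: {standard_cache} -> {mla_cache} dims "
      f"({1 - mla_cache / standard_cache:.1%} smaller)")
```

Even at these toy sizes, the cached footprint drops by more than an order of magnitude; the exact 93.3% figure depends on V3's real dimensions, including the decoupled RoPE key.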
For V4, the Engram Conditional Memory module builds on top of MLA by adding a separate O(1) hash-based retrieval layer for static factual lookups — keeping the MLA attention budget free for complex reasoning rather than spending it on simple fact recall.
DeepSeek V3 was pre-trained on 14.8 trillion diverse, high-quality tokens. The composition spans web text, code, books, and academic content — with deliberate emphasis on mathematical and programming data in later training stages.
Post-pre-training, V3 goes through two stages: supervised fine-tuning (SFT) on curated instruction data, followed by reinforcement learning to align outputs with preferences and verifiable rewards.
The post-training stage includes a notable technique: DeepSeek distilled reasoning capabilities from one of the DeepSeek R1 series reasoning models into V3 during post-training. The pipeline incorporates verification and reflection patterns from R1 to improve V3's reasoning performance while maintaining control over output style and length.
For V4, training data specifics haven't been published. Given the jump to ~1T parameters and Engram's O(1) hash embeddings covering n-gram patterns, the training corpus almost certainly exceeds V3's 14.8T tokens — but this remains unconfirmed.
This is the number that melted Nvidia's market cap in January 2025. Let's get the facts right.
What DeepSeek actually claimed:
At the reported rental price of $2/GPU hour, the full V3 training run (2.788M H800 GPU hours) cost approximately $5.576M total: $5.328M for pre-training, $0.238M for context extension, and $0.01M for post-training.
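Reconstructing that arithmetic from the report's GPU-hour breakdown:

```python
rate = 2.0                              # $/GPU-hour, reported rental price
gpu_hours = {
    "pre-training": 2_664_000,          # H800 GPU hours, per the V3 report
    "context extension": 119_000,
    "post-training": 5_000,
}
costs = {stage: hours * rate for stage, hours in gpu_hours.items()}
total = sum(costs.values())
print(f"total: ${total / 1e6:.3f}M")    # → total: $5.576M
```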
What that figure excludes:
The published figure covers only the final training run. Hardware purchase (the 2,048-GPU H800 cluster) is estimated at north of $51 million. Data acquisition, data cleaning, research and development, and failed experiments are not included.
The accurate comparison:
The real insight isn't "$5.6M vs $100M." It's that DeepSeek trained a GPT-4-class model using H800s — the export-restricted, downgraded version of the H100 — by optimizing the hell out of their training infrastructure. This included FP8 mixed-precision training to cut compute overhead, DualPipe for pipeline parallelism with reduced bubbles, and custom all-to-all communication kernels to work around H800's inferior interconnect bandwidth compared to H100s.
The efficiency gap versus Llama 3 is the cleaner comparison: Llama 3 405B used 30.84M GPU hours vs DeepSeek V3's 2.788M — roughly 11× more compute for a comparable-scale model.

V4 (expected Q1 2026, not yet officially released as of March 2026) adds three published architectural innovations on top of V3's MLA + DeepSeekMoE foundation:
Traditional residual connections allow each layer's output to be added back to the residual stream. Hyper-connections widen this: they allow information to flow across multiple layers simultaneously, increasing connectivity without adding depth. The problem at trillion-parameter scale is numerical instability — unconstrained hyper-connections can amplify signals by 3,000× across layers, crashing training runs.
mHC solves this using the Sinkhorn-Knopp algorithm to project connection matrices onto a mathematical manifold, constraining signal amplification to 1.6× versus 3,000× with unconstrained methods. The practical result: a 4× wider residual stream adds only 6.7% training time overhead.
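The Sinkhorn-Knopp algorithm itself is easy to show: it turns a positive matrix into a doubly stochastic one (all rows and columns sum to 1) by alternately normalizing rows and columns. This standalone sketch illustrates the projection idea; mHC's actual integration into the training loop is more involved:

```python
import numpy as np

def sinkhorn_knopp(M, n_iters=50):
    """Project a positive matrix toward doubly stochastic form by
    alternately normalizing rows and columns. A doubly stochastic
    mixing matrix cannot blow up the residual stream: each output
    is a weighted average of inputs, so signal norms stay bounded."""
    M = np.abs(M) + 1e-9
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

rng = np.random.default_rng(0)
C = sinkhorn_knopp(rng.random((4, 4)))
print(C.sum(axis=0).round(6))   # columns ~1
print(C.sum(axis=1).round(6))   # rows ~1
```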
Engram, introduced in a January 13, 2026 research paper, separates static knowledge retrieval from dynamic reasoning. Static facts (entity names, fixed phrases, lookup patterns) are stored in an O(1) hash-based memory layer. Dynamic reasoning continues through the MoE attention stack as normal.
The paper identifies a Sparsity Allocation Law: under a fixed sparse parameter budget, the optimal split is approximately 20–25% memory (Engram) and 75–80% computation (MoE). Allocating less than 20% to memory causes the model to waste compute rediscovering patterns; allocating more than 25% starves the reasoning capacity.
For practical deployments, this means the model can pull a factual lookup from memory without burning attention head capacity — leaving the expensive reasoning stack for tasks that actually require it.
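To illustrate the lookup mechanism, here is a toy hash-based n-gram memory. This is a sketch of the general idea (constant-time hashed retrieval of static patterns), not DeepSeek's published design; the bucket count, n-gram size, and hash choice are all made up:

```python
import hashlib
import numpy as np

class EngramSketch:
    """Toy hash-based n-gram memory: O(1) lookup per position,
    no search over context, no external retrieval system."""

    def __init__(self, n_buckets=1 << 16, d_embed=64, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable embedding table, indexed by n-gram hash bucket
        self.table = rng.standard_normal((n_buckets, d_embed)) * 0.02
        self.n_buckets = n_buckets

    def _bucket(self, ngram):
        h = hashlib.blake2b(" ".join(ngram).encode(), digest_size=8)
        return int.from_bytes(h.digest(), "little") % self.n_buckets

    def lookup(self, tokens, n=2):
        """Hash the trailing n-gram, index the table. Constant time."""
        if len(tokens) < n:
            return None
        return self.table[self._bucket(tuple(tokens[-n:]))]

mem = EngramSketch()
vec = mem.lookup(["the", "speed", "of", "light"])
print(vec.shape)   # (64,)
```

The retrieved vector would be mixed into the residual stream, so a recognized static pattern contributes its stored representation without consuming any attention budget.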
DSA reduces computational overhead for long-context processing by approximately 50% compared to standard attention, using intelligent sparsity patterns that focus resources on the most relevant tokens rather than attending uniformly across the full context.
V4's upgraded version adds a Lightning Indexer — a fast preprocessing step for million-token context that selects high-value token subsets before the main attention computation runs.
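The two-stage shape of indexer-then-sparse-attention can be sketched as follows. This is an illustration of the general pattern (cheap scoring pass selects candidates, exact attention runs over the subset), not DeepSeek's implementation; the low-dimensional index keys and all sizes are toy assumptions:

```python
import numpy as np

def sparse_attention(q, K, V, k_index, top_k=64):
    """Two-stage sparse attention sketch.

    q: (d,) query; K, V: (T, d) full keys/values;
    k_index: (T, d_small) cheap low-dim index keys for the pre-pass."""
    # Stage 1 (indexer): score every token cheaply against a
    # truncated query, keep only the top_k candidates.
    idx_scores = k_index @ q[: k_index.shape[1]]
    keep = np.argsort(idx_scores)[-top_k:]
    # Stage 2: exact softmax attention over the selected subset only.
    logits = K[keep] @ q / np.sqrt(len(q))
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[keep]

rng = np.random.default_rng(0)
T, d, d_small = 4096, 128, 16
out = sparse_attention(rng.standard_normal(d),
                       rng.standard_normal((T, d)),
                       rng.standard_normal((T, d)),
                       rng.standard_normal((T, d_small)))
print(out.shape)   # attention over 64 of 4096 tokens
```

Stage 2's cost scales with `top_k` rather than the full context length, which is where the claimed savings at million-token scale would come from.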
V3 → V4 summary table:

| | V3 | V4 (reported) |
|---|---|---|
| Total parameters | 671B | ~1T |
| Active per token | 37B | ~32B |
| Attention | MLA | MLA + DSA with Lightning Indexer |
| Residual stream | Standard residual connections | Manifold-Constrained Hyper-Connections (mHC) |
| Static knowledge | In-weights only | Engram O(1) conditional memory |
| Feedforward | DeepSeekMoE (shared + 256 routed experts) | DeepSeekMoE (retained) |
V3's open weights under MIT license changed what's possible for fine-tuning a frontier-class model. V4 is expected to follow the same pattern.
The core challenge with full fine-tuning:
The full 671B V3 model, even in FP8 format, occupies approximately 671GB of GPU memory — exceeding the 640GB memory of a single H100 host. V3 also has 256 different MoE experts, which adds routing complexity during training that standard fine-tuning frameworks aren't optimized for.
LoRA is the practical path:
LoRA (Low-Rank Adaptation) freezes the base model weights and injects small trainable adapter matrices into the attention layers. For DeepSeek V3, the target modules are q_proj, k_proj, v_proj, and o_proj — the same attention projection matrices as in standard Transformer architectures.
A minimal LoRA config for DeepSeek V3/V4:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=32,                  # Adapter rank — higher = more capacity, slower
    lora_alpha=64,         # Scaling factor (typically 2× r)
    target_modules=[
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    load_in_4bit=True,     # 4-bit quantization for memory efficiency
    device_map="auto",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~83M || all params: ~671B || trainable%: ~0.012%
```
The FP8 precision trap:
DeepSeek V3 is natively trained and designed for FP8 serving. LoRA weights, however, are typically assumed to be in unquantized bfloat16 at inference time. Merging LoRA weights trained in bfloat16 and then serving in FP8 introduces a significant accuracy penalty — even though the base model was trained natively in FP8. The solution is Quantization-Aware Training (QAT), which simulates FP8 numerics during LoRA training to align training and serving precision.
Fine-tuning tiers by resource budget:

- Single consumer GPU (e.g., RTX 4090): LoRA with 4-bit quantization on the distilled 7B/14B variants.
- 8× H100 node: LoRA on the full 671B V3, with the base model quantized.
- Multi-node cluster: full-parameter fine-tuning, which also requires tooling that handles V3's 256-expert MoE routing.
For most teams, the practical path is: fine-tune on a DeepSeek distilled model (7B or 14B) for behavior changes, validate the adapted outputs, then evaluate whether the improvement holds at V3/V4 full-model scale. The distilled models share the same architectural family and the behavior transfer is usually strong.
At Macaron, the workflows that break most reliably on long-context models aren't the ones that hit the token limit — they're the ones where the model loses the thread across a long conversation, and you have to re-feed context you've already established. If you're building on top of DeepSeek V3 or planning for V4 and want a memory layer that persists relevant context between sessions without re-injecting your full document set on every call, that's exactly the problem Macaron is built for — try running a real multi-session task and see if the context holds.
Q: What is the core architectural difference between DeepSeek V3 and a dense model like GPT-4? DeepSeek V3 uses Mixture-of-Experts (MoE) instead of a dense feedforward layer. Where a dense model activates all parameters for every token, V3 routes each token through a small subset of its 256 specialized expert networks — activating only ~37B of 671B total parameters per token. This cuts inference compute while maintaining the representational capacity of a much larger model.
Q: Does Multi-Head Latent Attention hurt model quality compared to standard multi-head attention? Based on published benchmarks, no measurable quality degradation versus standard MHA. MLA was first validated in DeepSeek V2 and carried forward to V3, with both models showing competitive performance against much larger or more expensive models. The KV cache compression is a pure engineering win — lower memory, same effective attention quality.
Q: How is V4's Engram memory different from standard RAG (Retrieval-Augmented Generation)? RAG retrieves from an external vector database at query time, requiring an embedding lookup, approximate nearest-neighbor search, and context injection as separate steps. Engram is a native module inside the model — it stores n-gram patterns as hash embeddings and retrieves them in O(1) time during the forward pass, without any external system. Think of it as a built-in lookup table that coexists with the neural computation stack, rather than a preprocessing step.
Q: Can I fine-tune DeepSeek V3 on a single GPU? Not the full 671B model. Even in FP8 at 1 byte per parameter, the base model alone requires ~671GB of memory. You can fine-tune the smaller distilled variants (7B, 14B) on a single RTX 4090 using LoRA with 4-bit quantization. Full V3 LoRA requires at minimum 8× H100s.
Q: What is the "auxiliary-loss-free" load balancing in V3's MoE? Traditional MoE models add an auxiliary loss term during training to penalize routing imbalance — but too large an auxiliary loss degrades model quality. DeepSeek V3 instead adds a bias term to each expert's routing score. Overloaded experts get a reduced bias (making them less likely to be selected), underloaded experts get an increased bias. The model learns to balance without any direct loss signal for it, preserving quality while maintaining efficient routing.
Q: Is the $5.6M training cost figure for V3 real? The figure is accurate for what it measures: the final training run on 2,788K H800 GPU hours at $2/hour rental pricing. It excludes hardware purchase (estimated $51M+), prior research, ablation experiments, data costs, and failed runs. The honest framing is that DeepSeek achieved comparable model quality to much higher-cost competitors on their final run — not that the total company investment was $5.6M.
Q: When will V4 be available via the API? As of March 2026, V4 has not officially launched. The February 11 context window expansion to 1M tokens is widely interpreted as infrastructure preparation for V4, but no official API model string has been published. Watch the DeepSeek API changelog for the announcement.