How to Download DeepSeek V4: HuggingFace & Ollama

What's up, self-hosters — if your first reaction to "V4 drops this week" was to check your disk space and NVMe speed, you're in the right place.

I've been running DeepSeek's open-weight releases locally since V3. Every new generation I go through the same routine: wait for the GGUF community to settle on the right quant, watch Unsloth do the actual heavy lifting, and work out what "runs on consumer hardware" actually means versus what the optimistic Reddit thread claims. Here's the framework I'd apply to V4 the moment weights drop.

Important upfront caveat: V4 is expected in the first week of March 2026 but has not been officially released as of this writing. The download steps below mirror the confirmed V3.x patterns, which V4 is expected to follow, plus size/VRAM estimates based on pre-release architecture specs. Treat the file size numbers as estimates until the official repo lands.


3 Ways to Access DeepSeek V4

Before you spend three hours downloading 400GB, pick your method:

| Method | Best For | Setup Time | Hardware Needed |
|---|---|---|---|
| Hugging Face | Full control, fine-tuning, research | 30–120 min (mostly download) | 2+ GPUs or high-RAM workstation |
| Ollama | Fastest local setup, day-to-day use | 5–10 min | Single GPU (24GB+) with Q4 quant |
| API | Production, no hardware, zero setup | 2 min | None |

If you're not sure which fits — use the API while you wait for the dust to settle on community GGUF releases. That's what I do on day one.


Option 1 — Hugging Face Download

Model Files & Sizes

The V3 (671B) release on HuggingFace totaled 685GB, covering 671B main-model parameters plus a 14B-parameter Multi-Token Prediction (MTP) module. V4's trillion-parameter MoE architecture means the full BF16 release will be substantially larger. Based on the parameter scale increase, expect:

| Format | V3.2 (actual) | V4 Full (estimated) | V4 Lite (estimated) |
|---|---|---|---|
| BF16 full precision | ~1.34 TB | ~2–2.5 TB | ~400–500 GB |
| FP8 | ~685 GB | ~1–1.2 TB | ~200–250 GB |
| Q4_K_M (GGUF) | ~405 GB | ~600–700 GB | ~150–200 GB |
| Q2_K dynamic | ~246 GB | ~350–450 GB | ~80–120 GB |
| 1-bit (UD-TQ1_0) | ~170 GB | ~240–300 GB | ~60–80 GB |

For the V3.1 release, Unsloth's GGUF sizes ranged from 170GB (1-bit UD-TQ1_0) up to 1.34TB (BF16 full precision), with Q4_K_M landing at 405GB. Those are your real-world benchmarks for what "community quantized" means at 671B scale. V4 at ~1T parameters scales roughly proportionally.
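The scaling math is simple enough to sanity-check yourself. A rough sketch, assuming ~4.8 effective bits per weight for Q4_K_M (an approximation; real GGUF files add metadata overhead on top):

```python
def est_gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough quantized-file size: parameter count times bits per weight, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# V3.1 at Q4_K_M (~4.8 bits/weight effective) -- close to the observed 405GB
print(round(est_gguf_size_gb(671, 4.8)))   # 403
# A hypothetical ~1T-parameter V4 at the same quant
print(round(est_gguf_size_gb(1000, 4.8)))  # 600
```

The same function with ~1.6 bits/weight roughly reproduces the 1-bit dynamic numbers, which is why the estimates in the table above scale near-linearly with parameter count.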

To download when V4 drops, use huggingface-cli for controlled, resumable downloads:

pip install huggingface_hub hf_transfer
# Enable faster downloads on 1Gbit+ connections
export HF_HUB_ENABLE_HF_TRANSFER=1
# Download a specific quantized file (replace with V4 repo when available)
huggingface-cli download deepseek-ai/DeepSeek-V4-GGUF \
  DeepSeek-V4-Q4_K_M.gguf \
  --local-dir ./deepseek-v4 \
  --local-dir-use-symlinks False
# Or pull all Q4 variants
huggingface-cli download deepseek-ai/DeepSeek-V4-GGUF \
  --local-dir ./deepseek-v4 \
  --include='*Q4_K*gguf'

For the full safetensors release (FP8/BF16), use snapshot_download:

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4",  # confirm repo name at launch
    local_dir="DeepSeek-V4",
    allow_patterns=["*.safetensors", "*.json", "*.py"],
    ignore_patterns=["*.pt"],  # skip legacy pytorch checkpoints
)
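One practical step before kicking off a download this size: check free disk space first, so a failed write doesn't waste hours of transfer. A stdlib-only sketch (the 700GB threshold is my assumption for a Q4-class V4 download plus headroom):

```python
import shutil

def free_space_gb(path: str = ".") -> float:
    """Free disk space at `path` in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9

required_gb = 700  # assumed Q4-class V4 download plus headroom
if free_space_gb(".") < required_gb:
    print(f"Need ~{required_gb}GB free; have {free_space_gb('.'):.0f}GB")
```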

Watch the deepseek-ai organization on Hugging Face — that's where the official weights will land.

Quantized vs Full Precision

The decision tree I use every time a new DeepSeek release drops:

| If you want... | Use this format |
|---|---|
| Best possible quality, multi-GPU research | BF16 full precision |
| Good quality, single-node inference (8xA100/H100) | FP8 |
| Single-GPU (24GB) daily use | Q4_K_M or UD-Q4_K_XL GGUF |
| Minimal VRAM, acceptable quality degradation | Q2_K or UD-TQ1_0 (1-bit dynamic) |
| CPU + RAM offload (no GPU requirement) | UD-IQ1_S or UD-TQ1_0 |

The 1-bit dynamic TQ1_0 quant uses ~170GB for V3.1, and it actually works on a single 24GB GPU by offloading MoE expert layers to CPU. That's the breakthrough that made consumer-grade DeepSeek self-hosting real. V4's Lite variant is expected to push this further, but the same technique should apply to V4 Full with the right quant.

For local deployment, DeepSeek recommends temperature=1.0 and top_p=0.95 across all V3.x series — carry those settings forward to V4.


Option 2 — Ollama (Easiest)

Ollama is the fastest path from "nothing" to "running locally." DeepSeek-V3 on Ollama is 404GB at Q4 quantization with a 160K context window — one command, one download, no configuration.

For V4, the pattern will be identical. When the Ollama library adds support:

# Install Ollama if you haven't already
curl -fsSL https://ollama.com/install.sh | sh
# Pull V4 when available (confirm model tag at ollama.com/library)
ollama pull deepseek-v4
# Or pull a specific quantization (hypothetical tags; confirm at ollama.com/library)
ollama pull deepseek-v4:lite
# Run interactively
ollama run deepseek-v4

Ollama's Python client (Ollama also exposes an OpenAI-compatible endpoint at /v1, so it drops into any existing integration):

from ollama import chat
response = chat(
    model='deepseek-v4',
    messages=[{'role': 'user', 'content': 'Explain MoE architecture in 100 words.'}],
)
print(response.message.content)

Or use the REST endpoint directly:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "deepseek-v4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
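By default /api/chat streams newline-delimited JSON, one chunk at a time, with a final chunk marked done. A minimal sketch of reassembling the reply, with sample chunks hardcoded in place of a live server (the field names follow Ollama's documented response shape):

```python
import json

def assemble_stream(lines):
    """Concatenate message.content from Ollama-style NDJSON chunks until done=true."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Two sample chunks standing in for a streamed response
sample = [
    '{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    '{"message": {"role": "assistant", "content": "!"}, "done": true}',
]
print(assemble_stream(sample))  # Hello!
```

Pass "stream": false in the request body if you'd rather get one complete JSON object back.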

V3 on Ollama requires version 0.5.5 or later. V4 will have its own minimum version requirement — check the model page before running.

The catch with Ollama: you get less control over inference parameters and KV cache behavior. For production or fine-tuning workflows, go Hugging Face. For "I want to ask questions about my codebase without sending data to an API," Ollama is unbeatable.


Option 3 — API (No Download)

If you don't have the hardware, don't want a 400GB download sitting on your NVMe, or need this running in production today — use the official DeepSeek API. No VRAM math required.

The API uses an OpenAI-compatible interface, so switching costs are near zero:

from openai import OpenAI
client = OpenAI(
    api_key="<your DeepSeek API key>",
    base_url="https://api.deepseek.com",
)
response = client.chat.completions.create(
    model="deepseek-chat",  # will point to V4 post-launch
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)

Current API pricing (as of March 2026): $0.28/M input tokens (cache miss), $0.028/M (cache hit), $0.42/M output. The deepseek-chat endpoint currently maps to V3.2 — V4 will appear on this endpoint after launch, with advance notice in the DeepSeek API changelog.
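Those per-token rates are easier to reason about as a quick cost function (prices hardcoded from the figures above; verify against the current pricing page before budgeting):

```python
def api_cost_usd(input_miss_tokens: int, input_hit_tokens: int, output_tokens: int) -> float:
    """Estimate DeepSeek API cost from the March 2026 per-million-token rates."""
    return (input_miss_tokens * 0.28
            + input_hit_tokens * 0.028
            + output_tokens * 0.42) / 1e6

# e.g. 1M fresh input tokens plus 500K output tokens
print(round(api_cost_usd(1_000_000, 0, 500_000), 2))  # 0.49
```

Note the 10x gap between cache-miss and cache-hit input pricing: keeping a stable system prompt across requests pays for itself quickly.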


Hardware Requirements

Honest numbers. Not "minimum to technically run" — minimum to get usable inference speed.

| Setup | V4 Full | V4 Lite (est.) | What you can actually do |
|---|---|---|---|
| 8× H100 80GB | ✅ BF16 | ✅ BF16 | Full precision, fast inference |
| 8× A100 80GB | ✅ FP8 | ✅ FP8 | Production inference, ~40 tok/s |
| 2× RTX 4090 (48GB) | ✅ Q4_K_M | ✅ Q4_K_M | Local dev, ~10–15 tok/s |
| 1× RTX 4090 (24GB) | ⚠️ Q2 + CPU offload | ✅ Q4_K_M | Usable but slow for V4 Full |
| CPU only (512GB RAM) | ⚠️ 1-bit, very slow | ✅ Q4, slow | Background tasks only |
| MacBook Pro M3 Max (128GB) | ⚠️ 1-bit only | ✅ Q4 | Light use, 2–4 tok/s |

The "2× RTX 4090" threshold is the practical self-hosting floor for V4 Full at usable speeds. For V4 Lite, a single 24GB GPU at Q4 quantization is the target — that's the use case it's designed for.

RAM requirement for CPU offload (MoE expert layers): plan for at least 256GB system RAM for V4 Full Q4, 128GB for V4 Lite Q4. The 1-bit dynamic quant for V3.1 works on a single 24GB card by offloading MoE expert layers to CPU, and V4 will support the same technique via llama.cpp's -ot ".ffn_.*_exps.=CPU" flag.
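To turn those numbers into a go/no-go check for your own box, the arithmetic is just "quant file size minus what fits in VRAM must fit in RAM." A rough sketch that ignores KV cache and runtime overhead, which add tens of GB at long contexts (the 4GB VRAM reserve is my assumption):

```python
def fits_with_offload(model_gb: float, vram_gb: float, ram_gb: float,
                      vram_reserve_gb: float = 4) -> bool:
    """True if a quantized model fits across GPU VRAM plus CPU RAM offload."""
    on_gpu = max(vram_gb - vram_reserve_gb, 0)  # keep headroom for activations
    return model_gb - on_gpu <= ram_gb

# 405GB Q4 V3-class model, single 24GB GPU, 512GB system RAM
print(fits_with_offload(405, 24, 512))  # True
# Same model with only 256GB system RAM
print(fits_with_offload(405, 24, 256))  # False
```

Fitting is only half the story: the more expert weights land in system RAM, the more tok/s drops toward CPU-bound speeds.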


Pick the access method that matches your hardware and workflow, and you should have V4 running in under 30 minutes. If you're building something on top of it and want to save your working prompts, test configurations, or track outputs across sessions — Macaron lets you store and reuse those setups so you're not starting from scratch each time. Try it free at macaron.im.


FAQ

Q: When will V4 weights be available on Hugging Face?

DeepSeek plans to release V4 this week (week of March 2, 2026). Based on V3's release pattern, open weights appeared on Hugging Face within 24–48 hours of the API launch. GGUF community quantizations (Unsloth, TheBloke-style) typically follow within 1–3 days. Watch huggingface.co/deepseek-ai for the official repo.

Q: Can I run V4 Full on a single RTX 4090?

Technically yes, with 1-bit or 2-bit quantization and CPU offload for MoE expert layers. Speed will be 1–3 tok/s, which is usable for offline analysis tasks but painful for interactive use. For interactive inference on a single 4090, wait for V4 Lite.

Q: Does V4 require a special version of llama.cpp or Ollama?

Expect yes. V3 required Ollama 0.5.5+, and V4's new architecture (Engram, DSA Lightning Indexer) will likely require updated inference engine support. Community repos will document the minimum version requirement within days of weight release. Check the model's HuggingFace README before running.

Q: What's the chat template for V4?

V3.2 introduced significant updates to its chat template including a revised tool-calling format and a "thinking with tools" capability, with encoding scripts provided in a dedicated folder on the HuggingFace repo. V4 will include the same — check tokenizer_config.json and the assets/chat_template.jinja file in the official repo. The pattern: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|>.
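As a sketch of that pattern (V3-style single turn; the special tokens and literal ▁ characters here come from V3's template, so confirm against V4's tokenizer_config.json before relying on it):

```python
def format_prompt(system: str, query: str) -> str:
    """Render a single-turn prompt in the V3-style DeepSeek chat template."""
    return f"<|begin▁of▁sentence|>{system}<|User|>{query}<|Assistant|>"

print(format_prompt("You are a helpful assistant.", "Hello!"))
```

In practice, let the tokenizer's apply_chat_template handle this rather than hand-rolling strings; the sketch is only to make the token layout concrete.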

Q: Is V4 MIT licensed like V3.2?

V3.2 weights are licensed under the MIT License. V4 is expected to follow the same open-weight release under MIT or Apache 2.0, based on reporting and DeepSeek's historical pattern. Confirm in the repo LICENSE file before deploying commercially.

Q: The download keeps timing out. What's the fix?

Enable hf_transfer for faster, more stable large-file downloads: pip install hf_transfer then export HF_HUB_ENABLE_HF_TRANSFER=1. For multi-file safetensors downloads, snapshot_download with allow_patterns handles resumption better than downloading files one by one. If you're outside China and hitting rate limits, consider using a mirror or cloud VM closer to HuggingFace's CDN.

Up next in this series:

DeepSeek V4 Version History: V3 → V3-0324 → V4 Timeline (2026)

DeepSeek V4 Context Window: 128K vs 1M Tokens

DeepSeek V4 API: Rate Limits, Auth & Quickstart (2026)

DeepSeek V4 Architecture: MoE & Latent Attention Explained

How to Build an AI Agent with DeepSeek V4

Hey, I’m Hanks — a workflow tinkerer and AI tool obsessive with over a decade of hands-on experience in automation, SaaS, and content creation. I spend my days testing tools so you don’t have to, breaking down complex processes into simple, actionable steps, and digging into the numbers behind “what actually works.”
