
What's up, fellow repo-scrapers — if you've ever stared at a 200-file codebase and wished you could just hand it to the model and ask, this one's for you.
I've been running large-context experiments across DeepSeek's API for months. Multi-file refactors, legacy module audits, dependency tracing across services that nobody documented. And the honest truth I keep landing on: having a big context window is not the same as knowing how to use it. 1M tokens changes what's possible. It doesn't change what's easy.
Here's what I've tested, what broke, and what I'd actually build with before V4 lands on the endpoint.

The context window on DeepSeek's web and app platform was quietly expanded from 128K to 1M tokens on February 11, 2026 — no blog post, no announcement. Three days later they officially confirmed a "long-text model test." Community consensus on r/LocalLLaMA and X narrowed this to V4 infrastructure being tested in production.
That's an 8x increase. And for codebases, the practical implications are real:
With context windows exceeding 1M tokens, DeepSeek V4 can process entire codebases in a single pass. This enables true multi-file reasoning, where the model can understand relationships between components, trace dependencies, and maintain consistency across large-scale refactoring operations.

What this kills: the "which files do I include?" problem. The decision fatigue of manually selecting context. The breakage when a refactor spans six files and you only fed it three.
But here's the thing that kept biting me in testing. A bigger context window doesn't solve the placement problem. A landmark 2023 paper demonstrated that LLMs perform best when relevant information is placed at the beginning or end of the context, and perform significantly worse when it is buried in the middle — the "lost in the middle" problem. Newer models handle this better, but the effect hasn't disappeared. At 1M tokens, burying your core files in position 400K-600K is still a real risk.
The second shift: cost math changes. DeepSeek's Engram memory architecture achieves O(1) retrieval regardless of context length, meaning the cost of processing 1M tokens is projected to be roughly equivalent to processing 128K. That's why the economics work at all. Under a standard attention mechanism, 1M contexts would be prohibitively expensive per query.
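To see why, a back-of-the-envelope comparison helps. Standard self-attention scales quadratically with sequence length, so going from 128K to 1M multiplies attention compute much faster than it multiplies tokens. This is illustrative arithmetic only, not DeepSeek's actual cost model:

```python
# Relative cost under standard self-attention, which is O(n^2) in sequence length.
short_ctx = 128_000
long_ctx = 1_000_000

token_ratio = long_ctx / short_ctx        # ~7.8x more tokens
quadratic_ratio = token_ratio ** 2        # ~61x more attention compute

print(f"tokens: {token_ratio:.1f}x, attention FLOPs: {quadratic_ratio:.0f}x")
# An O(1)-retrieval memory path sidesteps that quadratic blow-up,
# which is what makes near-flat pricing at 1M tokens plausible at all.
```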
This is where most "just dump the whole repo" approaches fail. Not every file deserves equal token real estate in your prompt. Splitting code without respecting structure is the fastest way to break the model's ability to reason about it.
Unlike natural language inputs, software follows a highly structured format and often contains critical dependencies, which makes chunking a delicate operation. For example, splitting code in the middle of a logic routine such as a function definition or control-flow structure would separate valuable context and may severely hamper an LLM's ability to understand the code.
The pattern I've settled on: build a dependency map first, then load in order.
Step 1 — generate a dependency manifest before any prompt:
```python
import ast
from pathlib import Path

def build_dependency_map(repo_path: str) -> dict:
    """
    Map each Python file to its local imports.
    Returns: {file: [list of local dependencies]}
    """
    dep_map = {}
    for py_file in Path(repo_path).rglob("*.py"):
        rel_path = str(py_file.relative_to(repo_path))
        deps = []
        try:
            tree = ast.parse(py_file.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.ImportFrom) and node.module:
                    # Convert module path to file path
                    module_path = node.module.replace(".", "/") + ".py"
                    if (Path(repo_path) / module_path).exists():
                        deps.append(module_path)
        except SyntaxError:
            pass  # files that don't parse simply contribute no edges
        dep_map[rel_path] = deps
    return dep_map

def topological_sort(dep_map: dict) -> list:
    """Sort files so dependencies come before dependents."""
    visited = set()
    order = []

    def dfs(file):
        if file in visited:
            return
        visited.add(file)
        for dep in dep_map.get(file, []):
            dfs(dep)
        order.append(file)

    for file in dep_map:
        dfs(file)
    return order
```
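Sanity-checking the sort on a toy dependency map (standalone sketch that re-inlines the same DFS so it runs on its own; the file names are made up):

```python
def topological_sort(dep_map: dict) -> list:
    """Same DFS as above: dependencies come before dependents."""
    visited, order = set(), []

    def dfs(file):
        if file in visited:
            return
        visited.add(file)
        for dep in dep_map.get(file, []):
            dfs(dep)
        order.append(file)

    for file in dep_map:
        dfs(file)
    return order

# Hypothetical mini-repo: app.py imports services.py, which imports utils.py
dep_map = {
    "app.py": ["services.py"],
    "services.py": ["utils.py"],
    "utils.py": [],
}
print(topological_sort(dep_map))
# ['utils.py', 'services.py', 'app.py']
```

Because `visited` is updated before recursing, a circular import won't loop forever; the cycle members just come out in discovery order.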
Step 2 — build your context payload in dependency order:
````python
from pathlib import Path

def build_context_payload(repo_path: str, task_files: list) -> str:
    """
    Load files in dependency order.
    task_files: the specific files your task touches (go at the END).
    """
    dep_map = build_dependency_map(repo_path)
    sorted_files = topological_sort(dep_map)

    # Foundation first: shared utilities, models, config
    # (exclude task files here so nothing gets loaded twice)
    foundation_files = [f for f in sorted_files
                        if f not in task_files
                        and any(kw in f for kw in ['utils', 'models', 'config', 'base'])]

    # Then the rest of the dependency chain; task files held back for the end
    ordered_files = foundation_files + [f for f in sorted_files
                                        if f not in foundation_files and f not in task_files]
    ordered_files += task_files  # task files at the end = highest model attention

    chunks = []
    for file_path in ordered_files:
        full_path = Path(repo_path) / file_path
        if full_path.exists():
            content = full_path.read_text()
            chunks.append(f"### File: {file_path}\n```python\n{content}\n```")
    return "\n\n".join(chunks)
````
Why task files last? The general rule is to put the most important information first, right after the system prompt — but for code analysis tasks where the task files are the subject of analysis, positioning them at the very end keeps them in the model's highest-attention zone. Foundation context (imports, base classes) goes early. The files you're actually asking about go last.

Ordering isn't just about dependencies — it's about what role each piece plays in the model's working memory.
A more advanced technique is semantic search, which finds code based on conceptual meaning rather than just keywords. A search for "user authentication logic" could find relevant files like auth_controller.py, user_model.js, and session_manager.rb, even if they don't contain those exact keywords.
For large-codebase prompts, I use a three-zone layout:
```
[SYSTEM PROMPT]
 └─ Role + task definition
 └─ Output format requirements

[ZONE 1: ARCHITECTURE CONTEXT] (~20% of tokens)
 └─ README / architecture docs
 └─ config files
 └─ data models / schemas

[ZONE 2: DEPENDENCY CHAIN] (~50% of tokens)
 └─ Base classes and utilities
 └─ Shared modules
 └─ Services called by task files

[ZONE 3: TASK FILES] (~30% of tokens) ← highest attention
 └─ Files directly relevant to your question
 └─ The specific function/class you're asking about
```
This layout works because LLM performance degrades with longer context inputs due to attention dilution and the "lost in the middle" effect, where models have trouble accessing information buried in the middle of long contexts while still handling the beginning and end reasonably well. Zone 1 (system + architecture) and Zone 3 (task files) get the highest attention naturally. Zone 2 provides the connective tissue.
One pattern that kept breaking my prompts: mixing abstraction levels. Don't interleave high-level architecture docs with low-level implementation files. Group by abstraction. The model reasons better when the layers are distinct.
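One minimal way to enforce that grouping is to bucket files by zone before assembling the payload. This is a sketch with hypothetical keyword heuristics; tune the patterns to your own repo layout:

```python
def assign_zone(path: str) -> int:
    """Bucket a file into one of the three zones via simple path heuristics."""
    architecture = ("readme", "architecture", "config", "schema", ".md")
    foundation = ("base", "utils", "models", "shared")
    p = path.lower()
    if any(kw in p for kw in architecture):
        return 1  # Zone 1: architecture context
    if any(kw in p for kw in foundation):
        return 2  # Zone 2: dependency chain
    return 3      # Zone 3: everything else (real task files are named explicitly)

files = ["services/payment.py", "README.md", "utils/helpers.py"]
print(sorted(files, key=assign_zone))
# ['README.md', 'utils/helpers.py', 'services/payment.py']
```

Because `sorted` is stable, files within a zone keep their dependency order from the earlier topological sort.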
These are the prompt templates I've tested against DeepSeek's current API (V3.2, still capped at 128K on the endpoint) and would carry directly into V4's 1M context.
Template 1: Full-repo audit
```
You are analyzing a [LANGUAGE] codebase at [REPO_PATH].
Your task: [SPECIFIC TASK — e.g., "identify all locations where user input is passed to database queries without sanitization"]

Architecture overview:
[INSERT README / ARCHITECTURE DOCS]

Codebase (dependency order):
[INSERT SORTED FILE CONTENTS]

Output format:
- Issue: [file:line] — [description]
- Severity: [high/medium/low]
- Fix suggestion: [concrete code change]

List all findings. Do not summarize. If no issues found in a file, skip it.
```
Template 2: Multi-file refactor
```
You are refactoring a [LANGUAGE] module.

Context — these files will be affected by the refactor:
[INSERT DEPENDENCY-SORTED FILE CONTENTS]

Refactor task: [SPECIFIC CHANGE — e.g., "migrate all direct DB calls in services/ to use the repository pattern already established in base_repository.py"]

Constraints:
- Do not change public function signatures
- Maintain existing test coverage
- Follow patterns in [REFERENCE FILE]

Output: For each changed file, provide the complete updated file content. Start each with `### Updated: [filename]`.
```
Template 3: Dependency trace
```
Trace the full execution path for [FUNCTION/ENDPOINT].

Codebase:
[INSERT FILE CONTENTS]

Task:
1. List every function called, in order of execution
2. For each function, note: file, line number, any external calls (DB, API, cache)
3. Identify any async boundaries
4. Flag any implicit dependencies (global state, env vars, side effects)

Format as a numbered execution trace with file references.
```
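Wiring a template into an actual request is mostly string assembly. A sketch, with the bracketed placeholders turned into Python format fields (the field names here are my own, not part of the templates):

```python
TEMPLATE = """You are analyzing a {language} codebase at {repo_path}.
Your task: {task}

Architecture overview:
{architecture}

Codebase (dependency order):
{codebase}

Output format:
- Issue: [file:line] — [description]
- Severity: [high/medium/low]
- Fix suggestion: [concrete code change]

List all findings. Do not summarize. If no issues found in a file, skip it."""

def fill_audit_template(**fields) -> str:
    """Fill the audit template; raises KeyError if a field is missing."""
    return TEMPLATE.format(**fields)

prompt = fill_audit_template(
    language="Python",
    repo_path="services/",
    task="identify unsanitized DB query inputs",
    architecture="(README contents here)",
    codebase="(dependency-sorted file contents here)",
)
```

From there the filled prompt goes out through the standard `openai` client, since DeepSeek's API is OpenAI-compatible (`base_url="https://api.deepseek.com"`, model `deepseek-chat` at the time of writing — check the current docs before relying on those values).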
Token budget — what fits in 1M tokens at rough averages: the practical ceiling on a mixed repo with docs is around 200–300 files before you start hitting limits on output quality, not just raw token count. Reserve 100K–150K tokens for your prompt structure and expected output.
Long-context prompts are exhausting to tune — getting stuck on which template actually works is normal. Macaron lets you save and reuse working prompt templates across sessions, so the version that finally worked on your last codebase audit is ready for the next one. Try it free at macaron.im and test it with a real task.
Q: Can I just dump an entire repo into a single prompt with V4?
Technically yes for repos under ~200–300 files. But "can" and "should" are different questions. Even at 1M tokens, ordering matters — unstructured dumps will underperform structured, dependency-ordered context. The templates above take 10 minutes to implement and consistently outperform raw dumps in my testing.
Q: Does the 1M context mean I don't need RAG anymore?
For codebase tasks where you want the model to reason across the whole repo simultaneously — multi-file refactors, dependency tracing, cross-module bug hunting — yes, raw context often beats RAG. Full-context loading is necessary when the task genuinely requires reasoning across large amounts of text that cannot be easily chunked. Ask yourself: "Does the model need to reason across all of this data simultaneously, or can the task be decomposed?" If it's decomposable, RAG is still cheaper and faster. If it genuinely needs the full picture, load the full picture.
Q: Is the 1M context currently available on the API?
As of March 2026, the 1M context expansion was confirmed on the web and app on February 11, 2026. The API endpoint remains at V3.2 with 128K context. The 1M API access is expected with V4's full release. Build your templates now so you're ready to test the moment it's available.
Q: How do I estimate token count for a repo before sending?
Use tiktoken (OpenAI's tokenizer) as a rough proxy — DeepSeek's tokenizer produces similar counts for code. A quick script:
```python
import tiktoken
from pathlib import Path

def estimate_tokens(text: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

def estimate_repo_tokens(repo_path: str) -> dict:
    total = 0
    per_file = {}
    for py_file in Path(repo_path).rglob("*.py"):
        content = py_file.read_text(errors='ignore')
        tokens = estimate_tokens(content)
        per_file[str(py_file)] = tokens
        total += tokens
    # 900K threshold leaves ~100K headroom for prompt scaffolding and output
    return {"total": total, "per_file": per_file, "within_1m": total < 900_000}
```
Q: Will these templates work with V3.2 while I wait for V4?
Yes, with one adjustment: V3.2's 128K API limit means you'll need to selectively load files rather than the full repo. The dependency-sorting logic still applies — just load the top-priority files in the same order. The template structure carries over unchanged.
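Selective loading under the 128K cap can be as simple as a greedy budget fill: keep the task files no matter what, then walk the dependency-sorted list and take whatever still fits, reserving room for the response. A sketch (the budget and reserve numbers are illustrative; the token counts come from whatever estimator you trust):

```python
def select_files_for_budget(sorted_files, token_counts, task_files,
                            budget=128_000, output_reserve=20_000):
    """Greedy fill: task files are mandatory, the rest join in dependency order."""
    selected = list(task_files)  # task files are non-negotiable
    used = sum(token_counts[f] for f in task_files)
    for f in sorted_files:
        if f in selected:
            continue
        cost = token_counts[f]
        if used + cost <= budget - output_reserve:
            selected.append(f)
            used += cost
    # Re-order for the prompt: supporting files first, task files last
    support = [f for f in selected if f not in task_files]
    return support + list(task_files)

counts = {"utils.py": 2_000, "services.py": 40_000, "app.py": 80_000}
print(select_files_for_budget(["utils.py", "services.py", "app.py"],
                              counts, task_files=["app.py"]))
# → ['utils.py', 'app.py']  (services.py dropped: it would blow the reserve)
```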
Next in this series:
DeepSeek V4 Version History: V3 → V3-0324 → V4 Timeline (2026)
DeepSeek V4 Context Window: 128K vs 1M Tokens
DeepSeek V4 API: Rate Limits, Auth & Quickstart (2026)