
Author: Boxu Li
Three years after ChatGPT’s debut, a new open-source contender has arrived as a birthday present for the AI community. DeepSeek-V3.2 and DeepSeek-V3.2-Speciale – two newly released large language models – are pushing the boundaries of open AI systems. Developed by Chinese AI lab DeepSeek, these models aim to deliver GPT-5-level reasoning performance, rivaling cutting-edge closed models like Google’s Gemini-3.0-Pro[1][2]. Both models and an in-depth technical report have been open-sourced, giving researchers and developers a closer look at how far open models have come.
DeepSeek-V3.2 is designed as a balanced “daily driver” model – one suitable for general question-answering, coding assistance, and AI agent tasks in real applications. According to DeepSeek’s benchmarks, V3.2’s reasoning capabilities match the level of GPT-5 on public reasoning tests and are only slightly behind Gemini-3.0-Pro[1]. In practical terms, that means V3.2 can handle complex logical and analytical questions nearly as well as the best closed models today. Notably, V3.2 produces much more concise outputs than some prior open models (like Kimi-K2-Thinking), reducing token usage and user wait time without losing reasoning depth[3].
Under the hood, DeepSeek-V3.2 is a roughly 685-billion-parameter Mixture-of-Experts (MoE) model that activates only about 37 billion parameters per token – and it's optimized for efficiency and long-context use. It supports an extended 128K-token context window, enabling analysis of hundreds of pages of text in one go. Despite its size, V3.2 has been fine-tuned to integrate reasoning with external tool use. In fact, it's DeepSeek's first model that can "think" while calling tools. It supports both a chain-of-thought mode and a standard mode for tool use, allowing it to reason through multi-step tool-augmented tasks (like using calculators, code interpreters, or search engines) in a structured way. This makes V3.2 especially powerful for agent applications – from coding assistants that run code to conversational agents that browse the web.
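For developers, the interface is straightforward. Below is a minimal sketch of calling V3.2 with a tool attached through DeepSeek's OpenAI-compatible chat API; the tool definition and prompt are invented for illustration, and the exact mapping of model names onto V3.2's thinking and standard modes should be checked against the current API docs.

```python
# Minimal sketch: calling DeepSeek-V3.2 with a tool definition through the
# OpenAI-compatible client. The tool schema here is illustrative, not official.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",          # assumption: standard API-key auth
    base_url="https://api.deepseek.com",      # DeepSeek's OpenAI-compatible endpoint
)

tools = [{
    "type": "function",
    "function": {
        "name": "run_python",                 # hypothetical tool, for illustration only
        "description": "Execute a Python snippet and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-reasoner",                # thinking mode; "deepseek-chat" = standard mode
    messages=[{"role": "user", "content": "Is 2**61 - 1 prime? Verify with code."}],
    tools=tools,
)
print(resp.choices[0].message)                # may contain a tool call to run_python
```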
For users who need even more reasoning firepower, DeepSeek released V3.2-Speciale alongside the standard model. The Speciale variant pushes open-source reasoning to the extreme, incorporating an extended “thinking” mechanism and even integrating a dedicated math theorem-proving module (from the DeepSeek-Math-V2 model). The result is a model tuned for highly complex problem solving – “exploring the boundaries of model capability,” as the developers put it[4]. On rigorous logic and math benchmarks, DeepSeek-V3.2-Speciale’s performance is comparable to Gemini-3.0-Pro[4], essentially matching the state-of-the-art in those domains.
This claim is backed up by Speciale’s achievements in prestigious competitions: it reportedly achieved gold-medal level results on the International Math Olympiad (IMO 2025), the Chinese Math Olympiad (CMO 2025), the ICPC 2025 World Finals (programming), and the IOI 2025 (informatics)[5]. In fact, on the ICPC coding contest, V3.2-Speciale’s performance equaled that of a human silver medalist (2nd place), and on IOI it was on par with a top-10 human competitor[5]. These are remarkable feats for an AI model, demonstrating reasoning and problem-solving abilities at elite human levels.
It’s worth noting that Speciale is an expert-focused model. It excels at long-form reasoning (e.g. detailed proofs, multi-step logic, complex programming challenges), but it is not optimized for casual chat or creative writing. It’s also more expensive to run – Speciale tends to consume significantly more tokens to arrive at its answers[6]. For now, DeepSeek is only providing V3.2-Speciale via a limited research API (with no tool-use enabled) and cautioning that it’s meant for academic or high-stakes reasoning tasks rather than everyday conversation.
One of the key innovations enabling DeepSeek-V3.2's performance is a new attention mechanism called DeepSeek Sparse Attention (DSA). Traditional Transformer models pay a quadratic cost as context length grows, because every token attends to every other token. DSA breaks this bottleneck by using a fine-grained sparse attention pattern[7]. It introduces a "lightning indexer" component that quickly estimates relevance scores between the current token and past tokens, then selects only the top-k most relevant tokens to attend to[7]. In essence, the model learns to ignore irrelevant context and focus only on the important parts of a long sequence.
This sparse attention design slashes the computation needed for long sequences from O(L²) down to O(L·k), with k much smaller than L. In DeepSeek's implementation, k=2048 was used (each token attends to 2048 selected past tokens) during the second stage of training. The team employed a two-phase training strategy for DSA: first a dense warm-up where the lightning indexer was trained alongside full attention for a few billion tokens, to ensure it learned to mimic full attention's behavior. Then the model was switched to sparse mode and trained on hundreds of billions more tokens with the top-k constraint in place. The result is a huge efficiency gain with no loss in accuracy. In fact, V3.2-Exp (the experimental precursor to the final model) performed on par with V3.1-Terminus across a battery of benchmarks, despite using the new sparse attention[8].
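For intuition, here is a toy single-head PyTorch sketch of the select-then-attend pattern. It is not DeepSeek's implementation – the real indexer is a lightweight multi-head scorer with optimized low-precision kernels, and k is shrunk here to fit a small example – but the O(L·k) structure is the same.

```python
# Toy sketch of top-k sparse attention with a "lightning indexer".
# The indexer still scores all pairs (O(L^2)), but it is deliberately cheap;
# the expensive full-precision attention then runs only on k tokens per query.
import torch

def sparse_attention(q, k, v, idx_q, idx_k, top_k=2048):
    """q, k, v: [L, d] single-head tensors. idx_q, idx_k: [L, d_idx] cheap
    low-dimensional projections used only to estimate relevance."""
    L = q.size(0)
    # 1) Lightning indexer: cheap relevance scores between queries and past keys.
    scores = idx_q @ idx_k.T                              # [L, L]
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))    # keep causality
    # 2) Each token keeps only its top-k most relevant past tokens.
    kk = min(top_k, L)
    top_scores, top_idx = scores.topk(kk, dim=-1)         # [L, k]
    # 3) Full attention restricted to the selected tokens: O(L*k).
    q_scaled = q / q.size(-1) ** 0.5
    k_sel, v_sel = k[top_idx], v[top_idx]                 # [L, k, d]
    logits = torch.einsum("ld,lkd->lk", q_scaled, k_sel)
    logits = logits.masked_fill(top_scores == float("-inf"), float("-inf"))
    attn = logits.softmax(dim=-1)
    return torch.einsum("lk,lkd->ld", attn, v_sel)        # [L, d]

# Example: 4096-token sequence, 64-dim head, 32-dim indexer, k=512.
L, d, d_idx = 4096, 64, 32
out = sparse_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d),
                       torch.randn(L, d_idx), torch.randn(L, d_idx), top_k=512)
print(out.shape)  # torch.Size([4096, 64])
```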
Practically, DSA means long documents are no longer a burden. Internal tests showed up to 2–3× faster processing on 128K-length inputs and about 30–40% lower memory usage[9]. Costs drop dramatically as well. DeepSeek reported that for 128K contexts on their H800 cluster, the prompt (prefill) cost per million tokens fell from ~$0.70 to ~$0.20, and generation cost from ~$2.40 to ~$0.80 – roughly a 3× reduction in long-context inference cost. In the public API, these savings have translated to over 50% lower pricing for users[10]. In short, DSA allows V3.2 to handle extremely long inputs at a fraction of the time and cost of previous models, without compromising output quality.
Another major factor in DeepSeek-V3.2's strong performance is the massive reinforcement learning (RL) fine-tuning that went into it. The DeepSeek team invested an unprecedented amount of compute into post-training RL – more than 10% of the compute used in pre-training, an enormous post-training budget for a model of this scale. This is highly unusual in open-source AI, where RL fine-tuning budgets are typically much smaller. The rationale is that while pre-training teaches broad knowledge, intensive RL can unlock advanced capabilities by aligning the model with complex objectives (like solving multi-step problems, using tools, or adhering to instructions under constraints)[2].
To scale up RL safely, DeepSeek built on their custom Group Relative Policy Optimization (GRPO) algorithm. They introduced several stability and efficiency improvements in this RL pipeline:
- Unbiased KL Estimation: The team fixed issues in the original K3 estimator used for KL-divergence penalties, eliminating a systematic bias that could lead to unbounded gradient updates. This prevented the training instabilities that occur when the policy drifts too far from the reference policy.
- Offline Sequence Masking: Because RL training often generates large batches of "rollout" data that are reused across many gradient updates (an off-policy scenario), DeepSeek computed the KL divergence between the rollout policy and the current policy for each sample. If a generated sequence had strayed too far off-policy relative to the current model, it was masked out (excluded) from training updates[11][12] – see the sketch after this list. This trick ensured the model learned mostly from on-policy or near-on-policy data, improving stability and preventing stale trajectories from skewing learning.
- Keep Routing for MoE: DeepSeek's models use a Mixture-of-Experts architecture, in which different "experts" (sub-networks) handle different tokens. A challenge here is that slight differences between the inference and training implementations can cause different experts to be chosen for the same input, creating inconsistency. DeepSeek addressed this by capturing the expert routing decisions during inference and forcing the same routes during RL updates. This "Keep Routing" method ensured that the parameters adjusted during RL correspond to the experts actually used at inference, avoiding inconsistencies from expert shuffling.
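To make the second item concrete, here is a minimal sketch of offline sequence masking, assuming per-token log-probabilities are stored for both the rollout snapshot and the current policy. The k3-style KL estimator and the 0.1 threshold are illustrative choices, not DeepSeek's published values.

```python
# Sketch of offline sequence masking for reused (off-policy) rollout batches.
# Per sequence, estimate KL(pi_rollout || pi_current) from the sampled tokens;
# sequences that have drifted too far off-policy are dropped from the update.
import torch

def sequence_mask(logp_rollout, logp_current, max_kl=0.1):
    """logp_*: [B, T] log-probs of the sampled tokens under each policy.
    Returns a [B] boolean keep-mask."""
    log_ratio = logp_current - logp_rollout           # log(pi_current / pi_rollout)
    # k3 estimator: E[r - 1 - log r] with r = pi_current/pi_rollout,
    # an unbiased, non-negative, low-variance estimate of the KL.
    kl_per_token = log_ratio.exp() - 1.0 - log_ratio
    kl_per_seq = kl_per_token.mean(dim=-1)            # [B]
    return kl_per_seq <= max_kl

# Stand-in values (real log-probs come from the two policies' forward passes).
B, T = 8, 256
logp_rollout = -torch.rand(B, T)
logp_current = -torch.rand(B, T)
keep = sequence_mask(logp_rollout, logp_current)
# In a GRPO-style update, masked sequences contribute nothing:
# loss = (per_sequence_loss * keep).sum() / keep.sum().clamp(min=1)
print(keep)
```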
On top of these algorithmic tweaks, the data regime for RL was very ambitious. DeepSeek trained a series of specialist models – each focused on a particular domain or skill – and then distilled knowledge from all of them into V3.2. For example, they fine-tuned domain-specific experts for mathematics (proofs), programming, logical reasoning, general tool-augmented tasks, code-based agents, and search-based agents. Each of these specialist models was trained in both a “thinking” (chain-of-thought) mode and a “non-thinking” mode as needed. Using these experts, DeepSeek generated a huge synthetic dataset of high-quality demonstrations in each domain, which was then used to supervise the final V3.2 model. This expert-distillation pipeline supplied V3.2 with rich training signals across 85,000+ complex instructions, covering everything from step-by-step math proofs to software debugging sessions.
One of DeepSeek-V3.2’s headline features is its much improved agent capabilities – essentially, the model’s ability to plan, reason, and use tools in a multi-step loop to solve problems. Earlier versions of DeepSeek’s reasoning model had a major limitation: if the model was in “thinking mode” (i.e. producing a chain-of-thought), it couldn’t call external tools, and vice versa. V3.2 removes that barrier. It is the first DeepSeek model that fully integrates thinking with tool use, meaning it can maintain an internal reasoning chain while also issuing tool calls (e.g. running code, searching the web) mid-dialogue[13]. This yields much more powerful and flexible agent behavior.
To support this, the DeepSeek team re-imagined how the model’s context management works for multi-turn tasks. In V3.2, the model’s reasoning traces (the “thoughts”) are preserved across a sequence of tool calls, instead of being wiped at each step. Only when a new user query arrives does the system reset the reasoning context (while still retaining the relevant tool interaction history in the conversation)[14][15]. This approach saves a lot of tokens and lets the model build up a persistent chain-of-thought for a problem while iteratively invoking tools. For example, if the user asks a complicated coding question, the model can think through the steps, call a Python interpreter to test some code, continue thinking based on the result, perhaps call a documentation search tool, and so on – only finalizing its answer when it has verified a correct solution. All interim reasoning remains available to the model until the task is done.
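A schematic of that reset rule, using our own field names rather than DeepSeek's actual message format, might look like this: reasoning fields persist across tool calls within one request and are pruned only when the next user turn arrives.

```python
# Schematic of V3.2-style context management for multi-turn tool use.
# Field names follow OpenAI-style chat messages; this is an illustration,
# not DeepSeek's wire format.

def prune_history(messages):
    """Drop reasoning from all turns before the latest user message, while
    keeping tool calls/results so the model retains the interaction record."""
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    pruned = []
    for i, m in enumerate(messages):
        if i < last_user and m.get("reasoning"):
            m = {k: v for k, v in m.items() if k != "reasoning"}  # wipe old thoughts
        pruned.append(m)
    return pruned

history = [
    {"role": "user", "content": "Fix the failing test in utils.py"},
    {"role": "assistant", "reasoning": "The bug is likely an off-by-one...",
     "tool_calls": [{"name": "run_tests", "args": {}}]},
    {"role": "tool", "content": "1 failed: test_window_bounds"},
    {"role": "assistant", "reasoning": "Confirmed; patching the loop bound...",
     "content": "Fixed: the slice end was exclusive. Patch applied."},
    {"role": "user", "content": "Great - now add a regression test."},
]
# Reasoning survives across the tool loop, and is dropped only now that a
# new user request has arrived; tool interactions are kept.
for m in prune_history(history):
    print(m["role"], sorted(m.keys()))
```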
DeepSeek also gave the model a “cold start” prompt that explicitly encourages this behavior. The system instructions nudge the model to first output a detailed reasoning process (marked with special tokens) before revealing the final answer, especially for complex tasks like programming challenges. This prompt engineering ensures V3.2 knows it should engage its chain-of-thought and tool abilities for difficult queries, rather than jumping straight to an (often flawed) answer.
Perhaps the most impressive aspect of V3.2's agent skillset comes from how it was trained. The team constructed an automatic environment synthesis pipeline to create realistic, challenging scenarios for the model to learn from. They generated 1,827 interactive task environments paired with 85,000+ complex instructions for the model to solve[16]. Crucially, these tasks were designed to be "hard to solve, easy to verify." In other words, the model is presented with problems that have a large search space (difficult to solve by chance) but a clear criterion for checking a solution. This property makes them ideal for reinforcement learning: the model can experiment (or use a tool) to propose a solution and then quickly verify whether it meets all the given constraints.
For example, one synthesized task was a three-day travel itinerary planning problem with multiple constraints (don’t repeat cities, adjust budgets dynamically based on hotel costs, etc.). It’s extremely hard for a model to just guess a valid itinerary because the constraints create a combinatorial problem – but if the model comes up with a candidate itinerary, it’s straightforward to verify if all constraints are satisfied. By training on many such tasks (spanning domains like travel planning, scheduling, logical puzzles, and more), V3.2 learned to better handle problems that require search, optimization, or multi-step reasoning. This training regimen has greatly improved the model’s generalization to new, unseen agent tasks.
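The "easy to verify" half of that property is simple to picture in code. Here is a toy checker for a candidate three-day itinerary – all rules and data invented for illustration – showing how a few lines of verification can serve directly as an RL reward signal.

```python
# Toy "hard to solve, easy to verify" task: check a 3-day itinerary against
# constraints. Rules and data are invented; the point is that verification
# is trivial even though finding a valid plan requires search.

def verify_itinerary(plan, budget=600):
    """plan: list of 3 dicts like {"city": str, "hotel_cost": int, "activity": str}."""
    checks = {
        "three_days": len(plan) == 3,
        "no_repeat_cities": len({day["city"] for day in plan}) == len(plan),
        "within_budget": sum(day["hotel_cost"] for day in plan) <= budget,
        # Dynamic rule: after an expensive hotel, the next day must be cheaper.
        "budget_adjusts": all(plan[i + 1]["hotel_cost"] < plan[i]["hotel_cost"]
                              for i in range(len(plan) - 1)
                              if plan[i]["hotel_cost"] > 250),
    }
    return all(checks.values()), checks

candidate = [
    {"city": "Kyoto", "hotel_cost": 280, "activity": "temples"},
    {"city": "Osaka", "hotel_cost": 180, "activity": "food tour"},
    {"city": "Nara",  "hotel_cost": 120, "activity": "deer park"},
]
ok, report = verify_itinerary(candidate)
print(ok, report)   # True -> reward 1.0; any failed check -> reward 0.0
```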
In the realm of coding agents, DeepSeek tapped into GitHub – mining millions of real issue threads and pull requests. They automatically constructed tens of thousands of executable coding challenge environments from this data. The model could practice reading a bug report or feature request, then navigating a codebase (with tool assistance) to implement a fix or feature. These environments covered multiple programming languages (Python, Java, JavaScript, etc.), exposing the model to a wide variety of software problems. A separate pipeline handled search-based QA agents: using a multi-agent simulation, DeepSeek generated datasets where one agent posed tough questions about long-tail entities and another agent (with access to a search tool) had to find and verify the answers. This multi-step generation (question construction → web search → answer validation) yielded high-quality training examples for teaching V3.2 how to be an effective “research assistant.”
Thanks to these efforts, DeepSeek-V3.2 has made a breakthrough in tool-using agent tasks. On internal evaluations, V3.2 achieved the highest scores of any open model on a suite of agent benchmarks, significantly closing the gap with closed models[17]. The developers highlight that V3.2 was not explicitly tuned to the specific tools in those tests – suggesting its agent skills transfer to real-world scenarios, not just narrow benchmarks[18]. In other words, the model learned how to reason and use tools in general, rather than overfitting to particular tasks.

How do DeepSeek’s new models stack up against the best AI systems on the market? The technical report and early analyses provide some answers. Broadly, DeepSeek-V3.2 delivers top-tier performance in mathematical reasoning and coding tasks, and V3.2-Speciale even rivals the very best on complex reasoning – but there remain areas (like open-ended tool use) where closed models still hold an edge. Below is a snapshot of selected benchmark results that illustrate the competitive landscape:
Table 1: Performance on Sample Reasoning Benchmarks (Accuracy%)
Sources: DeepSeek technical report[4]. GPT-5.1 and Gemini results are approximate values read from the report's graphs. Speciale often matches or exceeds Gemini on math tasks, while standard V3.2 is at GPT-5 level, slightly below Gemini.
As we can see, DeepSeek-V3.2 lives up to its promise on academic reasoning challenges. On math contests like AIME and HMMT, V3.2’s accuracy is in the same ballpark as an advanced GPT-5 model, and only a few points shy of Gemini’s state-of-the-art scores. The Speciale model even outperforms Gemini on those math benchmarks[4], demonstrating the payoff of its enhanced “long thinking” approach. These results are striking – math and formal reasoning were long considered a weakness of open models, but V3.2 shows that open-source systems can achieve frontier-level performance in this domain[19].
On the coding side, DeepSeek-V3.2 also shines, though the competition is fierce. In the SWE-Bench Verified test (which checks if a model can produce bug-fixing code diffs that pass unit tests), V3.2 scored ~73%, significantly surpassing its predecessor (V3.1 scored ~66%[20]) and roughly on par with other top open models like Moonshot’s Kimi K2 and Alibaba’s Qwen-3. In fact, all these open models slightly outperform OpenAI’s older 120B baseline on this coding benchmark[21][22]. This underscores how far open models have progressed in practical coding ability. DeepSeek V3.2 can reliably fix real bugs and generate working code, making it extremely useful for developer assistance.
However, against the absolute best closed models, the picture is mixed. On certain coding tasks, GPT-5.1 still holds an advantage. For instance, in the more complex Terminal-Bench 2.0 (which evaluates multi-step CLI tool use and coding in an agent loop), early reports indicate GPT-5 and even Anthropic’s Claude outperform DeepSeek, especially in sustained reliability over long tool-using sessions[23]. DeepSeek-V3.2’s accuracy drops on those intricate multi-step agent tasks, reflecting that while it’s very capable, it isn’t yet the top performer when it comes to fully autonomous coding agents or long-horizon problem solving. Similarly, on comprehensive tool-use benchmarks like MCP-Universe and Tool-Decathlon, V3.2 trails well behind GPT-5 and Gemini[24]. OpenAI and Google’s systems still execute complex, multi-tool plans more consistently. The gap has narrowed – V3.2 reached new highs for open models on these tests[17] – but a sizable margin remains before open models can truly match closed ones in general tool-use proficiency.
In summary, DeepSeek-V3.2 delivers near-frontier performance in many areas. It's competitive with GPT-5 on real-world coding tasks and even rivals Gemini on advanced math reasoning[19]. At the same time, it is not an outright replacement for GPT-5 or Gemini across the board – especially in ultra-complex "agent" scenarios involving elaborate tool orchestration, where those closed models still have an edge[25][24]. This balanced view is important for setting expectations: V3.2 excels in what it was optimized for (reasoning and coding with efficiency), while the Speciale variant shows what is possible when pushing reasoning to the limit.
Despite the impressive achievements, the DeepSeek team is candid about certain limitations of the V3.2 series. First, because the total training FLOPs (floating-point operations) are still less than some ultra-large closed models, world knowledge breadth and memorization of rare facts in V3.2 may lag behind leaders like GPT-5. In other words, it might not know some obscure trivia or domain-specific info that larger proprietary models have absorbed. This is a common trade-off in open models, which often have to train on slightly smaller or less diverse corpora.
Another challenge is token efficiency. DeepSeek notes that both V3.2 and Speciale sometimes need to generate longer reasoning chains to reach the same answer quality that a model like Gemini-3.0-Pro can achieve with a more concise response[6]. In practice, this means using V3.2 in its “thinking mode” might incur a higher token cost (and latency) to solve extremely difficult problems – the model will be verbose as it works through the steps. Speciale in particular, while extraordinarily capable, is token-hungry: it might produce a very detailed proof or explanation where a human expert or a refined closed model could give a tighter answer. This is not always a downside (the thorough reasoning can be valuable), but it does make certain uses more costly.
DeepSeek-V3.2 also currently lacks fine-tuning for open-ended conversational finesse or creative writing. The focus of its training was clearly on structured problem solving and agents. Users have observed that its style is logical and informative, but perhaps less naturally chatty or imaginative compared to models like GPT-4 or Claude in casual dialogue. This was a conscious choice: DeepSeek prioritized research tasks, coding, and math abilities for this release, even if it meant some drop in general chattiness.
Looking forward, the DeepSeek team has hinted at continued progress. The V3.2 technical report openly discusses these shortcomings as targets for future improvement. There is already community anticipation for a potential DeepSeek-R2 model – which, if the naming holds, could be the next reasoning-centric model building on R1 and V3.2’s foundations. (DeepSeek’s followers half-jokingly pleaded “When will R2 arrive?!” in response to the V3.2 launch.) If and when R2 comes, the expectation is that it might further close the gaps, perhaps by incorporating even larger training runs, more knowledge infusion, and improved token efficiency techniques.
For now, DeepSeek-V3.2 represents a milestone in the open-source AI world. It demonstrates that with clever engineering – from sparse attention to massive RL fine-tuning and synthetic task generation – an open model can reach frontier performance on reasoning and coding, areas once thought to be the guarded domain of trillion-parameter closed models. As one analyst put it, V3.2 is “a strong, low-cost thinking and coding model that delivers frontier-level results where most developers actually work: code and math”[26]. It might not dethrone GPT-5 or Gemini as the universal AI solution, but in its specialized role, DeepSeek-V3.2 succeeds spectacularly[27] – and crucially, it does so as a freely available model. In the broader AI ecosystem, that is a priceless gift indeed on this anniversary of ChatGPT.
Sources: The information and quotes in this article are drawn from DeepSeek’s official release notes and technical report[1][4][13][17], news coverage and analyses in AI publications[2], as well as independent evaluations of DeepSeek-V3.2 by early users[19][24] and community experts[7][8]. All benchmarks and comparisons reflect the current state (Dec 2025) of model performance on the respective tasks.
[1] [3] [4] [5] [6] [13] [14] [15] [16] [17] [18] DeepSeek, "DeepSeek V3.2 Official Release: Enhanced Agent Capabilities, Integrated Thinking and Reasoning," DeepSeek API Docs.
https://api-docs.deepseek.com/zh-cn/news/news251201
[2] "DeepSeek Releases New Reasoning Models to Match GPT-5, Rival Gemini 3 Pro."
[7] [8] [9] [10] [11] [12] [21] [22] Barnacle Goose, "DeepSeek V3.2-Exp Review," Medium, Oct 2025.
https://medium.com/@leucopsis/deepseek-v3-2-exp-review-49ba1e1beb7c
[19] [23] [24] [25] [26] [27] Mehul Gupta, "DeepSeek V3.2 vs Gemini 3.0 vs Claude 4.5 vs GPT-5," Data Science in Your Pocket (Medium), Dec 2025.
[20] "deepseek-ai/DeepSeek-V3.1," Hugging Face.