Scaling All-Sync Reinforcement Learning with DAPO and LoRA to DeepSeek 671B

MIND LABS

Motivation and Introduction

As the development of Large Language Models (LLMs) transitions from optimizing pre-training objectives to enhancing capabilities through experiential learning, Reinforcement Learning (RL) has become a critical methodology. Models such as DeepSeek-R1 have shown that RL can substantially improve the reasoning abilities of LLMs, marking a shift towards building more capable and autonomous agents.

However, a primary impediment to progress in this domain is the prohibitive computational cost associated with training state-of-the-art models. For example, applying standard RL fine-tuning techniques to a 671-billion-parameter model can necessitate a cluster of as many as 512 H100 GPUs, posing a significant barrier to entry for many academic and commercial research groups.

Furthermore, the practical application of advanced RL algorithms at this scale introduces considerable technical challenges. These include the inherent instability of the training process, inefficiencies in GPU utilization, and the operational complexity of managing large-scale distributed systems.

This paper presents a methodology for scaling RL to the 671B-parameter DeepSeek Mixture-of-Experts (MoE) model by integrating Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) and Low-Rank Adaptation (LoRA). Our approach enables stable and efficient training on only 48 H100 GPUs, roughly an order of magnitude less hardware than the 512 GPUs cited above. Our key contributions include an All-Sync RL architecture that minimizes GPU idle time, thereby reducing training duration by 50% compared to baseline methods and making large-scale RL experimentation more feasible and cost-effective.

Related Work

DeepSeek V3 and R1

The release of the DeepSeek model series, particularly the 671B-parameter R1-0528, validated the efficacy of applying RL at massive scale to enhance complex reasoning. Our work builds upon the foundational research of DeepSeek AI, focusing on optimizing the underlying RL training framework to improve its accessibility and efficiency.

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)

DAPO is a policy optimization algorithm designed to stabilize and regularize policy updates in large-scale RL. We extend it with a Multi-conversation DAPO framework: an agent sequentially processes segments of a long-horizon task, pauses to generate an intermediate memory summary, and then resumes its operation, with the RL reward computed from the final outcome of the entire sequence. This approach is instrumental for training agents on complex tasks, such as generating complete software applications, because end-to-end reinforcement preserves contextual consistency across segments.
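
To make the rollout structure concrete, the sketch below shows one way such a multi-conversation episode could be organized. The `policy` and `scorer` arguments are hypothetical stand-ins for the actual generation stack and outcome checker, and the prompt templates are illustrative only.

```python
def multi_conversation_rollout(policy, scorer, task_segments,
                               summary_tokens=512, segment_tokens=4096):
    """Run a long-horizon task as a sequence of conversations.

    After each segment the agent writes a compact memory summary that is carried
    into the next conversation; only the final outcome of the whole sequence is
    scored, and that terminal reward credits every turn in the transcript.
    """
    memory = ""       # intermediate summary carried across conversations
    transcript = []   # (prompt, response) pairs used for the RL update

    for segment in task_segments:
        prompt = f"Memory so far:\n{memory}\n\nTask segment:\n{segment}"
        response = policy.generate(prompt, max_new_tokens=segment_tokens)
        transcript.append((prompt, response))

        # Pause: compress the progress so far into a short memory summary.
        memory = policy.generate(
            f"Summarize the progress and key decisions so far:\n{response}",
            max_new_tokens=summary_tokens,
        )

    final_reward = scorer(transcript)  # e.g., does the finished app build and pass tests
    return transcript, final_reward
```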

LoRA (Low-Rank Adaptation)

LoRA is an established parameter-efficient fine-tuning (PEFT) method. Its application to MoE models at the 671B-parameter scale, however, is non-trivial. We have developed a novel integration of LoRA with quantization and other optimization strategies to manage the substantial memory and compute requirements. This methodology enables stable RL training of a 671B model on a cluster of only 48 H100 GPUs, a significant reduction from the 512 GPUs required for full-parameter fine-tuning.
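
For reference, the core reparameterization behind LoRA is small: each frozen weight matrix receives a trainable low-rank update. A minimal PyTorch sketch follows; the rank and scaling values are illustrative and independent of the MoE-specific engineering described later.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update.

    Output = W x + (alpha / r) * B A x, where only A and B receive gradients,
    so optimizer state scales with the adapter rank rather than the base model.
    """

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapters start as a no-op update
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```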

Technical Challenges and Solutions

LoRA Fine-tuning for the 671B DeepSeek MoE

Adapting LoRA to a 671B MoE model necessitated considerable engineering efforts, as standard implementations do not scale efficiently due to the complexity of managing numerous experts in a distributed pipeline. We developed a novel approach that combines LoRA with quantization and advanced parallelization strategies. This innovation allows for efficient training on a moderately sized GPU cluster, addressing hardware constraints that are common in many research environments.
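
The sketch below illustrates the flavor of this setup using the Hugging Face peft and bitsandbytes interfaces as a stand-in for our custom distributed pipeline. The model ID, target module names, and adapter rank are placeholders rather than the exact DeepSeek internals, and real 671B training additionally requires expert, pipeline, and data parallelism not shown here.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights so cluster memory goes to activations,
# adapter optimizer state, and the colocated inference engine.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1",                         # placeholder model ID
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=32,                                              # adapter rank (illustrative)
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "kv_b_proj", "o_proj"],  # placeholder module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                     # only the adapters train
```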

GRPO with All-Sync Parallelization

Standard on-policy RL algorithms like Group Relative Policy Optimization (GRPO) often leave computational resources underutilized, a problem colloquially known as "GPU bubbles": the training step must sit idle while the inference step finishes generating trajectories. Asynchronous methods can partially hide this latency, but only by training on stale, off-policy data. We instead introduce a fully synchronous All-Sync RL architecture that eliminates these GPU bubbles while remaining strictly on-policy, effectively doubling training efficiency and reducing associated costs by 50% compared to baseline implementations.
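
For context, GRPO's group-relative advantage is the quantity the rollout and training phases exchange each step: every prompt is answered by a group of sampled responses, and each response is scored against its own group, so no value network is needed. A minimal sketch of that computation, independent of the parallelization scheme:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    `rewards` has shape (num_prompts, group_size): one scalar reward per sampled
    response.  Each response's advantage is its reward standardized against the
    other responses for the same prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```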

Tackling GRPO Instability through DAPO Integration

A well-documented challenge in large-scale RL is the instability of the training process. Our integration of the Multi-conversation DAPO framework substantially improves the stability of GRPO. By structuring tasks as a sequence of conversational turns with a terminal reward signal, we can more effectively guide the model's learning trajectory over long horizons without compromising stability. This method reduces the number of training steps required to achieve equivalent performance to baseline GRPO by 50%, attributable to improved stability and convergence properties.
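
One stabilization ingredient associated with the published DAPO recipe is a decoupled (asymmetric) clipping range, which lets low-probability tokens be reinforced more aggressively than they can be suppressed; whether these exact coefficients match our 671B runs is an assumption. A hedged sketch of the token-level clipped loss:

```python
import torch

def dapo_clipped_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      eps_low: float = 0.2,
                      eps_high: float = 0.28) -> torch.Tensor:
    """Token-level clipped surrogate loss with a decoupled clip range.

    logp_new / logp_old are per-token log-probabilities under the current and
    behavior policies; advantages are per-token values broadcast from the
    sequence-level group-relative advantage.  eps_high > eps_low widens the
    upward clip so rare but useful tokens are not starved of gradient.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.minimum(ratio * advantages, clipped * advantages)
    return -per_token.mean()
```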

H100 GPU Usage Optimization Path

End-to-end Bottleneck Analysis and Optimization

A comprehensive end-to-end analysis of the entire RL pipeline was performed to identify and address key bottlenecks in computation and memory utilization. The synergistic combination of our All-Sync RL framework, LoRA adaptations, and DAPO integration resulted in a substantial reduction in the wall-clock time for a single optimization step. For instance, a representative step that previously required 9 hours to complete can now be executed in just 1.5 hours.
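
The analysis itself needs nothing exotic: a per-phase wall-clock breakdown of each RL step already reveals where the bubbles are. A minimal sketch of such instrumentation; the phase names and the commented calls are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)

@contextmanager
def timed(phase: str):
    """Accumulate wall-clock time per pipeline phase to locate bottlenecks."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start

# Hypothetical per-step usage:
#   with timed("rollout"):  trajectories = generate(policy, prompts)
#   with timed("reward"):   rewards = score(trajectories)
#   with timed("update"):   loss = train_step(policy, trajectories, rewards)
# Sorting the accumulated totals shows which phase dominates the step time.
```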

Experiments

Math: Balanced Reasoning

Task Definition: The objective was to solve complex mathematical problems that require multi-step reasoning, with a reward function designed to optimize for both accuracy and token efficiency. This "Balanced Reasoning" task penalizes the excessive token generation characteristic of unconstrained chain-of-thought prompting.
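
One simple way to express such a reward combines a binary correctness term with a capped penalty for tokens beyond a budget; the weights and budget below are illustrative assumptions, not the exact shaping used in our runs.

```python
def balanced_reasoning_reward(is_correct: bool, num_tokens: int,
                              token_budget: int = 4096,
                              length_weight: float = 0.2) -> float:
    """Reward correctness while penalizing tokens beyond a budget.

    The 0/1 correctness term dominates; the length term only subtracts when the
    response overruns the budget, and it is capped so it cannot flip the sign
    of a correct answer's reward.  All constants here are illustrative.
    """
    correctness = 1.0 if is_correct else 0.0
    overrun = max(0, num_tokens - token_budget) / token_budget
    length_penalty = length_weight * min(overrun, 1.0)
    return correctness - length_penalty
```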

671B Result: Our fine-tuned model achieved 90% of the performance of the baseline R1 model on the designated mathematical benchmark while consuming only 40% of the reasoning tokens. This outcome signifies a major improvement in inference efficiency and cost-effectiveness, which is particularly advantageous for multi-turn, interactive applications.

Coding: RL for Agentic Memory

Task Definition: This task required the model to generate a complete and functional multi-component software application (e.g., front-end, back-end, database) from a single natural language prompt. Success in this task is contingent on maintaining perfect consistency and long-term memory across a generation context spanning tens of thousands of tokens.
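
A terminal reward for this task can be a purely outcome-based check on the generated project. The sketch below is a hypothetical example; the expected directory layout and the test command are placeholders.

```python
import subprocess
from pathlib import Path

def score_generated_project(project_dir: str, timeout_s: int = 600) -> float:
    """Outcome-only reward for a generated multi-component app (illustrative).

    Gives partial credit for producing the expected components and full credit
    only when the project's own test suite passes end to end.
    """
    root = Path(project_dir)
    expected = ("frontend", "backend", "db")        # placeholder layout
    structure = sum((root / name).exists() for name in expected) / len(expected)

    try:
        result = subprocess.run(["make", "test"],   # placeholder test command
                                cwd=root, capture_output=True, timeout=timeout_s)
        tests_pass = result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        tests_pass = False

    return 0.3 * structure + 0.7 * (1.0 if tests_pass else 0.0)
```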

671B Result: By employing our Multi-conversation DAPO framework to foster what we term Agentic Memory, the model maintained context and consistency over extremely long generation sequences, successfully producing entire software projects. For example, it generated a personalized daily outfit suggestion tool that retained user preferences and correctly integrated with external APIs. This end-to-end RL approach to memory represents a significant advance in the development of complex, stateful agents.

Exploration

Human-Agent Interaction

We are currently leveraging a proprietary dataset of over 548 GB, gathered from millions of user interactions on our interactive story platform, MidReal. This dataset is being used to investigate how RL can train models to be more engaging, better infer implicit user needs, and establish greater trust. By curating a high-quality subset representing the top 5% of these interactions, we aim to further enhance the model's capabilities in memory retention, tool utilization, and sophisticated conversational skills.

Acknowledgement

We extend our gratitude to the research team at MIND LABS and our collaborators from MIT, Princeton, OpenAI, and other institutions. This work would not be possible without their contributions. We also acknowledge the foundational research published by DeepSeek AI and the invaluable resources provided by the broader open-source community.
