Introduction: What Is Happening at the Intersection of LLMs and RL
From 2025 into 2026, reinforcement learning (RL) has come to play an unprecedentedly important role in improving the performance of large language models (LLMs). Behind the dramatic gains in reasoning capabilities seen in models such as DeepSeek and QwQ lies RL-based post-training. In practice, however, applying RL to LLMs presents serious challenges from the perspective of computational efficiency.
Hugging Face's research team has published a comprehensive survey of this ecosystem titled "Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries." [Source: https://huggingface.co/blog/async-rl-training-landscape] This article uses that survey as its foundation to explain the technical realities and challenges of asynchronous RL training.
Why "Keeping the Tokens Flowing" Matters
The biggest bottleneck in RL training for LLMs is low GPU utilization. In supervised fine-tuning (SFT), data can be processed in batches, keeping GPUs running at nearly 100% capacity. In RL, however, the following cycle repeats continuously:
- Rollout generation: The current policy (model) generates tokens sequentially
- Reward computation: A reward model evaluates the generated sequences
- Parameter update: Gradients are computed and the model is updated
In synchronous RL, these three steps execute sequentially, meaning training halts during rollouts and rollouts halt during updates. This "dead time" is what significantly reduces GPU utilization. The title "Keep the Tokens Flowing" succinctly captures the importance of eliminating this inefficiency.
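The blocking structure of the synchronous cycle can be made concrete with a minimal sketch. All function bodies here are stand-ins (the names `generate_rollout`, `compute_reward`, and `update_policy` are illustrative, not any library's API); the point is that each phase must finish before the next begins:

```python
import random

def generate_rollout(policy_version: int, prompt: str) -> str:
    """Stand-in for sequential token generation by the current policy."""
    return f"{prompt} -> completion (policy v{policy_version})"

def compute_reward(completion: str) -> float:
    """Stand-in for a reward model scoring the completion."""
    return random.random()

def update_policy(policy_version: int, rewards: list[float]) -> int:
    """Stand-in for a gradient step; returns the new policy version."""
    return policy_version + 1

# Synchronous loop: each phase blocks the next, so the GPUs doing
# training sit idle during generation, and vice versa.
policy_version = 0
for step in range(3):
    completions = [generate_rollout(policy_version, p) for p in ["a", "b"]]
    rewards = [compute_reward(c) for c in completions]
    policy_version = update_policy(policy_version, rewards)
```

In a real system each of these calls occupies a different kind of hardware resource, which is exactly why serializing them wastes GPU time.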
The Architectural Diversity Revealed by 16 Libraries
The 16 open-source libraries surveyed include VERL, OpenRLHF, TRL (GRPOTrainer), NeMo-Aligner, DeepSpeed-Chat, SkyRL, EasyR1, LLaMA-Factory, and others. These can broadly be classified into two architectural categories.
Synchronous RL Training
The simplest approach, executing rollout -> reward -> update in series. While easy to implement and debug, GPU utilization often remains around 30-50%. TRL's GRPOTrainer and DeepSpeed-Chat fall into this category by design.
Asynchronous RL Training
In the asynchronous approach, the actor (inference) and learner (weight updates) operate in parallel. Because the actor can continue generating tokens while the learner performs updates, throughput improves. However, a divergence (the off-policy problem) arises between the policy used to generate rollouts and the policy after updates, which carries the risk of reduced learning stability.
VERL is a particularly notable library in this space, featuring a flexible architecture that combines vLLM and DeepSpeed. SkyRL actively adopts asynchronous scheduling and aims for high throughput on large-scale clusters.
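The actor-learner split described above can be sketched with two threads and a queue. This is a toy model, not the architecture of any specific library: the actor keeps producing samples while the learner consumes them, and the gap between the actor's weight version and the learner's is exactly the off-policy staleness discussed here.

```python
import queue
import threading
import time

rollout_q: "queue.Queue[tuple[int, str]]" = queue.Queue(maxsize=8)
current_version = 0
stop = threading.Event()

def actor() -> None:
    """Keeps generating with whatever weights it currently has."""
    while not stop.is_set():
        rollout_q.put((current_version, "completion"))
        time.sleep(0.001)  # stand-in for generation latency

def learner(num_updates: int) -> None:
    """Consumes rollouts and updates weights; generation never pauses."""
    global current_version
    for _ in range(num_updates):
        version, _sample = rollout_q.get()
        # `current_version - version` is the staleness (off-policy gap)
        current_version += 1

t = threading.Thread(target=actor, daemon=True)
t.start()
learner(5)
stop.set()
```

Throughput rises because the two loops overlap, but nothing here forces the consumed sample to match the current weights, which is the stability risk the text describes.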
GRPO vs. PPO: Algorithm Selection in Practice
On the algorithm side, GRPO (Group Relative Policy Optimization) has rapidly gained adoption alongside the traditional PPO (Proximal Policy Optimization). GRPO's greatest advantage is that it requires no critic model, reducing memory consumption by roughly half. DeepSeek-R1's adoption of GRPO brought it widespread recognition.
PPO, on the other hand, benefits from variance reduction via a value function and is considered superior in learning stability for complex tasks. The survey of 16 libraries shows that while more libraries are trending toward GRPO, the practical advantages of each remain task-dependent. [Source: https://huggingface.co/blog/async-rl-training-landscape]
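GRPO's critic-free trick is easy to show in isolation: for each prompt, several completions are sampled and each reward is normalized against its own group, so the group mean plays the role that a learned value function plays in PPO. A minimal sketch (the function name is illustrative):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of its own group, so no learned value function
    (critic) is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # avoid divide-by-zero
    return [(r - mean) / std for r in group_rewards]

# One prompt, four sampled completions with scalar rewards:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat their group's average get positive advantage and the rest get negative, which is why memory savings come at the cost of a noisier baseline than PPO's value function.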
The Three Major Implementation Challenges
1. The Dead Token Problem
In asynchronous settings, generated tokens easily become "stale" samples that are inconsistent with the updated policy. Using these dead tokens for training causes off-policy bias to accumulate. Correction via importance sampling is effective, but when samples exceed the clipping range, they must be discarded, which reduces effective throughput.
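The correct-or-discard logic can be sketched per token. The clipping threshold and function name here are illustrative assumptions, not any library's defaults: a token keeps an importance weight while the new-policy/old-policy probability ratio stays in a trust region, and is dropped once it drifts too far.

```python
import math

def token_weight(logp_new: float, logp_old: float, clip: float = 5.0):
    """Importance weight for a token generated under an older policy.
    Returns None when the probability ratio leaves the trust region,
    i.e. the sample is too stale ('dead') and should be discarded."""
    ratio = math.exp(logp_new - logp_old)
    if ratio > clip or ratio < 1.0 / clip:
        return None  # discard: off-policy correction is no longer reliable
    return ratio

kept = token_weight(-1.0, -1.1)   # mild drift -> corrected and kept
dead = token_weight(-1.0, -9.0)   # large drift -> discarded
```

Every discarded token was generated at full GPU cost, which is the throughput loss the text refers to.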
2. Integration with Distributed Inference
When incorporating fast inference engines such as vLLM or SGLang into the RL loop, weight synchronization overhead becomes a problem. Designs that allocate separate GPU pools for rollouts and training (actor-learner separation) are effective, but come with the cost trade-off of requiring nearly twice as many GPUs.
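The trade-off between synchronization overhead and staleness can be illustrated with a toy periodic-sync loop (the classes and `SYNC_EVERY` knob are hypothetical, standing in for a learner pushing weights to vLLM/SGLang workers):

```python
import copy

class Learner:
    """Stand-in for the training process that owns the live weights."""
    def __init__(self):
        self.weights = {"w": 0.0}
        self.version = 0
    def step(self):
        self.weights["w"] += 0.1
        self.version += 1

class InferenceWorker:
    """Stand-in for an inference-engine worker holding its own copy."""
    def __init__(self):
        self.weights = {"w": 0.0}
        self.version = 0
    def load(self, weights, version):
        self.weights = copy.deepcopy(weights)  # the expensive transfer
        self.version = version

learner, worker = Learner(), InferenceWorker()
SYNC_EVERY = 4  # fewer syncs -> less transfer overhead, more staleness
for step in range(1, 9):
    learner.step()
    if step % SYNC_EVERY == 0:
        worker.load(learner.weights, learner.version)
```

Between syncs the worker generates with weights up to `SYNC_EVERY - 1` versions old; dedicating a separate GPU pool to the workers removes the pause but, as noted, roughly doubles the hardware bill.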
3. Reward Hacking and Scalability
Reward hacking, the over-optimization of reward model scores, is a classic problem across RL in general. In the context of LLMs, cases have been reported where models generate outputs that exploit weaknesses in the reward model, making reward diversification and appropriate KL penalty settings essential. Even the NeMo Agent Toolkit's first-place result on the DABStep benchmark relied on an approach that sidesteps this problem through reusable tool generation. [Source: https://huggingface.co/blog/nvidia/nemo-agent-toolkit-data-explorer-dabstep-1st-place]
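A common form of the KL penalty mentioned above shapes the reward by docking the score when the policy drifts from a frozen reference model. This sketch uses a simple single-sample KL estimate (the log-probability difference); the function name and coefficient are illustrative:

```python
def shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                  kl_coef: float = 0.1) -> float:
    """Reward shaping: subtract a KL penalty, estimated here by the
    policy/reference log-prob difference, to keep the policy near the
    reference model and damp reward hacking."""
    kl_est = logp_policy - logp_ref  # single-sample KL estimate
    return rm_score - kl_coef * kl_est

# A policy drifting far from the reference gets its reward docked:
r_close = shaped_reward(1.0, logp_policy=-2.0, logp_ref=-2.1)
r_drift = shaped_reward(1.0, logp_policy=-2.0, logp_ref=-8.0)
```

Tuning `kl_coef` is the balancing act: too low and hacking resumes, too high and the reward signal is drowned out.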
Guidelines for Library Selection
Choosing which library to use in practice becomes clearer when considered from the following perspectives:
- Research and prototyping focus: Libraries with simple, easy-to-modify implementations, such as TRL's GRPOTrainer or EasyR1, are suitable
- Throughput focus on large-scale clusters: Libraries that adopt asynchronous architectures with mature vLLM integration, such as VERL or SkyRL, are advantageous
- Compatibility with existing infrastructure: DeepSpeed-Chat has high affinity with the DeepSpeed ecosystem and low migration costs from existing SFT pipelines
- Multimodal and speech support: When handling compact edge-oriented models such as IBM's Granite 4.0 Speech, it is necessary to consider combinations with lightweight inference [Source: https://huggingface.co/blog/ibm-granite/granite-4-speech]
Future Outlook
The most important insight from the survey of 16 libraries is that the efficiency of RL training is still an evolving problem. As GPU cluster sizes expand and model sizes grow, the importance of asynchronization and distribution will only increase. At the same time, there are still many areas where no consensus has been reached on algorithmic stability or on how to handle off-policy data.
From an engineering perspective, large-scale data management infrastructure such as the Storage Buckets introduced to HuggingFace Hub is also becoming an important foundation supporting the ecosystem for large-scale RL experiments. [Source: https://huggingface.co/blog/storage-buckets]
The very fact that the open-source community is trying as many as 16 diverse approaches is itself evidence that solutions in this field have not yet been established. For researchers and engineers, this is precisely the frontier to enter and contribute to.
Category: LLM | Tags: Reinforcement Learning, LLM, RL Training, GRPO, Open Source