Breaking Through Reinforcement Learning Training Limits with Scaling Rollouts in BroRL

Nov 19, 2025
By Jian Hu, Shizhe Diao, Mingjie Liu, Ximing Lu, and Yi Dong
When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome performance plateaus. The previous NVIDIA Research solution, Prolonged Reinforcement Learning (ProRL), showed that adding more reinforcement learning (RL) steps during prolonged training could expand the reasoning boundaries of LLMs. 

But eventually, the team hit a wall. After thousands of steps, performance gains diminished, and the model’s improvement stagnated or even began to degrade. For more details, see Scaling LLM Reinforcement Learning with Prolonged Training Using ProRL v2.

This raises a critical question: Is this plateau a fundamental limit of RL, or is it an artifact of how scaling is performed? 

Today, we’re excited to introduce Broadened Reinforcement Learning (BroRL), a new paradigm that explores a complementary and powerful scaling dimension: rollout scaling. Instead of just training for more steps, BroRL dramatically increases the number of exploratory rollouts for each prompt to the order of hundreds. This approach breaks through the performance ceiling where other methods stall, and proves to be significantly more data- and compute-efficient. We are releasing our state-of-the-art 1.5B model trained with BroRL. 

This post dives into related core theoretical insights, new empirical results, and why scaling rollouts is the key to unlocking the next level of reasoning in LLMs.

How does BroRL enable continuous learning?

Most RL scaling efforts focus on training length. This often leads to an unstable learning signal, where the model struggles to escape its existing knowledge base. The perceived limits of RL are often just the limits of its exploration strategy.

BroRL challenges this paradigm by focusing on rollout scaling for exploration at each update step. The goal is to move beyond incremental gains by fundamentally stabilizing the RL process, enabling continuous learning where it previously stalled.

| Step scaling (ProRL, for example) | Rollout scaling (BroRL) |
| --- | --- |
| Scales with more training steps (3,000+) | Scales with more rollouts per prompt (N=512) |
| Hits a performance plateau; diminishing returns | Breaks the plateau; robust, continuous improvement |
| Learning signal can be unstable and noisy | Stable, high-quality updates from exhaustive exploration |
| Becomes inefficient at the saturation point | More compute- and data-efficient |

Table 1. Core comparison of step scaling (ProRL) and rollout scaling (BroRL)

How does rollout scaling control RL instability?

As detailed in BroRL: Scaling Reinforcement Learning via Broadened Exploration, our theoretical analysis (Section 2) reveals that the RL update process is governed by two competing forces: sampled rollouts and unsampled space. 

To provide an analogy, think of it like exploring a vast, foggy landscape to find the highest peak. The paths you actually walk (sampled rollouts) provide reliable, positive feedback, helping you gain altitude. Yet the infinite number of paths you don’t take (the unsampled space) create uncertainty and noise. This noise acts like a gravitational pull, dragging you back down the hill. When you only send out a few scouts (N=16 in ProRL), their reports are noisy, and this downward pull can be strong enough to halt your ascent, leaving you stuck on a plateau.

The BroRL solution is simple but powerful: send out an entire army of scouts (N=512). By mapping a huge portion of the landscape, the random noise from the unexplored fog averages out and becomes negligible, while the “upward signal” from all the successful paths becomes overwhelmingly strong.

In our formal analysis, this means the net change in the model’s performance becomes positive ($\Delta Q_{pos} \ge 0$) when N is large. This provides a stable, high-quality learning signal that allows the model to climb past the plateau.
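The averaging intuition can be illustrated with a toy simulation. This is not the formal analysis from the paper: it assumes each rollout is an independent pass/fail draw with a hypothetical success probability `p`, and it only shows how the spread of a prompt's mean reward shrinks as N grows from 16 to 512. The function names are made up for this sketch.

```python
import math
import random
import statistics


def empirical_spread(p: float, n_rollouts: int, trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation, across repeated trials, of the mean reward
    obtained from `n_rollouts` independent pass/fail rollouts."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        passes = sum(rng.random() < p for _ in range(n_rollouts))
        means.append(passes / n_rollouts)
    return statistics.pstdev(means)


def theoretical_spread(p: float, n_rollouts: int) -> float:
    """Standard error of a Bernoulli(p) mean over n_rollouts samples."""
    return math.sqrt(p * (1.0 - p) / n_rollouts)


if __name__ == "__main__":
    p = 0.3  # hypothetical per-rollout success probability
    for n in (16, 512):
        print(f"N={n}: empirical={empirical_spread(p, n):.4f}, "
              f"theory={theoretical_spread(p, n):.4f}")
```

Under these assumptions the noise in the per-prompt estimate falls with the square root of N, so moving from 16 to 512 rollouts shrinks it by a factor of about 5.7.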

Breaking through the RL performance plateau

We applied the BroRL recipe to a strong ProRLv2 model that had already plateaued after 3,000 training steps. The results were definitive.

Figure 1 tells a powerful story. While continuing with the ProRL recipe (blue line) leads to stagnation and eventual degradation, BroRL (orange line) revives the model, enabling robust and continuous performance gains that break through the previous ceiling.

A line graph titled ‘Math Score Improvement Over Time’ that displays two lines representing different training methods, labeled ProRL and BroRL, against the training time in hours on the x-axis.
Figure 1. BroRL (N=512) demonstrates continuous performance improvement on the Math benchmark, whereas ProRL (N=16) reaches a plateau and degrades with prolonged training

BroRL comprehensive results 

We continued training the 3,000-step ProRLv2 checkpoint with both the original recipe (N=16) and the new BroRL recipe (N=512) on 64 NVIDIA H100 GPUs. The divergence was clear: ProRL stagnated, while BroRL delivered steady, significant gains in less time.

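| Method | N | RL steps | Total time (h) | Math score | Code score | Reasoning Gym score |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 16 | 2,000 | - | 60.14 | 51.43 | 59.06 |
| Baseline | 16 | 3,000 | - | 61.69 | 52.00 | 61.29 |
| ProRL | 16 | 3,000+225 | +56.3 | 62.08 | 52.26 | 62.10 |
| ProRL | 16 | 3,000+535 | +133.8 | 62.02 (stagnated) | 52.74 | 61.45 (degraded) |
| BroRL | 512 | 3,000+107 | +98.1 | 62.62 | 53.31 | 62.71 |
| BroRL | 512 | 3,000+134 | +122.8 | 62.85 | 53.48 | 62.82 |
| BroRL | 512 | 3,000+419 | +393.9 | 63.66 | 56.64 | 63.40 |

Table 2. Comprehensive performance comparison of BroRL and ProRL on key reasoning benchmarks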

After just 98.1 hours, BroRL had already surpassed the final performance of the ProRL method (reached after 133.8 hours) on all three benchmarks, about 35.7 hours sooner. This confirms that scaling rollout size is a more effective and computationally efficient strategy for pushing the boundaries of a saturated model.
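This comparison can be checked directly against the Table 2 rows. The sketch below only redoes that arithmetic; the variable names are illustrative and do not come from any released code.

```python
# Rows copied from Table 2: the final ProRL checkpoint and the first BroRL checkpoint.
prorl_final = {"hours": 133.8, "math": 62.02, "code": 52.74, "reasoning_gym": 61.45}
brorl_early = {"hours": 98.1, "math": 62.62, "code": 53.31, "reasoning_gym": 62.71}

benchmarks = ("math", "code", "reasoning_gym")

# Additional training time BroRL needed, relative to ProRL's final checkpoint.
hours_saved = round(prorl_final["hours"] - brorl_early["hours"], 1)

# Score margin of the early BroRL checkpoint over the final ProRL checkpoint.
margins = {name: round(brorl_early[name] - prorl_final[name], 2) for name in benchmarks}

if __name__ == "__main__":
    print("hours saved:", hours_saved)
    for name in benchmarks:
        print(f"{name}: {margins[name]:+.2f}")
```

Every margin is positive, so the earliest BroRL checkpoint in the table is ahead of the last ProRL checkpoint on Math, Code, and Reasoning Gym.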

BroRL sets a new state of the art for 1.5B reasoning models, achieving the highest scores on the Math (63.66), Code (56.64), and Reasoning Gym (63.40) benchmarks.

Superior compute efficiency

BroRL is not only more accurate; it also uses compute more efficiently, both algorithmically and at the hardware level.

  • Algorithmic efficiency: Large-N rollouts produce a more diverse set of candidate samples. The pass rate for dynamic sampling, which filters out uninformative trajectories, jumped from 41% to 62%, meaning less computation was wasted.
  • Hardware efficiency: BroRL shifts the generation process from being memory-bound to compute-bound and improves the prefix cache hit rate. Consequently, the GPU can fully utilize its parallel processing power, nearly doubling the throughput from 36.5 to 72.4 samples/s in our hardware setup.

| Method (N) | Dynamic sampling pass rate | Generation throughput (samples/s) |
| --- | --- | --- |
| ProRL (16) | 41% | 36.5 |
| BroRL (512) | 62% | 72.4 |

Table 3. Compute efficiency metrics for BroRL versus ProRL (sampling pass rate and throughput)
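One way to see why a larger N can raise the dynamic sampling pass rate is sketched below. It rests on an assumption that the post does not spell out: that the filter discards a prompt when all of its rollouts receive the same binary reward (all pass or all fail), since such a group gives no contrast to learn from. The function names and the per-rollout success probability `p` are hypothetical.

```python
from typing import Sequence


def is_informative(rewards: Sequence[int]) -> bool:
    """Assumed filter: keep a prompt only if its 0/1 rewards are mixed."""
    return 0 < sum(rewards) < len(rewards)


def keep_probability(p: float, n_rollouts: int) -> float:
    """Chance that n independent pass/fail rollouts with success
    probability p are neither all-pass nor all-fail."""
    return 1.0 - p ** n_rollouts - (1.0 - p) ** n_rollouts


if __name__ == "__main__":
    # Hard, medium, and easy prompts (hypothetical success probabilities).
    for p in (0.02, 0.5, 0.98):
        print(f"p={p}: N=16 -> {keep_probability(p, 16):.3f}, "
              f"N=512 -> {keep_probability(p, 512):.3f}")
```

In this toy model, very hard and very easy prompts are the ones most often discarded at N=16 and most often kept at N=512. The numbers are illustrative only and do not reproduce the 41% and 62% pass rates in Table 3, which depend on the real mix of prompt difficulty.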

Greater token efficiency

BroRL delivers higher accuracy with fewer output tokens on both Math and Code benchmarks, indicating better score-per-token efficiency and tighter, less redundant reasoning.

Large-N rollout exploration (N=512) surfaces many concise, high-yield trajectories per prompt, which both raises the chance of sampling compact correct chains and reduces reliance on verbose, low-signal reasoning. This decouples quality from response length where step-scaling typically inflates tokens.

| Task | ProRL score | BroRL score | Score diff | ProRL tokens | BroRL tokens | Token diff |
| --- | --- | --- | --- | --- | --- | --- |
| Math | 62.02 | 63.66 | +1.64 | 16,506 | 15,760 | -745 |
| Code | 52.74 | 56.64 | +3.90 | 26,808 | 26,090 | -717 |

Table 4. Token efficiency comparison of BroRL and ProRL on math and code tasks
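The score-per-token claim can be made concrete with the Table 4 figures. The sketch below divides each score by its output token count; the "score per 1,000 tokens" metric and the helper names are illustrative choices for this example, not a metric defined in the paper.

```python
# (score, output tokens) per method, copied from Table 4.
TABLE_4 = {
    "Math": {"ProRL": (62.02, 16506), "BroRL": (63.66, 15760)},
    "Code": {"ProRL": (52.74, 26808), "BroRL": (56.64, 26090)},
}


def score_per_1k_tokens(score: float, tokens: int) -> float:
    """Benchmark score earned per 1,000 output tokens."""
    return 1000.0 * score / tokens


def efficiency_table(table: dict) -> dict:
    """Apply score_per_1k_tokens to every (task, method) entry."""
    return {
        task: {method: score_per_1k_tokens(*entry) for method, entry in methods.items()}
        for task, methods in table.items()
    }


if __name__ == "__main__":
    for task, row in efficiency_table(TABLE_4).items():
        print(task, {method: round(value, 3) for method, value in row.items()})
```

On both tasks the BroRL ratio comes out higher than the ProRL ratio, which is what a higher score combined with fewer tokens implies.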

Get started with BroRL

Our findings establish rollout size not just as a hyperparameter, but as a critical and efficient axis for scaling reinforcement learning. The performance plateaus encountered by step-scaling methods are not fundamental limits of RL but artifacts of insufficient exploration. Key insights and takeaways include:

  • Rollout scaling is a new, crucial scaling dimension for RL. It provides a stable learning signal where step scaling alone fails.
  • Performance plateaus are not dead ends. They can be overcome by scaling rollouts to generate higher-quality policy updates.
  • BroRL is more computationally efficient, doubling hardware throughput and improving algorithmic sample efficiency.
  • BroRL is more token efficient, achieving more with less.
  • The new BroRL-trained checkpoint sets a new state of the art for 1.5B reasoning models.

For those looking to maximize the potential of their models with RL, BroRL provides a principled path forward: when you hit a wall, don’t just push forward—go wider.

To get started, explore and evaluate the BroRL model, available through Hugging Face.

Acknowledgments

Thank you to Yejin Choi, Fang Wu, Zaid Harchaoui, Pavlo Molchanov, Jan Kautz, and Jun Yang for their contributions to this post.

About the Authors

About Jian Hu
Jian Hu is a senior deep learning engineer at NVIDIA, focusing on large language models (LLMs) and reinforcement learning from human feedback (RLHF). He received his master’s degree in Computer Science from National Taiwan University and began a PhD program at HKUST(GZ), which he later chose to leave. Jian has five years of work experience in computer engineering and machine learning, and is the first author of the popular RLHF projects OpenRLHF and REINFORCE++. His interests include reinforcement learning, artificial general intelligence (AGI), and model-system co-optimization.
About Shizhe Diao
Shizhe Diao is a research scientist at NVIDIA Research who works on efficient training and alignment of foundation models. He completed his PhD at the Hong Kong University of Science and Technology. Shizhe has seven years of experience in machine learning and natural language processing, and is the first author of the popular post-training project LMFlow.
About Mingjie Liu
Mingjie Liu is a senior research scientist on the Learning and Perception Research team at NVIDIA. He completed his PhD at the University of Texas at Austin. His current research centers on using reinforcement learning to strengthen LLM reasoning and agentic performance toward general artificial intelligence. He previously worked on customizing domain-specific LLMs for chip design, including ChipNeMo and RTL code generation.
About Ximing Lu
Ximing Lu is a research scientist on the Language and Cognition Research team at NVIDIA. She previously earned her bachelor’s degree in Computer Science at the University of Washington. Her research interests center on data synthesis, reinforcement learning, agentic systems, model architecture, and multimodality. She is a co-recipient of the Best Paper Award at NAACL 2022 and the Outstanding Paper Award at EMNLP 2023.
About Yi Dong
Yi Dong is a principal research scientist on the Learning and Perception Research team at NVIDIA, where he leads efforts in developing reasoning models and virtual agents. He earned his PhD in Computational Neuroscience from the Johns Hopkins University School of Medicine. With over a decade of experience in software engineering, machine learning, finance and AI research, Yi’s research focuses on building artificial general intelligence (AGI) systems that emulate human cognitive capabilities, particularly in enabling models to reason effectively in complex and novel situations for enterprise applications.

