When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome performance plateaus. The previous NVIDIA Research solution, Prolonged Reinforcement Learning (ProRL), showed that adding more reinforcement learning (RL) steps during prolonged training could expand the reasoning boundaries of LLMs.
But eventually, the team hit a wall. After thousands of steps, performance gains diminished and the model's improvement stagnated or even began to degrade. For more details, see Scaling LLM Reinforcement Learning with Prolonged Training Using ProRL v2.
This raises a critical question: Is this plateau a fundamental limit of RL, or is it an artifact of how scaling is performed?
Today, we’re excited to introduce Broadened Reinforcement Learning (BroRL), a new paradigm that explores a complementary and powerful scaling dimension: rollout scaling. Instead of just training for more steps, BroRL dramatically increases the number of exploratory rollouts for each prompt to the order of hundreds. This approach breaks through the performance ceiling where other methods stall, and proves to be significantly more data- and compute-efficient. We are releasing our state-of-the-art 1.5B model trained with BroRL.
This post dives into related core theoretical insights, new empirical results, and why scaling rollouts is the key to unlocking the next level of reasoning in LLMs.
How does BroRL enable continuous learning?
Most RL scaling efforts focus on training length. This often leads to an unstable learning signal, where the model struggles to escape its existing knowledge base. The perceived limits of RL are often just the limits of its exploration strategy.
BroRL challenges this paradigm by focusing on rollout scaling for exploration at each update step. The goal is to move beyond incremental gains by fundamentally stabilizing the RL process, enabling continuous learning where it previously stalled.
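To make the recipe change concrete, here is a minimal sketch of one rollout-scaled update, assuming a GRPO-style group-relative baseline. This is not the actual BroRL training code: `policy.sample`, `policy.update`, and the `verifier` callable are hypothetical placeholders for your own training stack.

```python
# Minimal sketch of rollout scaling (hypothetical interfaces, not the BroRL codebase).
# The only change from a step-scaling recipe is how many rollouts are sampled
# per prompt before each policy update.
import numpy as np

N_ROLLOUTS = 512  # BroRL samples hundreds of rollouts per prompt; ProRL used 16

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Score each rollout against the mean/std of all rollouts for the same prompt."""
    std = rewards.std()
    if std < 1e-8:  # all rollouts got the same reward -> no learning signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def broadened_update(policy, verifier, prompt):
    # 1. Broad exploration: sample many candidate solutions for a single prompt.
    rollouts = [policy.sample(prompt) for _ in range(N_ROLLOUTS)]
    # 2. Verifiable reward: 1.0 if the answer checks out, else 0.0.
    rewards = np.array([verifier(prompt, r) for r in rollouts], dtype=np.float32)
    # 3. One policy update computed from the large, low-noise group of advantages.
    policy.update(prompt, rollouts, group_relative_advantages(rewards))
```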
| Step scaling (ProRL, for example) | Rollout scaling (BroRL) |
| --- | --- |
| Scales with more training steps (3,000+) | Scales with more rollouts per prompt (N=512) |
| Hits a performance plateau; diminishing returns | Breaks the plateau; robust, continuous improvement |
| Learning signal can be unstable and noisy | Stable, high-quality updates from exhaustive exploration |
| Becomes inefficient at the saturation point | More compute- and data-efficient |
How does rollout scaling control RL instability?
As detailed in BroRL: Scaling Reinforcement Learning via Broadened Exploration, our theoretical analysis (Section 2) reveals that the RL update process is governed by two competing forces: sampled rollouts and unsampled space.
To provide an analogy, think of it like exploring a vast, foggy landscape to find the highest peak. The paths you actually walk (sampled rollouts) provide reliable, positive feedback, helping you gain altitude. Yet the infinite number of paths you don’t take (the unsampled space) create uncertainty and noise. This noise acts like a gravitational pull, dragging you back down the hill. When you only send out a few scouts (N=16 in ProRL), their reports are noisy, and this downward pull can be strong enough to halt your ascent, leaving you stuck on a plateau.
The BroRL solution is simple but powerful: send out an entire army of scouts (N=512). By mapping a huge portion of the landscape, the random noise from the unexplored fog averages out, while the “upward signal” from all the successful paths becomes overwhelmingly strong.
In our formal analysis, this means the net change in the model’s performance becomes positive (ΔP > 0) when N is large, where ΔP denotes the expected one-step change in performance. This provides a stable, high-quality learning signal that allows the model to climb past the plateau.
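As a rough intuition only (a toy Monte Carlo simulation, not the paper’s formal derivation), you can model each rollout’s contribution as a small true improvement buried in noise and ask how often the averaged update actually points uphill as N grows. The signal and noise magnitudes below are arbitrary assumptions chosen for illustration.

```python
# Toy illustration: the probability that a sampled update is net-positive grows with N.
import numpy as np

rng = np.random.default_rng(0)
TRUE_SIGNAL = 0.1  # assumed small true per-update improvement
NOISE_STD = 1.0    # assumed noise contributed by the unsampled space

def prob_positive_update(n_rollouts: int, trials: int = 10_000) -> float:
    """Fraction of simulated updates whose rollout-averaged estimate is positive."""
    samples = rng.normal(TRUE_SIGNAL, NOISE_STD, size=(trials, n_rollouts))
    return float((samples.mean(axis=1) > 0).mean())

for n in (16, 128, 512):
    print(f"N={n:4d}: P(net change > 0) ≈ {prob_positive_update(n):.2f}")
# With N=16, roughly a third of simulated updates point the wrong way;
# at N=512 the averaged signal dominates the noise and almost all updates are positive.
```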
Breaking through the RL performance plateau
We applied the BroRL recipe to a strong ProRLv2 model that had already plateaued after 3,000 training steps. The results were definitive.
Figure 1 tells a powerful story. While continuing with the ProRL recipe (blue line) leads to stagnation and eventual degradation, BroRL (orange line) revives the model, enabling robust and continuous performance gains that break through the previous ceiling.

BroRL comprehensive results
We continued training the 3,000-step ProRLv2 checkpoint using both the original recipe (N=16) and the new BroRL recipe (N=512) using 64 NVIDIA H100 GPUs. The divergence was clear: ProRL stagnated, while BroRL delivered steady, significant gains in less time.
| Method | N | RL steps | Added time (h) | Math score | Code score | Reasoning Gym score |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 16 | 2,000 | – | 60.14 | 51.43 | 59.06 |
| Baseline | 16 | 3,000 | – | 61.69 | 52.00 | 61.29 |
| ProRL | 16 | 3,000+225 | +56.3 | 62.08 | 52.26 | 62.10 |
| ProRL | 16 | 3,000+535 | +133.8 | 62.02 (stagnated) | 52.74 | 61.45 (degraded) |
| BroRL | 512 | 3,000+107 | +98.1 | 62.62 | 53.31 | 62.71 |
| BroRL | 512 | 3,000+134 | +122.8 | 62.85 | 53.48 | 62.82 |
| BroRL | 512 | 3,000+419 | +393.9 | 63.66 | 56.64 | 63.40 |
After just 98.1 hours, BroRL had already decisively surpassed the final performance of the ProRL method across all metrics, doing so in approximately 35 fewer hours. This confirms that scaling rollout size is a more effective and computationally efficient strategy for pushing the boundaries of a saturated model.
BroRL sets a state-of-the-art for 1.5B reasoning models, achieving the highest scores in Math (63.66), Code (56.64), and Reasoning Gym (63.40) benchmarks.
Superior compute efficiency
BroRL isn’t just better—it’s faster and smarter with its compute.
- Algorithmic efficiency: Large-N rollouts produce a more diverse set of candidate samples. The pass rate for dynamic sampling, which filters out uninformative trajectories, jumped from 41% to 62%, meaning less computation was wasted (a sketch of this filter follows the table below).
- Hardware efficiency: BroRL shifts the generation process from being memory-bound to compute-bound and improves the prefix cache hit rate. Consequently, the GPU can fully utilize its parallel processing power, nearly doubling the throughput from 36.5 to 72.4 samples/s in our hardware setup.
| Method (N) | Dynamic sampling pass rate | Generation throughput (samples/s) |
| --- | --- | --- |
| ProRL (16) | 41% | 36.5 |
| BroRL (512) | 62% | 72.4 |
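For illustration, here is a hedged sketch of the dynamic-sampling filter referenced above. The exact criterion in the actual training stack may differ; the assumption here is the common one that a prompt’s rollout group is uninformative when every rollout receives an identical reward, since that yields zero group-relative advantage.

```python
# Hedged sketch of a dynamic-sampling filter (the real criterion may differ).
import numpy as np

def is_informative(rewards: np.ndarray) -> bool:
    """Keep a prompt's rollout group only if its rewards are not all identical."""
    return bool(rewards.min() != rewards.max())

def dynamic_sampling_pass_rate(reward_groups: list[np.ndarray]) -> float:
    """Fraction of prompt groups that survive the filter (the 41% vs. 62% above)."""
    kept = sum(is_informative(group) for group in reward_groups)
    return kept / max(len(reward_groups), 1)
```

Intuitively, with 512 rollouts per prompt it is far more likely that at least one rollout succeeds and at least one fails, which is consistent with the higher pass rate reported above.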
Greater token efficiency
BroRL delivers higher accuracy with fewer output tokens on both Math and Code benchmarks, indicating better score-per-token efficiency and tighter, less redundant reasoning.
Large-N rollout exploration (N=512) surfaces many concise, high-yield trajectories per prompt, which both raises the chance of sampling compact correct chains and reduces reliance on verbose, low-signal reasoning. This decouples quality from response length, whereas step scaling typically inflates token counts (a quick score-per-token calculation follows the table below).
| Task | ProRL score | BroRL score | Score diff | ProRL tokens | BroRL tokens | Token diff |
| --- | --- | --- | --- | --- | --- | --- |
| Math | 62.02 | 63.66 | +1.64 | 16,506 | 15,760 | -745 |
| Code | 52.74 | 56.64 | +3.90 | 26,808 | 26,090 | -717 |
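As a quick back-of-the-envelope check (not a metric defined in the paper), you can turn the Math row above into a score-per-token figure:

```python
# Score points per 1K output tokens, computed from the Math row of the table above.
prorl = {"score": 62.02, "tokens": 16_506}
brorl = {"score": 63.66, "tokens": 15_760}

for name, result in (("ProRL", prorl), ("BroRL", brorl)):
    per_1k = result["score"] / (result["tokens"] / 1000)
    print(f"{name}: {per_1k:.2f} score points per 1K output tokens")
# ProRL: ~3.76, BroRL: ~4.04 -> BroRL scores higher while emitting fewer tokens.
```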
Get started with BroRL
Our findings establish rollout size not just as a hyperparameter, but as a critical and efficient axis for scaling reinforcement learning. The performance plateaus encountered by step-scaling methods are not fundamental limits of RL but artifacts of insufficient exploration. Key insights and takeaways include:
- Rollout scaling is a new, crucial scaling dimension for RL. It provides a stable learning signal where depth-scaling alone fails.
- Performance plateaus are not dead ends. They can be overcome by scaling rollouts to generate higher-quality policy updates.
- BroRL is more computationally efficient, doubling hardware throughput and improving algorithmic sample efficiency.
- BroRL is more token efficient, achieving more with less.
- The new BroRL-trained checkpoint sets a state-of-the-art for 1.5B reasoning models.
For those looking to maximize the potential of their models with RL, BroRL provides a principled path forward: when you hit a wall, don’t just push forward—go wider.
To get started, explore and evaluate the BroRL model, available through Hugging Face.
Acknowledgments
Thank you to Yejin Choi, Fang Wu, Zaid Harchaoui, Pavlo Molchanov, Jan Kautz, and Jun Yang for their contributions to this post.