Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy Optimization (GRPO) power this transition, enabling reasoning-grade models to continuously improve through iterative feedback. Unlike standard supervised fine-tuning, RL training loops are bifurcated into two distinct, high-intensity phases: a generation phase with a stringent latency requirement and a training phase requiring high throughput.

To make these workloads viable, researchers and engineers are turning to low-precision datatypes like FP8 to boost performance in training and throughput-oriented generation. Moreover, in some scenarios where generation is bound by GPU memory bandwidth, using low-precision parameters can improve performance due to fewer bytes per parameter. 

This post dives deep into the systemic challenges of low-precision RL and how NVIDIA NeMo RL—an open source library within the NVIDIA NeMo framework—speeds up RL workloads while maintaining accuracy.

FP8 for linear layers in RL

Our recipe uses the block-wise quantized FP8 introduced by the DeepSeek-V3 Technical Report. Table 1 gives the details of tensor formats in linear projection layers.

| Tensor | Type of data | Quantization granularity | Scaling factor | Type of scaling |
|---|---|---|---|---|
| Weights | FP8 (E4M3) | [128, 128] | FP32 | Block-wise |
| Input activations | FP8 (E4M3) | [1, 128] | FP32 | Block-wise |
| Output gradients | FP8 (E4M3) | [1, 128] | FP32 | Block-wise |

Table 1. Tensor formats in linear projection layers

With this recipe, linear layers can be computed with FP8 math, which has 2x peak throughput versus BF16 math. Other modules, including ‌attention, normalization, non-linear functions, and output projections, are computed with BF16 math.
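The block-wise scheme in Table 1 can be sketched as follows. This is an illustrative NumPy simulation, not the production kernel: it computes one FP32 scale per [128, 128] weight tile and scales values into the E4M3 dynamic range, but omits the actual 8-bit rounding that real FP8 kernels perform.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


def quantize_blockwise(w, block=128):
    """Simulate block-wise FP8 quantization of a weight matrix:
    one FP32 scale per [block, block] tile, values scaled so the
    tile's amax maps to the E4M3 maximum. (Sketch only; the 8-bit
    rounding step of real FP8 kernels is omitted.)"""
    rows, cols = w.shape
    q = np.empty_like(w, dtype=np.float32)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = max(np.abs(tile).max() / FP8_E4M3_MAX, 1e-12)  # FP32 block scale
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = np.clip(tile / s, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales


def dequantize_blockwise(q, scales, block=128):
    """Recover approximate FP32 weights from quantized tiles and block scales."""
    return q * np.repeat(np.repeat(scales, block, axis=0), block, axis=1)
```

Activations and output gradients follow the same idea with [1, 128] blocks, i.e., one scale per 128-element row segment.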

The challenge of numerical disagreement in RL

RL pipelines typically use separate engines: vLLM for rollouts and NVIDIA Megatron Core for training. Each uses unique custom NVIDIA CUDA kernels to maximize performance. This inherently introduces numerical differences that cumulatively magnify in lower precision due to additional quantization and dequantization logic. We quantify this numeric difference as a token multiplicative probability error:

\(\texttt{token-mult-prob-error} = \frac{1}{n}\sum_{i=1}^{n}\exp\left(\left| \texttt{logprobs-train-fwk}_i - \texttt{logprobs-inference-fwk}_i \right|\right)\), where \(n\) is the number of tokens.

Perfect alignment yields a score of 1, and we typically consider values below 1.03-1.05 acceptable when not using any additional mitigation techniques.
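As a concrete reference, the metric can be computed directly from the per-token log-probabilities reported by each engine. This is a minimal Python sketch; the function and argument names are ours, not the NeMo RL API.

```python
import math


def token_mult_prob_error(logprobs_train, logprobs_inference):
    """Mean per-token multiplicative probability error between the
    training and inference engines' log-probs for the same sampled
    tokens. Perfect agreement returns 1.0; values grow as the two
    engines' numerics diverge."""
    assert len(logprobs_train) == len(logprobs_inference)
    n = len(logprobs_train)
    return sum(
        math.exp(abs(t - i))
        for t, i in zip(logprobs_train, logprobs_inference)
    ) / n
```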

End-to-end FP8 in linear layers reduces numerical disagreement

During the development of the FP8 recipe, we experimented with three recipes:

  • Baseline recipe: BF16 for both generation and training.
  • Recipe candidate 1: FP8 applied exclusively during generation, while policy model training is conducted in BF16.
  • Final recipe: end-to-end FP8, using FP8 in both the generation and training engines.

We observe that compared to recipe candidate 1 with FP8 only for generation, the final recipe consistently shows a lower numerical disagreement between generation and training. Note that the baseline recipe always gives the lowest numerical disagreement. Figure 1 shows the token multiplicative probability error metric of the three recipes. 

Mitigating numerical disagreement with importance sampling

Importance sampling corrects the mismatch between the distribution that generates the data and the distribution that is being trained: each token's loss is multiplied by a per-token importance weight. Refer to our GRPO documentation for the detailed theoretical background of importance sampling.
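In code, the per-token correction is the ratio of the two engines' token probabilities applied as a weight on the loss. The sketch below is illustrative only: names are ours, and the real GRPO loss includes clipping and advantage terms omitted here.

```python
import math


def importance_weighted_loss(per_token_loss, logp_train, logp_gen):
    """Weight each token's loss by the importance ratio
    pi_train(token) / pi_gen(token), correcting for the mismatch
    between the generating and training distributions. (Sketch;
    NeMo RL's GRPO loss applies the same per-token ratio idea
    with additional clipping and advantage terms.)"""
    weighted = []
    for loss, lt, lg in zip(per_token_loss, logp_train, logp_gen):
        ratio = math.exp(lt - lg)  # importance weight for this token
        weighted.append(ratio * loss)
    return sum(weighted) / len(weighted)
```

When the two distributions agree exactly, every ratio is 1 and the loss is unchanged, which is why the correction is "free" in the well-aligned case.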

Experiments show that:

  • For recipe candidate 1 (FP8 generation and BF16 training), importance sampling narrows the accuracy gap relative to BF16 RL but cannot close it.
  • For the final recipe (end-to-end FP8), importance sampling completely closes the gap to BF16 training. Figure 2 shows the validation accuracy during training for the different recipes.

Results for end-to-end FP8 in linear layers

We evaluate the end-to-end FP8 recipe on both dense and mixture-of-experts models, measuring validation accuracy and training throughput against the BF16 baseline.

FP8 end-to-end on dense models: Llama 3.1 8B Instruct

Table 2 shows the validation accuracy of the FP8 end-to-end recipe and the BF16 recipe for GRPO training of the Llama 3.1 8B Instruct model on a math dataset, trained to 4,000 steps.

| Precision | BF16 | FP8 generation only | FP8 end-to-end |
|---|---|---|---|
| Validation accuracy | 0.616 | 0.586 | 0.613 |

Table 2. Validation accuracy of Llama 3.1 8B Instruct across different precision configurations

In terms of speedup, the FP8 recipe achieves a consistent >15% throughput improvement over BF16. Figure 3 shows the GRPO training throughput (tokens per second per GPU) of the two recipes over 1,000 steps.

Although the theoretical speedup of FP8 math over BF16 math is 2x, the realized speedup is lower: only the linear layers benefit from the faster math throughput, while attention and elementwise layers are unchanged, and the quantization kernels inserted before each linear layer add overhead. The observed 15%-25% speedup matches our standalone test of vLLM. With further optimizations, such as fusing the quantization kernels in vLLM, we project the speedup can improve to 1.25x.
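The gap between the theoretical 2x and the observed speedup follows from an Amdahl-style argument: only the linear-layer fraction of step time accelerates. The helper below makes that arithmetic concrete; the fractions plugged in are assumptions for illustration, not measured values from our runs.

```python
def projected_speedup(linear_fraction, gemm_speedup=2.0, quant_overhead=0.0):
    """Amdahl-style estimate of end-to-end speedup when only the
    linear-layer fraction of step time runs faster and quantization
    kernels add a fixed fractional overhead. (Illustrative model;
    actual fractions vary by architecture and sequence length.)"""
    return 1.0 / (
        (1.0 - linear_fraction)            # attention, norms, elementwise: unchanged
        + linear_fraction / gemm_speedup   # linear layers: 2x faster math
        + quant_overhead                   # extra quantization kernels
    )
```

For example, if linear layers account for roughly 40% of step time, the best-case end-to-end speedup with 2x GEMMs is about 1.25x, which is consistent with the magnitude of the observed gains.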

FP8 end-to-end on MoE models: Qwen3-30B

Similar experiments were run on mixture-of-experts (MoE) models; results for Qwen3-30B show accuracy curves matching BF16. The speed gain for MoE models is still being investigated.

Extending FP8 for KV cache and attention

In a transformer model, linear layers are not the only bottleneck. KV cache growth and attention computation often dominate end-to-end rollout time in RL workflows with long output sequence lengths (OSL), while also saturating memory bandwidth and slowing down token generation. This motivated us to explore FP8 for the KV cache and attention in the RL loop, using per-tensor-scaled FP8.

Implementing FP8 for KV-cache in an RL setting is uniquely challenging because policy weights change at every step. Unlike static inference, where calibration happens once, RL requires dynamic handling of quantization scales.

NeMo RL adopts the following approach to solve this:

  1. Recalibration: At the end of each training step, the trainer recalibrates the Query, Key, Value (QKV) scales using the updated policy weights.
  2. Data selection: This calibration is performed using the training data (prompts and generated responses) to ensure the scales reflect the current distribution.
  3. Synchronization: The newly calculated scales are then synchronized to the inference engine (vLLM) for the subsequent rollout phase.

This design ensures that the rollout engine always uses optimal quantization scales derived from the latest policy state, minimizing accuracy degradation. The calibration overhead is minimal, consuming approximately 2-3% of the total step time.
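Schematically, the recalibration step amounts to computing one FP32 amax-based scale per Q, K, and V tensor from a calibration batch and pushing it to the inference engine. The following is a hypothetical sketch of that flow; the function and engine method names are illustrative, not the NeMo RL or vLLM API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


def calibrate_qkv_scales(qkv_activations):
    """Per-tensor FP8 scales for Q, K, and V from a calibration batch:
    one FP32 scale per tensor, chosen so the observed amax maps to the
    E4M3 maximum. (Sketch of the recalibration idea; names are
    illustrative, not the NeMo RL implementation.)"""
    return {
        name: float(np.abs(act).max()) / FP8_E4M3_MAX
        for name, act in qkv_activations.items()
    }


# After each training step (steps 1-3 above, schematically):
#   acts = forward_on_training_batch(policy)       # hypothetical helper: collect Q/K/V activations
#   scales = calibrate_qkv_scales(acts)            # steps 1-2: recalibrate on training data
#   inference_engine.update_qkv_scales(scales)     # step 3: sync to vLLM (illustrative call)
```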

| Tensor | Type of data | Scaling factor | Type of scaling |
|---|---|---|---|
| QKV attention activations | FP8 (E4M3) | FP32 | Tensor-wise |
| Stored KV cache | FP8 (E4M3) | FP32 | Tensor-wise |

Table 3. Tensor formats for attention activations and stored KV cache

Summary of results for FP8 on KV cache and attention

We evaluated the Qwen3-8B-Base model with the GRPO algorithm, applying FP8 in rollout and BF16 in training. While the mismatch KL divergence is slightly higher when quantizing both the KV cache and attention, due to compounded quantization errors, our recipe mitigates the resulting instability. With token-level truncated importance sampling enabled, FP8 for linear layers plus KV cache and attention matches the validation accuracy of both the BF16 baseline and the FP8 linear-only (W8A8) recipe.
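Token-level truncated importance sampling caps the per-token probability ratio from above, so occasional large train/rollout mismatches (amplified by FP8 quantization of the KV cache and attention) cannot destabilize an update. A minimal sketch of the idea; the cap value here is illustrative, not our tuned setting.

```python
import math


def truncated_importance_weights(logp_train, logp_rollout, cap=2.0):
    """Token-level truncated importance sampling: compute the per-token
    ratio pi_train / pi_rollout and clip it from above at `cap`, so a
    few tokens with large numerical mismatch cannot dominate the
    gradient. (Sketch; the cap value is illustrative.)"""
    return [
        min(math.exp(lt - lr), cap)
        for lt, lr in zip(logp_train, logp_rollout)
    ]
```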

Enabling FP8 for both KV cache and attention operations yields an additional ~30% speedup in the rollout stage over the linear-only W8A8 configuration, for an overall ~48% speedup compared to the BF16 baseline. These gains are particularly pronounced at longer response lengths, where attention computation constitutes a larger fraction of the overall workload. The QKV scale recalibration consumes approximately 2-3% of total step time, a minor cost relative to the acceleration achieved.

Try End-to-End FP8 with NVIDIA NeMo RL

The following configuration enables FP8 for linear layers in both the generation and training backends. To additionally enable FP8 for the KV cache and attention, set the kv_cache_dtype parameter in vllm_cfg for the policy; NeMo RL then automatically handles QKV scale recalibration on the trainer side and synchronization with the vLLM backend.

policy:
  generation:
    vllm_cfg:
      precision: fp8       # Enable FP8 for linear layers
      kv_cache_dtype: fp8  # Enable FP8 for KV-cache

Advanced FP8 configuration options for generation and training

So far, we have introduced the implementation of FP8 for linear layers and KV cache + attention layers. Advanced users can experiment with variants of the recipe. The following are examples of some of the features:

  • Keeping the first N and/or last M transformer layers in BF16 during generation (N and M are integers):
policy:
  generation:
    vllm_cfg:
      num_first_layers_in_bf16: N # replace N with an integer
      num_last_layers_in_bf16: M  # replace M with an integer
  • Configuring generation and/or training to use power-of-2 scaling factors instead of FP32:
policy:
  generation:
    vllm_cfg:
      pow2_weight_scaling_factors: true
      pow2_activation_scaling_factors: true
  megatron_cfg:
    env_vars:
      NVTE_FP8_BLOCK_SCALING_FP32_SCALES: "0"
  • Using predefined variants of FP8 recipes for the Megatron Core backend instead of the default block-wise quantized FP8 recipe shown in Table 1. Refer to the argument docstring for details:
policy:
  megatron_cfg:
    fp8_cfg:
      fp8: "e4m3"
      fp8_recipe: "blockwise"

Get started

Users can start by referring to the llama-3.1-8b and moonlight-16b recipes in the NeMo RL GitHub repository.

Acknowledgements

This work was a collaborative effort across teams. We’d like to thank Jimmy Zhang, Victor Cui, Zhiyu Li, and Lark Zhang for their work on the FP8 recipe development, experimentation, and integration into NeMo RL. 