As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy Optimization (GRPO) power this transition, enabling reasoning-grade models to continuously improve through iterative feedback. Unlike standard supervised fine-tuning, RL training loops split into two distinct, high-intensity phases: a generation phase with a stringent latency requirement and a training phase requiring high throughput.
To make these workloads viable, researchers and engineers are turning to low-precision datatypes like FP8 to boost performance in training and throughput-oriented generation. Moreover, in some scenarios where generation is bound by GPU memory bandwidth, using low-precision parameters can improve performance due to fewer bytes per parameter.
This post dives deep into the systemic challenges of low-precision RL and how NVIDIA NeMo RL—an open source library within the NVIDIA NeMo framework—speeds up RL workloads while maintaining accuracy.
FP8 for linear layers in RL
Our recipe uses the block-wise quantized FP8 introduced by the DeepSeek-V3 Technical Report. Table 1 gives the details of tensor formats in linear projection layers.
| Tensor | Type of data | Quantization granularity | Scaling factor | Type of scaling |
| --- | --- | --- | --- | --- |
| Weights | FP8 (E4M3) | [128, 128] | FP32 | Block-wise |
| Input activations | FP8 (E4M3) | [1, 128] | FP32 | Block-wise |
| Output gradients | FP8 (E4M3) | [1, 128] | FP32 | Block-wise |
With this recipe, linear layers can be computed with FP8 math, which has 2x peak throughput versus BF16 math. Other modules, including attention, normalization, non-linear functions, and output projections, are computed with BF16 math.
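As an illustration, the block-wise scheme above can be sketched in NumPy: one FP32 scale per block, chosen so each block's absolute maximum maps onto the E4M3 range (maximum magnitude 448). This is a simplified sketch of the scaling logic only; it does not model the actual E4M3 mantissa rounding or the CUDA kernels that perform the cast, and it assumes the matrix dimensions divide evenly by the block size.

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def quantize_blockwise(w, block=(128, 128)):
    """Simulate block-wise FP8 quantization: one FP32 scale per block,
    sized so the block's absolute maximum maps to the E4M3 range."""
    rows, cols = w.shape
    scales = np.zeros((rows // block[0], cols // block[1]), dtype=np.float32)
    q = np.zeros_like(w)  # stands in for the FP8 payload
    for i in range(0, rows, block[0]):
        for j in range(0, cols, block[1]):
            blk = w[i:i + block[0], j:j + block[1]]
            scale = max(float(np.abs(blk).max()), 1e-12) / E4M3_MAX
            scales[i // block[0], j // block[1]] = scale
            # round-to-nearest onto the quantized grid (real kernels cast to FP8 here)
            q[i:i + block[0], j:j + block[1]] = np.round(blk / scale)
    return q, scales

def dequantize_blockwise(q, scales, block=(128, 128)):
    """Rescale each quantized block back to the original value range."""
    out = np.empty_like(q)
    for i in range(0, q.shape[0], block[0]):
        for j in range(0, q.shape[1], block[1]):
            out[i:i + block[0], j:j + block[1]] = (
                q[i:i + block[0], j:j + block[1]] * scales[i // block[0], j // block[1]]
            )
    return out
```

The [1, 128] granularity for activations and gradients follows the same idea with one scale per 128-element row segment.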
The challenge of numerical disagreement in RL
RL pipelines typically use separate engines: vLLM for rollouts and NVIDIA Megatron Core for training. Each uses unique custom NVIDIA CUDA kernels to maximize performance. This inherently introduces numerical differences that cumulatively magnify in lower precision due to additional quantization and dequantization logic. We quantify this numeric difference as a token multiplicative probability error:
\(\texttt{token-mult-prob-error} = \frac{1}{n}\sum_{i=1}^{n}\exp\left(\left| \texttt{logprobs-train-fwk}_i - \texttt{logprobs-inference-fwk}_i \right|\right)\)

where \(n\) is the number of tokens.
Perfect alignment yields a score of exactly 1; without any additional mitigation techniques, we typically consider values below 1.03-1.05 acceptable.
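The metric is straightforward to compute from the per-token log-probs reported by the two engines for the same sampled tokens; a minimal sketch:

```python
import numpy as np

def token_mult_prob_error(logprobs_train, logprobs_inference):
    """Mean per-token multiplicative probability error between the training
    and inference engines' log-probs for the same sampled tokens.
    Perfect agreement gives exactly 1.0; larger values mean more mismatch."""
    lp_train = np.asarray(logprobs_train, dtype=np.float64)
    lp_infer = np.asarray(logprobs_inference, dtype=np.float64)
    return float(np.mean(np.exp(np.abs(lp_train - lp_infer))))
```

For example, a uniform 5% log-prob discrepancy on every token yields exp(0.05) ≈ 1.051, just outside the typical acceptable range.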
End-to-end FP8 in linear layers reduces numerical disagreement
During the development of the FP8 recipe, we experimented with three recipes:
- Baseline recipe: BF16 for both generation and training.
- Recipe candidate 1: FP8 applied exclusively during generation, while policy model training is conducted in BF16.
- Final recipe: end-to-end FP8 in both the generation and training engines.
We observe that compared to recipe candidate 1 with FP8 only for generation, the final recipe consistently shows a lower numerical disagreement between generation and training. Note that the baseline recipe always gives the lowest numerical disagreement. Figure 1 shows the token multiplicative probability error metric of the three recipes.

Mitigating numerical disagreement with importance sampling
Importance sampling corrects the distribution mismatch between the model (i.e., distribution) that generates the data and the model (i.e., distribution) that is being trained. It takes the form of a per-token weight multiplied into the loss. Refer to our GRPO documentation for the detailed theoretical background of importance sampling.
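Conceptually, the correction is just a per-token probability ratio between the training policy (which computes gradients) and the rollout policy (which generated the tokens), multiplied into the loss. A minimal sketch, illustrative rather than NeMo RL's exact GRPO loss:

```python
import numpy as np

def importance_weighted_loss(per_token_loss, logprobs_train, logprobs_rollout):
    """Reweight each token's loss by the probability ratio
    p_train(token) / p_rollout(token), computed from log-probs,
    to correct for the train/rollout distribution mismatch."""
    ratio = np.exp(np.asarray(logprobs_train) - np.asarray(logprobs_rollout))
    return float(np.mean(ratio * np.asarray(per_token_loss)))
```

When the two engines agree exactly, every ratio is 1 and the loss is unchanged; any residual numerical disagreement shows up as ratios deviating from 1.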
Experiments show that:
- For recipe candidate 1 (FP8 generation and BF16 training), importance sampling can narrow the accuracy gap from BF16 RL, but can’t close the gap.
- For the final recipe (end-to-end FP8), importance sampling completely closes the gap from BF16 training. Figure 2 shows the validation accuracy during training for different recipes.

Results for end-to-end FP8 in linear layers
We evaluate the end-to-end FP8 recipe on both dense and mixture-of-experts models, measuring validation accuracy and training throughput against the BF16 baseline.
FP8 end-to-end on dense models: Llama 3.1 8B Instruct
Table 2 shows the accuracy of the FP8 end-to-end recipe and the BF16 recipe for GRPO training of the Llama 3.1 8B Instruct model on a math dataset, trained for 4,000 steps.
| Precision | BF16 | FP8 generation only | FP8 end-to-end |
| --- | --- | --- | --- |
| Validation accuracy | 0.616 | 0.586 | 0.613 |
In terms of speedup, the FP8 recipe achieves a consistent >15% throughput improvement over BF16. Figure 3 shows GRPO training throughput (tokens per second per GPU) for the two recipes over 1,000 steps.

Although the theoretical speedup of FP8 math over BF16 is 2x, the end-to-end gain is lower in practice because only the linear layers benefit from the faster math throughput; attention and elementwise layers are unchanged, and the quantization kernels inserted before each linear layer add overhead. The observed 15%-25% speedup matches our standalone vLLM benchmarks. With further optimizations, such as fusing the quantization kernels in vLLM, we project that the speedup can be further improved to 1.25x.
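A rough Amdahl-style estimate makes this concrete. The linear-layer time fraction and overhead values used below are illustrative assumptions, not measurements from our runs:

```python
def fp8_speedup(linear_fraction, quant_overhead=0.0):
    """Amdahl-style estimate of end-to-end FP8 speedup: only the
    linear-layer fraction of step time runs at 2x; attention and
    elementwise work are unchanged, and quantization kernels add
    overhead expressed as a fraction of the original step time."""
    new_time = (1.0 - linear_fraction) + linear_fraction / 2.0 + quant_overhead
    return 1.0 / new_time
```

For example, if linear layers account for half the step time and quantization adds 5% overhead, the estimate is 1 / (0.5 + 0.25 + 0.05) = 1.25x, consistent with the upper end of the observed range.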
FP8 end-to-end on MoE models: Qwen3-30B
Similar experiments were run on mixture-of-experts (MoE) models, with results for Qwen3-30B showing matching accuracy curves: FP8 achieves similar accuracy to BF16. The speed gain is still being investigated.

Extending FP8 for KV cache and attention
In a transformer model, linear layers are not the only bottleneck. KV cache growth and attention computation often dominate end-to-end rollout time in RL workflows with long output sequence lengths (OSL), while also saturating memory bandwidth and slowing down token generation. This motivated us to explore FP8 for the KV cache and attention in the RL loop, using per-tensor scaling FP8.
Implementing FP8 for KV-cache in an RL setting is uniquely challenging because policy weights change at every step. Unlike static inference, where calibration happens once, RL requires dynamic handling of quantization scales.
NeMo RL adopts the following approach to solve this:
- Recalibration: At the end of each training step, the trainer recalibrates the Query, Key, Value (QKV) scales using the updated policy weights.
- Data selection: This calibration is performed using the training data (prompts and generated responses) to ensure the scales reflect the current distribution.
- Synchronization: The newly calculated scales are then synchronized to the inference engine (vLLM) for the subsequent rollout phase.

This design ensures that the rollout engine always uses quantization scales derived from the latest policy state, minimizing accuracy degradation. The calibration overhead is minimal, consuming approximately 2-3% of the total step time.
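The recalibration step can be sketched as follows. This is a schematic of the approach, not NeMo RL's actual implementation; the synchronization of the resulting scales to vLLM is assumed to happen in a separate step.

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def recalibrate_qkv_scales(calibration_batches):
    """Recompute per-tensor FP8 scales for the Q, K, V activations from
    the latest training data, so the rollout engine's KV-cache quantization
    tracks the current policy's activation distribution.

    calibration_batches: iterable of dicts mapping 'q'/'k'/'v' to
    activation arrays captured during the training step.
    Returns one FP32 scale per tensor (running amax / E4M3_MAX)."""
    amax = {"q": 0.0, "k": 0.0, "v": 0.0}
    for batch in calibration_batches:
        for name in amax:
            amax[name] = max(amax[name], float(np.abs(batch[name]).max()))
    return {name: a / E4M3_MAX for name, a in amax.items()}
```

Because the scales are recomputed from the same prompts and responses used for training, they reflect the post-update policy's activation range rather than a stale one-time calibration.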
| Tensor | Type of data | Scaling factor | Type of scaling |
| --- | --- | --- | --- |
| QKV attention activations | FP8 (E4M3) | FP32 | Tensor-wise |
| Stored KV cache | FP8 (E4M3) | FP32 | Tensor-wise |
Summary of results for FP8 on KV cache and attention
We evaluated the Qwen3-8B-Base model with the GRPO algorithm, applying FP8 in rollout and BF16 in training. The mismatch KL divergence is slightly higher when quantizing both the KV cache and attention, due to compounded quantization errors, but our recipe mitigates the resulting instability. With token-level truncated importance sampling enabled, FP8 applied to linear layers, KV cache, and attention matches the validation accuracy of both the BF16 baseline and the linear-only FP8 (W8A8) configuration.
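Token-level truncated importance sampling simply clips the per-token probability ratio from above, so occasional large train/rollout mismatches (e.g., from compounded FP8 quantization noise) cannot blow up the gradient. A minimal sketch; the cap value here is an illustrative assumption:

```python
import numpy as np

def truncated_is_weights(logprobs_train, logprobs_rollout, cap=2.0):
    """Token-level truncated importance sampling weights: the per-token
    probability ratio p_train / p_rollout, clipped from above at `cap`
    to bound the variance contributed by any single mismatched token."""
    ratio = np.exp(np.asarray(logprobs_train) - np.asarray(logprobs_rollout))
    return np.minimum(ratio, cap)
```

Tokens where the two engines agree keep a weight near 1; only outlier tokens are truncated.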

Enabling FP8 for both KV-cache and attention operations yields an additional ~30% speedup in the rollout stage over the linear W8A8 configuration, resulting in an overall ~48% speedup compared to the BF16 baseline. These gains are particularly pronounced at longer response lengths, where attention computation constitutes a larger fraction of the overall workload, and the ~2-3% QKV recalibration overhead noted earlier is a minor cost by comparison.

Try End-to-End FP8 with NVIDIA NeMo RL
To enable FP8 for linear layers in both the generation and training backends, set the corresponding options in the policy configuration; each tuning parameter is passed through to the respective backend.

To enable FP8 for the KV cache and attention, set the kv_cache_dtype parameter in the policy's vllm_cfg; NeMo RL then automatically handles QKV scale recalibration on the trainer side and synchronization with the vLLM backend.
```yaml
policy:
  generation:
    vllm_cfg:
      precision: fp8          # Enable FP8 for linear layers
      kv_cache_dtype: fp8     # Enable FP8 for the KV cache
```
Advanced FP8 configuration options for generation and training
So far, we have introduced the implementation of FP8 for linear layers and KV cache + attention layers. Advanced users can experiment with variants of the recipe. The following are examples of some of the features:
- Keeping the first N and/or last M transformer layers in BF16 during generation (N, M are integers):
```yaml
policy:
  generation:
    vllm_cfg:
      num_first_layers_in_bf16: N  # replace N with an integer
      num_last_layers_in_bf16: M   # replace M with an integer
```
- Configuring generation and/or training to use power-of-2 scaling factors instead of FP32 scaling factors:
```yaml
policy:
  generation:
    vllm_cfg:
      pow2_weight_scaling_factors: true
      pow2_activation_scaling_factors: true
  megatron_cfg:
    env_vars:
      NVTE_FP8_BLOCK_SCALING_FP32_SCALES: "0"
```
- Using variants of the FP8 recipes predefined for the Megatron Core backend, instead of the default block-wise quantized FP8 recipe shown in Table 1 (refer to the argument docstring for details):
```yaml
policy:
  megatron_cfg:
    fp8_cfg:
      fp8: "e4m3"
      fp8_recipe: "blockwise"
```
Get started
Users can start by referring to the llama-3.1-8b and moonlight-16b recipes in the NeMo RL GitHub.
Acknowledgements
This work was a collaborative effort across teams. We’d like to thank Jimmy Zhang, Victor Cui, Zhiyu Li, and Lark Zhang for their work on the FP8 recipe development, experimentation, and integration into NeMo RL.