Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step time can add up to days of training and substantial compute costs. Numerical precision is one of the highest-leverage knobs available, but low- bit mixed-precision pretraining is hard to get right.
To address this, the NVFP4 training recipe in TransformerEngine uses subbyte precision for JAX pretraining. For an end-to-end example, see the recipe in MaxText, a high-performance, scalable LLM framework library. The result is high-throughput, 4-bit mixed-precision pre-training on NVIDIA Blackwell with no measurable accuracy loss compared to the FP8 baseline.
This post explains the NVFP4 format and how it’s built to achieve high performance and accuracy at ultra-low precision. It also shows how to apply a MaxText NVFP4 pretraining recipe and collect performance data showing performance gains. For methodology details, see the NVFP4 pretraining paper.
NVFP4 format and benefits
This NVFP4 introductory post explains its format and how two-level microscaling encodes higher signals with less error than other microscaling formats. It also explains how native hardware support of NVFP4 on the NVIDIA GB300 Grace Blackwell Ultra Superchip delivers 7x GEMM throughput compared to native FP8 precision on the NVIDIA Hopper. That higher throughput, along with the NVFP4 pretraining recipe, shortens training step time with negligible accuracy loss. This enables AI factories to train more and larger models within the same time budget, or train models faster with a shorter time budget.
NVFP4 pretraining recipe
The NVFP4 recipe combines several ingredients that together preserve convergence while unlocking NVIDIA Blackwell and the NVIDIA Rubin platform NVFP4 throughput. To enable efficient narrow-precision training, the pretraining recipe uses several key techniques that have been chosen based on their performance and accuracy.
Five key ingredients work together while maintaining the accuracy required in 4-bit pretraining:
- Micro block scaling uses 16-element blocks, half the size of MXFP4‘s 32-element blocks, so a single outlier has less influence on the shared scale.
- E4M3 block scale factors uses mantissa bits instead of MXFP4’s power-of-two E8M0 scaling, layered under a per-tensor FP32 scale. In an 8B-parameter, 1T token experiment, MXFP4 requires ~36% more tokens to match NVFP4’s final loss.
- Random Hadamard Transform applies only to WGRAD GEMM inputs to Gaussianize outliers. The recipe skips on FPROP and DGRAD because transforming those paths would also require transforming the weight, breaking 2D-scale consistency.
- 2D weight scaling uses one FP8 scale per 16×16 weight block, so FPROP and its transposed DGRAD use the same scale. Activation and gradients keep lower-overhead 1×16 scaling.
- Stochastic rounding uses unbiased rounding to keep tiny updates from being crushed to zero. Weights and activations stay on round-to-nearest-even, where SR would amplify error instead. Both modes are native to Blackwell FP4 conversion instructions.
Figure 1 shows the NVFP4 data flow inside one linear layer.
The three GEMMs: FPROP (forward), DGRAD (activation gradient), and WGRAD (weight gradient) are quantized to NVFP4 only for the MLP (feed-forward) layers of the transformer; the GEMMs inside the attention block (QKV projection, attention output projection, and the score/context matmuls) remain in higher precision.
NVFP4 is applied first to MLP layers because attention’s softmax exponentially amplifies quantization noise on QK^T scores. Attention activations also carry concentrated outliers that 4-bit precision can’t represent well. Because MLPs account for most training FLOPs, this captures most of the speedup without risking convergence.

All three MLP GEMMs consume NVFP4 inputs and emit BF16 outputs, which are eventually folded into an FP32 master weight at the optimizer step. The same path makes the recipe’s convergence‑preserving choices visible: 2D block quantization on the weights (consistent FPROP/DGRAD values across the transpose), a Random Hadamard Transform on the WGRAD inputs (flattens outliers before 4-bit quantization), and stochastic rounding on the gradient quantizers (keeps small updates unbiased).
Enabling NVFP4 in MaxText
The MaxText NVFP4 recipe is available in the JAX-Toolbox GitHub repository. The launch script trains Llama 3 8B with NVFP4 on Blackwell. To enable it, set the quantization flag in MaxText to switch into the NVFP4 path. Two modes are exposed:
- quantization=te_nvfp4: NVFP4 with Random Hadamard Transform. Recommended when the convergence under te_nvfp4_no_rht is not satisfactory.
- quantization=te_nvfp4_no_rht: NVFP4 without RHT. Lowest overhead, but may degrade convergence quality.
Run the example script from the MaxText repository root inside a container that has JAX, NVIDIA Transformer Engine, and the required NVIDIA CUDA/cuDNN libraries installed. The public NVIDIA MaxText container ghcr.io/nvidia/jax:maxtext is recommended.
The following is a partial example of the Llama3 8B MaxText NVFP4 training script, which declares the nvfp4 argument through Transformer Engine:
RUN_SETTINGS="-m maxtext.trainers.pre_train.train maxtext/configs/base.yml run_name=debug_run base_output_directory=./debug_logs hardware=gpu dataset_type=synthetic model_name=llama3-8b remat_policy='minimal_with_context_and_quantization' scan_layers=False attention='cudnn_flash_te' steps=50 dtype=bfloat16 max_target_length=8192 per_device_batch_size=4 ici_data_parallelism=${ici_DP} dcn_data_parallelism=${dcn_DP} ici_fsdp_parallelism=${ici_FSDP} dcn_fsdp_parallelism=${dcn_FSDP} profiler=nsys enable_checkpointing=false override_model_config=True gradient_accumulation_steps=1 quantization=te_nvfp4_no_rht max_segments_per_seq=32"
After launch, MaxText prints step time, TFLOP/s/device, and tokens/s/device. An NVIDIA Nsight Systems trace is written to base_output_directory for inspection. To produce the FP8 baseline used in the comparison below, run the same script with quantization=te_fp8_delayedscaling.
Performance results
The benchmark uses MaxText pre-training on Llama 3 8B with FSDP=4 with a sequence length of 8,192, a per-device batch size of 4, and 50 steps inside the public ghcr.io/nvidia/jax:maxtext container.
Table 1 summarizes MaxText pretraining performance on the NVIDIA GB200 Grace Blackwell Superchip and NVIDIA GB300 Grace Blackwell Ultra Superchip for Llama 3 8B and Llama 3.1 405B, comparing the NVFP4 recipe against an FP8 baseline on the same hardware, parallelism, and global batch size. Numbers are measured at sequence length 8,192.
| Model | Hardware | # GPUs | FSDP | MBS | GBS | Seq len |
| Llama3 8B | GB200 | 4 | 4 | 4 | 16 | 8,192 |
| Llama3 8B | GB300 | 4 | 4 | 4 | 16 | 8,192 |
| Llama 3.1 405B | GB200 | 128 | 128 | 1 | 128 | 8,192 |
| Llama 3.1 405B | GB300 | 128 | 128 | 1 | 128 | 8,192 |
| Model | Hardware | FP8 Per GPU TFLOPs | NVFP4 Per GPU TFLOPs | Speedup vs FP8 |
| Llama 3 8B | GB200 | 1497 | 2017 | 1.35× |
| Llama 3 8B | GB300 | 1759 | 2301 | 1.31× |
| Llama 3.1 405B | GB200 | 1557 | 2241 | 1.44× |
| Llama 3.1 405B | GB300 | 2103 | 3633 | 1.73× |
Figure 2 shows per‑GPU sustained TFLOP/s across the four baseline configurations. NVFP4 delivers an additional 500–700 TF/s per GPU on every configuration. The 1.31–1.73x speedup over the FP8 baseline is from changing the GEMM precision while holding the model, hyperparameters, parallelism, and global batch size identical.
The largest relative gains are with the 405B configurations (1.44x on GB200, 1.73x on GB300), where the per‑step GEMM mass dominates FSDP collective overhead and a precision‑level speedup translates directly into wall‑clock savings.

Figure 3 overlays Llama 3 8B training loss for the FP8 baseline and NVFP4 across 10,000 pretraining steps with otherwise identical hyperparameters. Both runs descend the same curve from ≈12.2 nats to ≈3.9 nats, with a converged‑regime mean gap of just +0.026 nats, well inside step‑to‑step noise. The NVFP4 speedups in Figure 2 come with no measurable accuracy cost.

Get started
Pull the MaxText container, run nvfp4_example.sh on Blackwell to get started.
Acknowledgments
For their contributions to NVFP4 enablement in JAX, XLA, and TE, special thanks to Jaroslav Sevcik, Ilia Sergachev, Johannes Reifferscheid, Phuong Nguyen, and Jeremy Berchtold.