Scaling NVFP4 Inference for FLUX.2 on NVIDIA Blackwell Data Center GPUs

In 2025, NVIDIA partnered with Black Forest Labs (BFL) to optimize the FLUX.1 text-to-image model series, unlocking FP4 image generation performance on NVIDIA Blackwell GeForce RTX 50 Series GPUs.

As a natural extension of the latent diffusion model, FLUX.1 Kontext [dev] proved that in-context learning is a feasible technique for visual-generation models, not just large language models (LLMs). To make this experience more widely accessible, NVIDIA collaborated with BFL to enable a near real-time editing experience using low-precision quantization.

FLUX.2 is a significant leap forward, offering the public multi-image references and quality comparable to the best enterprise models. However, because FLUX.2 [dev] requires substantial compute resources, BFL, Comfy, and NVIDIA collaborated to achieve a major breakthrough: reducing the FLUX.2 [dev] memory requirement by more than 40% and enabling local deployment through ComfyUI. This optimization, using FP8 precision, has made FLUX.2 [dev] one of the most popular models in the image-generation space.

With FLUX.2 [dev] established as the gold standard for open-weight models, the NVIDIA team, in collaboration with BFL, is now excited to share the next leap in performance: 4-bit acceleration for FLUX.2 [dev] on the most powerful data center NVIDIA Blackwell GPUs, including NVIDIA DGX B200 and NVIDIA DGX B300.

This post walks through the various inference optimization techniques the team used to accelerate FLUX.2 [dev] on these NVIDIA data center architectures, including code snippets and steps to get started. The combined effect of these optimizations is a remarkable reduction in latency, enabling efficient deployment on data center GPUs.

Visual comparison between BF16 and NVFP4 with FLUX.2 [dev]

Before diving into the specifics, compare the output quality of FLUX.2 [dev] at the default BF16 precision with the remarkably similar results achieved using NVFP4 (Figures 1 and 2).

The prompt for Figure 1 is, “A cat naps peacefully on a cozy sofa. The sofa is perched atop a tall tree that grows from the surface of the moon. Earth hangs in the distance, a vibrant blue and green jewel in the darkness of space. A sleek spaceship hovers nearby, casting a soft light on the scene, while the entire digital art composition exudes a dreamlike quality.”

Two side-by-side images of a cat napping on a sofa on the moon comparing BF16 precision (left) with NVFP4 (right).
Figure 1. Images created using FLUX.2 [dev] with BF16 precision (left) and NVFP4 (right)

The prompt for Figure 2 is, “An oil painting of a couple in formal evening wear going home are caught in a heavy downpour with no umbrellas.” In this case, discrepancies are more challenging to identify. The most prominent are the smile of the gentleman in the BF16 image and the multiple umbrellas in the background of the NVFP4 image. Aside from these, the majority of fine details are preserved in both the foreground and background of both images.

Two side-by-side images of a couple walking down a rainy cobblestone street comparing BF16 precision (left) and NVFP4 (right).
Figure 2. Images created using FLUX.2 [dev] with BF16 precision (left) and NVFP4 quantization (right) 

Optimizing FLUX.2 [dev] 

The FLUX.2 [dev] model consists of three key components: a text embedding model (specifically Mistral Small 3), the diffusion transformer model, and an autoencoder. The NVIDIA team applied the following optimization techniques to the open source diffusers implementation using a prototype runtime staged in the TensorRT-LLM/feat/visual_gen branch:

  • NVFP4 quantization
  • Timestep Embedding Aware Caching (TeaCache) 
  • CUDA Graphs
  • Torch compile
  • Multi-GPU inferencing

NVFP4 quantization

NVFP4 advances the concept of microscaling data formats by introducing a two-level microblock scaling strategy. This approach is designed to minimize accuracy degradation and features two distinct mechanisms: per-tensor scaling and per-block scaling.

The per-tensor scale is a single value stored in FP32 precision; it adjusts the overall tensor distribution and can be computed statically or dynamically. In contrast, per-block scales are computed dynamically at runtime by dividing the tensor into blocks of 16 elements and scaling each block independently.
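To make the two-level mechanism concrete, the following is a minimal NumPy sketch of the idea. The function name, the simplified rounding to the E2M1 grid, and the handling of scale storage formats are illustrative simplifications rather than the production NVFP4 kernel:

import numpy as np

# Representable magnitudes of the 4-bit E2M1 format used by NVFP4
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK_SIZE = 16  # NVFP4 applies one scale per block of 16 consecutive elements

def fake_quantize_nvfp4(tensor: np.ndarray) -> np.ndarray:
    """Quantize-dequantize with two-level (per-tensor + per-block) scaling.
    Assumes tensor.size is a multiple of 16 for brevity."""
    # Level 1: a single FP32 scale that maps the whole tensor into FP4 range
    # (this scale can be computed statically from calibration or dynamically)
    tensor_scale = np.float32(max(np.abs(tensor).max(), 1e-12) / E2M1_GRID[-1])
    scaled = tensor.astype(np.float32) / tensor_scale

    # Level 2: one scale per 16-element block, always computed dynamically
    blocks = scaled.reshape(-1, BLOCK_SIZE)
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    block_scale = np.where(block_scale == 0.0, 1.0, block_scale)

    # Snap each normalized value to the nearest representable FP4 magnitude
    normalized = blocks / block_scale
    nearest = E2M1_GRID[np.abs(np.abs(normalized)[..., None] - E2M1_GRID).argmin(-1)]
    quantized = np.sign(normalized) * nearest

    # Dequantize by reapplying both scale levels
    return (quantized * block_scale * tensor_scale).reshape(tensor.shape)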

For maximum flexibility, users can choose to retain specific layers in a higher precision and apply dynamic quantization, as shown in the following example using FLUX.2 [dev]:

exclude_pattern = r"^(?!.*(embedder|norm_out|proj_out|to_add_out|to_added_qkv|stream)).*"

NVFP4 quantization is then applied using the following call:

from visual_gen.layers import apply_visual_gen_linear
apply_visual_gen_linear(
    model, 
    load_parameters=True, 
    quantize_weights=True,
    exclude_pattern=exclude_pattern,
)

TeaCache

The TeaCache technique is used to accelerate the inference process. TeaCache conditionally skips a diffusion step by reusing the previous latent generated during the diffusion process. To quantify this effect, the team ran tests with 20 prompts and a 50-step inference process; TeaCache bypassed an average of 16 steps, resulting in an approximately 30% reduction in inference latency.
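The skip decision itself can be summarized with a short sketch. The class below is an illustrative simplification with placeholder names, not the visual_gen implementation; the handling of ret_steps and cutoff_steps from the configuration discussed next is omitted:

import numpy as np

class TeaCacheSketch:
    """Illustrative skip logic: reuse the previous output when the accumulated,
    rescaled change of the timestep-modulated input stays below a threshold."""

    def __init__(self, thresh=0.05, poly_coeffs=(0.0, 0.0, 1.0, 0.0)):
        self.thresh = thresh
        self.rescale = np.poly1d(poly_coeffs)  # maps input diff -> estimated output diff
        self.accumulated = 0.0
        self.prev_modulated = None
        self.cached_output = None

    def step(self, modulated_input, run_model):
        if self.prev_modulated is not None:
            # Relative change of the modulated input versus the previous step
            rel_change = (np.abs(modulated_input - self.prev_modulated).mean()
                          / np.abs(self.prev_modulated).mean())
            self.accumulated += float(self.rescale(rel_change))
            skip = self.accumulated < self.thresh
        else:
            skip = False  # always run the first step

        self.prev_modulated = modulated_input
        if skip:
            return self.cached_output  # step bypassed: reuse the previous latent
        self.accumulated = 0.0
        self.cached_output = run_model(modulated_input)
        return self.cached_output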

To determine the optimal TeaCache hyperparameters, a grid search was employed. The following configuration yields the best balance between computational speed and generation quality:

dit_configs = {
    # ...
    "teacache": {
        "enable_teacache": True,
        "use_ret_steps": True,
        "teacache_thresh": 0.05,
        "ret_steps": 10,
        "cutoff_steps": 50,
    },
    # ...
}

The scaling factor for the caching mechanism was determined empirically and approximated through a third-degree polynomial. This polynomial was fitted using a calibration dataset comprising text-to-image and multireference-image-generation examples.
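For illustration, the fit itself reduces to a few lines of NumPy once pairs of input and output differences have been collected; the calibration data below is synthetic and the variable names are placeholders:

import numpy as np

# Placeholder calibration pairs; in practice these are logged during instrumented
# runs over text-to-image and multi-reference image-generation prompts
input_diffs = np.linspace(0.01, 0.2, 100)                 # modulated input differences
output_diffs = 2.0 * input_diffs + 5.0 * input_diffs**2   # synthetic output differences

# Fit a third-degree polynomial mapping input difference -> output difference
coeffs = np.polyfit(input_diffs, output_diffs, deg=3)
rescale = np.poly1d(coeffs)

# The fitted polynomial rescales the accumulated input difference before it is
# compared against teacache_thresh to decide whether a step can be skipped
print(rescale(0.05))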

Figure 3 illustrates this empirical approach, plotting the raw calibration data points alongside the resulting third-degree polynomial curve (shown in red) that models the relationship between the modulated input difference and the model’s output difference.

A log-log scatter plot illustrating the correlation between modulated input difference and model-predicted output difference. The graph compares the current FLUX.2 third-degree polynomial fit (red line) against the FLUX.1 baseline.
Figure 3. The correlation between the modulated input difference and the model-predicted output difference

CUDA Graphs

NVIDIA TensorRT-LLM visual_gen provides a ready-made wrapper to support CUDA Graphs capture. Simply import the wrapper and replace the forward function: 

from visual_gen.utils.cudagraph import cudagraph_wrapper
model.forward = cudagraph_wrapper(model.forward)

Torch compile 

In all of the team’s experiments, torch.compile was enabled except for the baseline run, as it is not enabled in FLUX.2 [dev] by default.

model = torch.compile(model)

Multi-GPU support

Enabling multiple GPUs using TensorRT-LLM visual_gen involves four steps:

  1. Modify the model.forward function to insert code handling inter-GPU communication
  2. Replace the attention implementation in your model with ditAttnProcessor
  3. Select the parallel algorithm and set the parallelism size in the config
  4. Launch with torchrun

The following snippet provides an example. Insert the split code at the beginning of model.forward to spread the input data across multiple GPUs:

from visual_gen.utils import (
    dit_sp_gather,
    dit_sp_split,
)
# ...
hidden_states = dit_sp_split(hidden_states, dim=1)
encoder_hidden_states = dit_sp_split(encoder_hidden_states, dim=1)
img_ids = dit_sp_split(img_ids, dim=1)
txt_ids = dit_sp_split(txt_ids, dim=1)

Subsequently, insert the gather code at the end of model.forward, before the return statement:

output = dit_sp_gather(output, dim=1)

Then replace the original attention implementation with the provided attention processor, which ensures proper communication across multiple GPUs:

from visual_gen.layers import ditAttnProcessor
# ...
def attention(...):
    # ...
    x = ditAttnProcessor().visual_gen_attn(q, k, v, tensor_layout="HND")
    # ...

Set the correct parallel size in the configuration. For example, to use Ulysses parallelism on four GPUs:

dit_configs = {
    # ...
    "parallel": {
        "dit_ulysses_size": 4,
    },
    # ...
}

Finally, call the setup_configs API to activate the configs:

visual_gen.setup_configs(**dit_configs)

When using multiple GPUs, the script must be launched with torchrun. TensorRT-LLM visual_gen will use the rank information from torchrun and handle all the communication and job splitting correctly. 
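For reference, the following is a minimal, illustrative entry-point sketch; the launch command in the comment and the explicit device binding are examples rather than requirements of the visual_gen API:

# Example launch on four GPUs (the script name is a placeholder):
#   torchrun --nproc_per_node=4 run_flux2_inference.py
import os

import torch

# torchrun sets these environment variables for every worker process;
# visual_gen reads the same rank information to split work and communicate
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Bind each process to its own GPU before building the pipeline
torch.cuda.set_device(local_rank)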

Performance analysis

All the inference optimizations have been included in an end-to-end FLUX.2 [dev] example—low-precision kernels, the caching technique, and multi-GPU inference. 

As shown in Figure 4, the NVIDIA DGX B200 architecture delivers a 1.7x generational leap over NVIDIA H200, even when using default BF16 precision. Further, the layered application of inference optimizations—including CUDA Graphs, torch.compile, NVFP4 precision, and TeaCache—incrementally boosts single-B200 performance from that baseline to a substantial 6.3x speedup. 

Ultimately, multi-GPU inference on a two-B200 configuration achieves a 10.2x performance increase compared to the H200, the current industry standard.

Bar graph showing the inference latency difference for the FLUX.2 [dev] model between NVIDIA H200 and B200 data center GPUs.
Figure 4. Inference latency comparison for FLUX.2 [dev] on NVIDIA B200 GPUs

The baseline is the original FLUX.2 [dev] without any optimizations and without torch.compile enabled. The optimized series enables torch.compile, CUDA Graphs, NVFP4, and TeaCache. All benchmarks used 50 diffusion steps.

On a single GPU, the team found that NVFP4 and TeaCache provide a good tradeoff between speedup and output quality, each delivering approximately a 2x speedup. torch.compile is a near-lossless acceleration technique that most developers are familiar with, but its benefits here are limited. CUDA Graphs are mostly beneficial for multi-GPU inference, unlocking incremental scaling across multiple B200 GPUs. Finally, the overall pipeline proves robust to FP8 quantization of the text encoder, providing additional benefits for large-scale deployments.

On multiple GPUs, the TensorRT-LLM visual_gen sequence parallelism delivers near-linear scaling as more GPUs are added. The same effect is observed on NVIDIA Blackwell B200 and GB200 and on NVIDIA Blackwell Ultra B300 and GB300 GPUs. Additional optimizations are in progress for NVIDIA Blackwell Ultra GPUs.

A horizontal bar chart titled 'FLUX.2-dev multi-GPU scaling' comparing the speedup of B200, GB200, B300, and GB300 GPUs. The chart shows performance across 1, 2, 4, and 8 GPU configurations, with the B300 demonstrating the highest scaling efficiency, reaching nearly an 8x speedup at the 8-GPU mark.
Figure 5. FLUX.2 [dev] multi-GPU inference scaling on Blackwell GPUs

Get started with FLUX.2 on NVIDIA Blackwell GPUs

FLUX.2 is a significant advancement in image generation, successfully combining high-quality outputs with user-friendly deployment options. The NVIDIA team, in collaboration with BFL, achieved substantial acceleration for FLUX.2 [dev] on the most powerful NVIDIA data center GPUs.

Applying new techniques to the FLUX.2 [dev] model, including NVFP4 quantization and TeaCache, delivers a powerful generational leap in inference speed and a remarkable reduction in latency, enabling efficient deployment on NVIDIA data center GPUs.

To get started building your own inference pipeline with these state-of-the-art optimizations, check out the end-to-end FLUX.2 example and accompanying code on the NVIDIA/TensorRT-LLM/visual_gen GitHub repo.
