Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

As AI systems move from single-turn interactions to coordinated multiagent workflows, low-latency inference becomes increasingly important. Autoregressive LLMs generate tokens sequentially, which can limit GPU utilization and constrain throughput in latency-sensitive serving scenarios.

Speculative decoding helps mitigate this bottleneck by using a lightweight model to draft future tokens, which the larger target model then verifies in parallel. DFlash is an open source, lightweight block-diffusion drafter designed for speculative decoding. It generates an entire block of candidate tokens in a single forward pass, turning sequential drafting into block-parallel GPU work while preserving the target model’s output quality through verification.

DFlash increases inference performance for gpt-oss-120b on NVIDIA Blackwell by up to 15x at the same interactivity level. It nearly doubles interactivity for Llama 3.1 8B at the same concurrency compared with state-of-the-art EAGLE-3 speculative decoding.

DFlash is also moving quickly from research into developer workflows. The research team has released 20 DFlash checkpoints on Hugging Face with recipes for NVIDIA Blackwell and NVIDIA Hopper GPUs.

In this post, we share the latency-throughput Pareto curve for DFlash running on an NVIDIA Blackwell Ultra system using TensorRT-LLM. We also discuss how DFlash is becoming available more broadly across NVIDIA GPU inference stacks, including SGLang and vLLM.

How does DFlash deliver higher throughput at the same interactivity on NVIDIA Blackwell? 

Figure 1 shows the latency-throughput Pareto curve for gpt-oss-120b running with DFlash in TensorRT-LLM on an eight-GPU NVIDIA DGX B300 system using the SPEED-Bench coding dataset. Across the curve, DFlash delivers higher throughput than autoregressive decoding at production-relevant latency targets. This configuration serves gpt-oss-120b across all eight NVIDIA Blackwell GPUs in the system, providing the GPU memory, compute, and interconnect bandwidth needed to reach high interactivity targets for agentic use cases such as code generation.

At the high interactivity range of 500-600 tokens/sec per user, DFlash increases throughput on NVIDIA Blackwell by more than 15x compared with autoregressive decoding, and delivers 1.5x higher throughput than EAGLE-3 speculative decoding. At the lowest concurrency point, with batch size 1, DFlash more than doubles interactivity on Blackwell.

Observing the Pareto curve across a variety of concurrencies is important because serving teams typically optimize for a target interactivity level. Interactive coding, reasoning, and agent workloads often need to maintain strict per-user token latency while scaling concurrency. DFlash improves that tradeoff by adding parallelism to the speculative decode path: its block-diffusion drafter generates multiple candidate tokens at once, and the target model verifies them in parallel.
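One way to read such a curve at a fixed interactivity target is to interpolate each configuration's throughput at that target and compare the two. The following is a minimal Python sketch of that comparison. The function name `throughput_at` is illustrative, and the data points are made-up placeholders, not measurements from this post.

```python
from bisect import bisect_left


def throughput_at(points, target):
    """Linearly interpolate throughput at a target interactivity.

    points: (interactivity in tokens/sec per user, throughput) pairs.
    Returns None if the target lies outside the measured range.
    """
    pts = sorted(points)
    xs = [x for x, _ in pts]
    if not xs or target < xs[0] or target > xs[-1]:
        return None
    i = bisect_left(xs, target)
    if xs[i] == target:
        return pts[i][1]
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (target - x0) / (x1 - x0)


# Placeholder numbers for illustration only, NOT measured results.
baseline = [(100, 4000.0), (300, 1000.0), (600, 100.0)]
speculative = [(100, 5000.0), (300, 2500.0), (600, 1000.0)]

target = 450  # tokens/sec per user
ratio = throughput_at(speculative, target) / throughput_at(baseline, target)
print(f"throughput ratio at {target} tokens/sec per user: {ratio:.2f}x")
```

Comparing at a matched interactivity level, rather than at a matched batch size, mirrors how serving teams pick an operating point: the latency target is fixed first, and throughput is whatever the system can sustain at that target.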

On NVIDIA Blackwell, this parallelism is especially valuable. In the decode-constrained region, LLM inference is often limited by memory movement and the sequential nature of token generation rather than raw compute. DFlash helps shift part of this work into parallel block drafting and verification, enabling the system to use more of the available compute while maintaining the same interactivity target.

Each NVIDIA Blackwell Ultra GPU combines two reticle-sized dies connected by 10 TB/s of high-bandwidth chip-to-chip interconnect, forming a unified compute domain with 160 SMs and 640 fifth-generation Tensor Cores. DFlash is well matched to this architecture because it exposes more parallel work to Blackwell’s 15 PFLOPS of dense NVFP4 compute, serving up to 15x more users concurrently at the same interactivity rate.

DFlash also shows interactivity speedups over EAGLE-3 speculative decoding across different datasets. The gains extend to smaller models as well, with DFlash nearly doubling performance over EAGLE-3 on Llama 3.1 8B for the SPEED-Bench multilingual dataset.

Speedups at Same User Concurrency Levels

| Dataset | gpt-oss-120b: EAGLE-3 | gpt-oss-120b: DFlash | Llama 3.1 8B Instruct: EAGLE-3 | Llama 3.1 8B Instruct: DFlash |
| --- | --- | --- | --- | --- |
| Coding | 1.8x | 2.6x | 2.3x | 3.0x |
| RAG | 1.7x | 2.3x | 2.4x | 3.1x |
| Reasoning | 1.8x | 2.3x | 2.5x | 2.8x |
| Writing | 1.5x | 1.8x | 2.3x | 2.7x |
| Multilingual | 1.8x | 2.6x | 1.4x | 2.4x |
| Summarization | 1.6x | 2.0x | 2.3x | 2.6x |
| Average | 1.7x | 2.3x | 2.2x | 2.8x |

Table 1. DFlash delivers higher interactivity speedups than EAGLE-3 at matched user concurrency levels across different SPEED-Bench datasets on gpt-oss-120b and Llama 3.1 8B Instruct
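
The relative gain of DFlash over EAGLE-3 follows from Table 1 by dividing the two speedups in each row. The short Python sketch below does that arithmetic using only the values in the table. The names `table1` and `relative_gain` are illustrative.

```python
# (EAGLE-3 speedup, DFlash speedup) pairs copied from Table 1.
table1 = {
    "gpt-oss-120b": {
        "Coding": (1.8, 2.6),
        "RAG": (1.7, 2.3),
        "Reasoning": (1.8, 2.3),
        "Writing": (1.5, 1.8),
        "Multilingual": (1.8, 2.6),
        "Summarization": (1.6, 2.0),
    },
    "Llama 3.1 8B Instruct": {
        "Coding": (2.3, 3.0),
        "RAG": (2.4, 3.1),
        "Reasoning": (2.5, 2.8),
        "Writing": (2.3, 2.7),
        "Multilingual": (1.4, 2.4),
        "Summarization": (2.3, 2.6),
    },
}


def relative_gain(eagle3, dflash):
    """How many times faster DFlash is than EAGLE-3 for one table row."""
    return dflash / eagle3


for model, rows in table1.items():
    for dataset, (eagle3, dflash) in rows.items():
        gain = relative_gain(eagle3, dflash)
        print(f"{model:22s} {dataset:14s} {gain:.2f}x")
```

For example, the Multilingual row for Llama 3.1 8B Instruct gives 2.4 / 1.4 ≈ 1.71x, the largest gap in the table. Because the published speedups are rounded to one decimal place, these ratios are approximate.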

NVIDIA ecosystem brings DFlash to developers without application refactoring

Researchers at UC San Diego released the paper DFlash: Block Diffusion for Flash Speculative Decoding in February 2026 as part of ongoing work on faster, more efficient LLM inference on NVIDIA Blackwell. Built in PyTorch with native CUDA support, DFlash improves decode performance through block-diffusion speculative decoding. NVIDIA and the open source inference community helped ensure strong framework support across both SGLang and vLLM, giving developers a clear path to introduce DFlash into inference deployments on their serving stack of choice.  

Since the paper’s release, the research team has released 20 DFlash model checkpoints on Hugging Face with Blackwell and Hopper recipes, covering model families including Qwen, Kimi K2.6, Llama, Gemma, and gpt-oss. The recipes include support for popular inference frameworks such as SGLang and vLLM.

On vLLM, developers can swap EAGLE-3 with a DFlash checkpoint, with no code changes outside of the config. The integration runs through the open source Speculators library, which connects the DFlash drafter to the target model’s hidden states inside the vLLM inference path on NVIDIA GPUs. On Gemma 4 31B running on a single Blackwell Ultra GPU, this path delivers up to 5.8x higher throughput at the same concurrency over autoregressive decoding (Table 2).

For SGLang, migrating from EAGLE to DFlash only requires updating the speculative decoding algorithm to DFlash and providing the matching DFlash draft model checkpoint. On Qwen3-8B running on a single Blackwell GPU, this path delivers up to 5.1x higher throughput at the same concurrency over autoregressive decoding (Table 3).

This broad early model and framework coverage on NVIDIA GPUs matters because it enables teams to quickly evaluate and deploy new optimizations through the frameworks developers already use, without any application refactoring.

Speedups at Concurrency 1
Gemma 4 31B | vLLM | 1x NVIDIA DGX B300

| Task | DFlash versus autoregressive |
| --- | --- |
| Math500 | 5.8x |
| GSM8K | 5.3x |
| HumanEval | 5.6x |
| MBPP | 4.4x |
| MT-Bench | 3.0x |

Table 2. DFlash increases throughput over autoregressive decoding on Gemma 4 31B using vLLM on a single NVIDIA Blackwell Ultra GPU, with speedups up to 5.8x across math, coding, and chat benchmarks

Speedups at Concurrency 1
Qwen3-8B | SGLang | 1x B200

| Task | DFlash versus autoregressive |
| --- | --- |
| Math500 | 5.1x |
| HumanEval | 4.2x |

Table 3. DFlash increases throughput over autoregressive decoding on Qwen3-8B using SGLang on a single NVIDIA B200 GPU, reaching up to 5.1x speedup on Math500 and 4.2x on HumanEval

How does DFlash speculative decoding work?

Speculative decoding has two phases: drafting and verification. A smaller draft model proposes future tokens. The target model verifies those tokens in parallel and accepts the longest valid prefix. If the draft is correct, the system generates multiple tokens with one target-model verification pass.
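
The acceptance rule can be sketched in a few lines of Python. This is a minimal illustration for greedy decoding only, not the implementation used by any particular framework. With sampling, speculative decoding methods generally use a probabilistic accept/reject rule instead of exact token matching. The function name and token values are illustrative.

```python
def accept_longest_prefix(draft_tokens, target_tokens):
    """Greedy verification of one drafted block.

    draft_tokens:  k tokens proposed by the drafter.
    target_tokens: k + 1 tokens the target model picks at the same
                   positions, computed in one parallel pass over the draft.
    Returns the tokens emitted this step: the matching prefix, followed by
    the target's own token at the first mismatch (or one extra token from
    the target if the whole draft matches).
    """
    assert len(target_tokens) == len(draft_tokens) + 1
    n = 0
    while n < len(draft_tokens) and draft_tokens[n] == target_tokens[n]:
        n += 1
    return list(draft_tokens[:n]) + [target_tokens[n]]


# Draft diverges at the third position: two tokens accepted, one corrected.
print(accept_longest_prefix([5, 9, 2, 7], [5, 9, 4, 1, 3]))  # [5, 9, 4]

# Whole draft accepted: the target contributes one additional token.
print(accept_longest_prefix([5, 9], [5, 9, 8]))  # [5, 9, 8]
```

Every verification pass emits at least one token, so a poor draft costs drafting time but never produces output the target model would not have chosen.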

Traditional speculative decoding methods often use autoregressive draft models. These drafters still generate tokens sequentially, so drafting cost increases as the number of speculative tokens increases. This limits how far the method can push throughput.

DFlash replaces the autoregressive drafter with a lightweight block-diffusion drafter. Instead of generating tokens one by one, the DFlash drafter predicts a block of masked future tokens in a single forward pass.
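
A simplified cost model shows why the number of draft passes matters. The Python sketch below rests on several assumptions that are not from the DFlash paper or this post: each drafted token is accepted independently with probability `alpha`, a verification pass costs the same as one ordinary decode step regardless of block length, and each draft pass costs a fixed fraction `draft_cost` of a target pass. The numbers are illustrative, not measured.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per verification step.

    With independent per-token acceptance probability alpha and k drafted
    tokens, this is the geometric sum 1 + alpha + ... + alpha**k.
    """
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha, k, draft_passes, draft_cost):
    """Estimated speedup over plain decoding.

    Cost is measured in units of one target forward pass: one verification
    pass plus draft_passes drafter passes of relative cost draft_cost.
    """
    step_cost = 1.0 + draft_passes * draft_cost
    return expected_tokens(alpha, k) / step_cost


alpha, cost = 0.8, 0.05  # illustrative assumptions, not measured values
for k in (4, 8, 16):
    sequential = speedup(alpha, k, draft_passes=k, draft_cost=cost)
    block = speedup(alpha, k, draft_passes=1, draft_cost=cost)
    print(f"k={k:2d}  sequential drafter {sequential:.2f}x  block drafter {block:.2f}x")
```

In this model the sequential drafter's overhead grows linearly with `k` while its expected accepted length saturates, so longer drafts eventually stop paying off. A drafter that needs one pass per block keeps the overhead constant. Using the same `alpha` for both drafters is itself an assumption, since real acceptance rates depend on the drafter.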

DFlash combines three key techniques:

  • Block-diffusion drafting: The drafter predicts multiple future tokens in parallel.
  • Target hidden-state conditioning: The drafter uses context features extracted from the target model.
  • KV injection: Target context features are injected into the draft model’s key-value projections across layers, helping maintain high acceptance rates.

This design enables the drafter to be both fast and effective. The target model still performs verification, so DFlash preserves the target model’s output distribution while accelerating generation.
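
The effect of verification can be illustrated with a toy example. The Python sketch below uses deterministic stand-ins for both models (`target_next` and `draft_block` are hypothetical, not DFlash components) and greedy decoding. It shows that an imperfect drafter changes how many target passes are needed, but not the tokens produced.

```python
def target_next(context):
    """Toy stand-in for the target model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 17


def draft_block(context, k):
    """Toy stand-in for a block drafter that returns k guesses.

    A real block drafter would produce the block in one forward pass; the
    loop here only fabricates plausible guesses, with a deliberate error
    injected at every third position.
    """
    guesses, ctx = [], list(context)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 17  # injected error
        guesses.append(tok)
        ctx.append(tok)
    return guesses


def greedy_decode(prompt, n):
    """Plain decoding: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, k):
    """Draft a block, verify it, keep the matching prefix plus one token."""
    out, target_passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        draft = draft_block(out, k)
        target_passes += 1  # a real system verifies all positions in parallel
        ctx = list(out)
        for tok in draft:
            expected = target_next(ctx)
            ctx.append(expected)  # accepted token, or the correction
            if tok != expected:
                break
        else:
            ctx.append(target_next(ctx))  # extra token when all k match
        out = ctx
    return out[len(prompt):len(prompt) + n], target_passes


prompt, n = [3, 1, 4], 20
tokens, passes = speculative_decode(prompt, n, k=4)
print("same output as plain decoding:", tokens == greedy_decode(prompt, n))
print("target passes:", passes, "versus", n, "for plain decoding")
```

The same idea carries over to sampling, where methods in the speculative decoding literature replace exact matching with a probabilistic accept/reject rule designed to keep the target model's output distribution unchanged.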

Get started boosting inference performance with DFlash 

The research community continues to develop new inference optimizations on NVIDIA GPUs, and DFlash is a strong example of how the NVIDIA ecosystem can make these ideas available to developers quickly. 

Ready to get started? DFlash is now available on NVIDIA GPUs across open model checkpoints and is supported in SGLang, vLLM, and TensorRT-LLM. 
