Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

Jun 23, 2026

By Amr Elmeleegy, Benjamin Chislett, Fernando Xiong, Michael Iovine, Omri Almog, Hao Zhang and Zhijian Liu

AI-Generated Summary

DFlash, an open source block diffusion model for speculative decoding, significantly accelerates LLM inference on NVIDIA Blackwell GPUs by drafting entire token blocks in parallel, rather than sequentially, and verifying them efficiently with the target model.
Benchmarks show that DFlash delivers up to 15x higher throughput for gpt-oss-120b than autoregressive decoding and nearly doubles interactivity for Llama 3.1 8B compared with EAGLE-3 at the same concurrency. Speedups over autoregressive decoding reach 5.8x for Gemma 4 31B on vLLM and 5.1x for Qwen3-8B on SGLang.
Integration of DFlash into major inference frameworks like SGLang, vLLM, and TensorRT-LLM allows developers to adopt DFlash without code refactoring, leveraging model checkpoints released on Hugging Face for a wide set of NVIDIA GPU architectures and model families.

As AI systems move from single-turn interactions to coordinated multiagent workflows, low-latency inference becomes increasingly important. Autoregressive LLMs generate tokens sequentially, which can limit GPU utilization and constrain throughput in latency-sensitive serving scenarios.

Speculative decoding helps mitigate this bottleneck by using a lightweight model to draft future tokens, which the larger target model then verifies in parallel. DFlash is an open source, lightweight block-diffusion model designed for speculative decoding. Its drafter generates an entire block of candidate tokens in a single forward pass, turning sequential drafting into block-parallel GPU work while preserving the target model’s output quality through verification.

DFlash increases throughput for gpt-oss-120b on NVIDIA Blackwell by up to 15x over autoregressive decoding at the same interactivity level. It also nearly doubles interactivity for Llama 3.1 8B at the same concurrency compared with state-of-the-art EAGLE-3 speculative decoding.

DFlash is also moving quickly from research into developer workflows. The research team has released 20 DFlash checkpoints on Hugging Face with recipes for NVIDIA Blackwell and NVIDIA Hopper GPUs.

In this post, we share the latency-throughput Pareto curve for DFlash running on an NVIDIA Blackwell Ultra system using TensorRT-LLM. We also discuss how DFlash is becoming available more broadly across NVIDIA GPU inference stacks, including SGLang and vLLM.

How does DFlash deliver higher throughput at the same interactivity on NVIDIA Blackwell?

Figure 1 shows the latency-throughput Pareto curve for gpt-oss-120b running with DFlash in TensorRT-LLM on an eight-GPU NVIDIA DGX B300 system using the SPEED-Bench coding dataset. Across the curve, DFlash delivers higher throughput at production-relevant latency targets compared with autoregressive decoding. This configuration serves gpt-oss-120b across all eight NVIDIA Blackwell GPUs in the system, providing the GPU memory, compute, and interconnect bandwidth needed to reach high interactivity targets for agentic use cases such as code generation.

At the high interactivity range of 500-600 tokens/sec per user, DFlash increases throughput on NVIDIA Blackwell by more than 15x compared with autoregressive decoding and delivers 1.5x the throughput of EAGLE-3 speculative decoding. At the lowest concurrency point, with batch size 1, DFlash more than doubles interactivity on Blackwell.

Observing the Pareto curve across a variety of concurrencies is important because serving teams typically optimize for a target interactivity level. Interactive coding, reasoning, and agent workloads often need to maintain strict per-user token latency while scaling concurrency. DFlash improves that tradeoff by adding parallelism to the speculative decode path: its block-diffusion drafter generates multiple candidate tokens at once, and the target model verifies them in parallel.
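
To make the idea of optimizing for a target interactivity level concrete, the following minimal Python sketch filters a concurrency sweep down to its Pareto frontier and then picks the highest-throughput point that still meets a per-user target. The `Point` fields, the per-GPU throughput unit, and every number in `SWEEP` are made up for illustration; they are not measurements from this post.

```python
from typing import List, NamedTuple, Optional


class Point(NamedTuple):
    concurrency: int
    interactivity: float  # tokens/sec per user
    throughput: float     # tokens/sec per GPU (unit is an assumption)


def dominates(q: Point, p: Point) -> bool:
    """True if q is at least as good as p on both axes and strictly better on one."""
    return (
        q.interactivity >= p.interactivity
        and q.throughput >= p.throughput
        and (q.interactivity > p.interactivity or q.throughput > p.throughput)
    )


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.interactivity)


def best_at_target(points: List[Point], min_interactivity: float) -> Optional[Point]:
    """Highest-throughput frontier point that meets the interactivity target."""
    eligible = [p for p in pareto_frontier(points) if p.interactivity >= min_interactivity]
    return max(eligible, key=lambda p: p.throughput, default=None)


# Illustrative sweep only: these values are invented, not benchmark results.
SWEEP = [
    Point(1, 600.0, 800.0),
    Point(4, 450.0, 2500.0),
    Point(8, 240.0, 3000.0),   # dominated by the concurrency-16 point below
    Point(16, 250.0, 7000.0),
    Point(64, 90.0, 15000.0),
]

print(best_at_target(SWEEP, min_interactivity=400.0))
```

Comparing two decoding methods at a fixed target then amounts to running `best_at_target` on each method's sweep and dividing the resulting throughputs.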

On NVIDIA Blackwell, this parallelism is especially valuable. In the decode-constrained region, LLM inference is often limited by memory movement and the sequential nature of token generation rather than raw compute. DFlash helps shift part of this work into parallel block drafting and verification, enabling the system to use more of the available compute while maintaining the same interactivity target.

Each NVIDIA Blackwell Ultra GPU combines two reticle-sized dies connected by 10 TB/s of high-bandwidth chip-to-chip interconnect, forming a unified compute domain with 160 SMs and 640 fifth-generation Tensor Cores. DFlash is well matched to this architecture because it exposes more parallel work to Blackwell’s 15 PFLOPS of dense NVFP4 compute, serving up to 15x more users concurrently at the same interactivity rate.

DFlash also shows interactivity speedups over EAGLE-3 speculative decoding across different datasets. The gains extend to smaller models as well, with DFlash nearly doubling performance over EAGLE-3 on Llama 3.1 8B for the SPEED-Bench multilingual dataset.

Speedups at Same User Concurrency Levels
Dataset	gpt-oss-120b (EAGLE-3)	gpt-oss-120b (DFlash)	Llama 3.1 8B Instruct (EAGLE-3)	Llama 3.1 8B Instruct (DFlash)
Coding	1.8x	2.6x	2.3x	3.0x
RAG	1.7x	2.3x	2.4x	3.1x
Reasoning	1.8x	2.3x	2.5x	2.8x
Writing	1.5x	1.8x	2.3x	2.7x
Multilingual	1.8x	2.6x	1.4x	2.4x
Summarization	1.6x	2.0x	2.3x	2.6x
Average	1.7x	2.3x	2.2x	2.8x

Table 1. DFlash delivers higher interactivity speedups than EAGLE-3 at matched user concurrency levels across different SPEED-Bench datasets on gpt-oss-120b and Llama 3.1 8B Instruct
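
The relative gain of DFlash over EAGLE-3 can be read off Table 1 by dividing the two speedups in each row. The short Python sketch below does that arithmetic, assuming both columns are measured against the same autoregressive baseline at the same concurrency. The values are copied from the table; the dictionary layout and function name are illustrative.

```python
# Speedups over autoregressive decoding from Table 1, as (EAGLE-3, DFlash).
TABLE_1 = {
    "gpt-oss-120b": {
        "Coding": (1.8, 2.6),
        "RAG": (1.7, 2.3),
        "Reasoning": (1.8, 2.3),
        "Writing": (1.5, 1.8),
        "Multilingual": (1.8, 2.6),
        "Summarization": (1.6, 2.0),
    },
    "Llama 3.1 8B Instruct": {
        "Coding": (2.3, 3.0),
        "RAG": (2.4, 3.1),
        "Reasoning": (2.5, 2.8),
        "Writing": (2.3, 2.7),
        "Multilingual": (1.4, 2.4),
        "Summarization": (2.3, 2.6),
    },
}


def dflash_over_eagle3(model: str, dataset: str) -> float:
    """Ratio of the DFlash speedup to the EAGLE-3 speedup for one table row."""
    eagle3, dflash = TABLE_1[model][dataset]
    return dflash / eagle3


for model, rows in TABLE_1.items():
    for dataset in rows:
        ratio = dflash_over_eagle3(model, dataset)
        print(f"{model:22s} {dataset:14s} {ratio:.2f}x")
```

For the Llama 3.1 8B Instruct multilingual row, the ratio is 2.4 / 1.4, or about 1.7x, which is the largest relative gain in the table.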

NVIDIA ecosystem brings DFlash to developers without application refactoring

Researchers at UC San Diego released the paper DFlash: Block Diffusion for Flash Speculative Decoding in February 2026 as part of ongoing work on faster, more efficient LLM inference on NVIDIA Blackwell. Built in PyTorch with native CUDA support, DFlash improves decode performance through block-diffusion speculative decoding. NVIDIA and the open source inference community helped ensure strong framework support across both SGLang and vLLM, giving developers a clear path to introduce DFlash into inference deployments on their serving stack of choice.

Since the paper’s release, the research team has released 20 DFlash model checkpoints on Hugging Face with Blackwell and Hopper recipes, covering model families including Qwen, Kimi K2.6, Llama, Gemma, and gpt-oss. The recipes include support for popular inference frameworks such as SGLang and vLLM.

On vLLM, developers can swap an EAGLE-3 draft checkpoint for a DFlash checkpoint with no changes outside the configuration. The integration runs through the open source Speculators library, which connects the DFlash drafter to the target model’s hidden states inside the vLLM inference path on NVIDIA GPUs. On Gemma 4 31B running on a single Blackwell Ultra GPU, this path delivers up to 5.8x higher throughput at the same concurrency over autoregressive decoding (Table 2).
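
As a rough illustration of a config-only swap, the Python sketch below replaces just the drafter fields of a speculative decoding configuration. The key names follow vLLM's `speculative_config` dictionary as commonly documented, but the method string `"dflash"`, both checkpoint placeholders, and the token count are assumptions rather than values from this post. Check the recipe published with each checkpoint for the exact settings.

```python
from typing import Any, Dict


def swap_drafter(config: Dict[str, Any], method: str, draft_model: str) -> Dict[str, Any]:
    """Return a copy of a speculative decoding config with only the drafter replaced."""
    updated = dict(config)
    updated["method"] = method
    updated["model"] = draft_model
    return updated


eagle3_config = {
    "method": "eagle3",
    "model": "<eagle3-draft-checkpoint>",  # placeholder, not a real repository id
    "num_speculative_tokens": 3,            # illustrative value only
}

# ASSUMPTION: "dflash" and the checkpoint id are placeholders for whatever
# the published DFlash recipe specifies.
dflash_config = swap_drafter(eagle3_config, "dflash", "<dflash-draft-checkpoint>")

changed = sorted(k for k in eagle3_config if eagle3_config[k] != dflash_config[k])
print(changed)  # ['method', 'model']
```

In vLLM, a dictionary like this is typically passed as the `speculative_config` argument or as JSON to the `--speculative-config` flag. A real migration may also adjust the number of speculative tokens to match the drafter's block size.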

For SGLang, migrating from EAGLE to DFlash only requires updating the speculative decoding algorithm to DFlash and providing the matching DFlash draft model checkpoint. On Qwen3-8B running on a single Blackwell GPU, this path delivers up to 5.1x higher throughput at the same concurrency over autoregressive decoding (Table 3).
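
The two changes described for SGLang map onto two launch arguments. The Python sketch below builds the server command before and after the migration. The flags `--speculative-algorithm` and `--speculative-draft-model-path` are existing SGLang server arguments, but the algorithm value `DFLASH` and the draft checkpoint placeholders are assumptions here, so confirm them against the recipe for your checkpoint.

```python
import shlex
from typing import List


def sglang_launch_args(target: str, algorithm: str, draft: str) -> List[str]:
    """Build an SGLang server command with speculative decoding enabled."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", target,
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft,
    ]


before = sglang_launch_args("Qwen/Qwen3-8B", "EAGLE3", "<eagle3-draft-checkpoint>")
# ASSUMPTION: "DFLASH" and the draft checkpoint id are placeholders.
after = sglang_launch_args("Qwen/Qwen3-8B", "DFLASH", "<dflash-draft-checkpoint>")

# Only the algorithm name and the draft model path differ between the two commands.
diff = [(old, new) for old, new in zip(before, after) if old != new]
print(diff)
print(shlex.join(after))
```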

This broad early model and framework coverage on NVIDIA GPUs matters because it enables teams to quickly evaluate and deploy new optimizations through the frameworks developers already use, without any application refactoring.

Speedups at Concurrency 1: Gemma 4 31B | vLLM | 1x NVIDIA DGX B300
Task	DFlash versus autoregressive
Math500	5.8x
GSM8K	5.3x
HumanEval	5.6x
MBPP	4.4x
MT-Bench	3.0x

Table 2. DFlash increases throughput over autoregressive decoding on Gemma 4 31B using vLLM on a single NVIDIA Blackwell Ultra GPU, with speedups up to 5.8x across math, coding, and chat benchmarks

Speedups at Concurrency 1: Qwen3-8B | SGLang | 1x B200
Task	DFlash versus autoregressive
Math500	5.1x
HumanEval	4.2x

Table 3. DFlash increases throughput over autoregressive decoding on Qwen3-8B using SGLang on a single NVIDIA B200 GPU, reaching up to 5.1x speedup on Math500 and 4.2x on HumanEval

How does DFlash speculative decoding work?

Speculative decoding has two phases: drafting and verification. A smaller draft model proposes future tokens. The target model verifies those tokens in parallel and accepts the longest valid prefix. If the draft is correct, the system generates multiple tokens with one target-model verification pass.
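
The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration of the greedy variant (accept a draft token only if it matches the target's own choice), not the implementation used in any framework. The `NextToken` interface and the toy target model are hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps a token prefix to the model's greedy next token.
NextToken = Callable[[Sequence[int]], int]


def verify_draft(target_next: NextToken, prefix: List[int], draft: List[int]) -> List[int]:
    """Return the tokens emitted by one verification step (greedy variant).

    A real engine scores every draft position in one batched forward pass.
    This loop calls the target once per position only to keep the logic readable.
    """
    emitted: List[int] = []
    context = list(prefix)
    for proposed in draft:
        expected = target_next(context)
        if proposed != expected:
            # First mismatch: keep the target's own token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(proposed)
        context.append(proposed)
    # Whole draft accepted: the target's next prediction comes along as a bonus token.
    emitted.append(target_next(context))
    return emitted


def toy_target(context: Sequence[int]) -> int:
    """Toy deterministic target model: the next token is the last token plus one."""
    return context[-1] + 1


print(verify_draft(toy_target, prefix=[1, 2, 3], draft=[4, 5, 9, 10]))  # [4, 5, 6]
```

Here the first two draft tokens match the target, the third does not, and the step emits three tokens. Every emitted token is one the target would have produced on its own, which is why verification preserves output quality.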

Traditional speculative decoding methods often use autoregressive draft models. These drafters still generate tokens sequentially, so drafting cost increases as the number of speculative tokens increases. This limits how far the method can push throughput.

DFlash replaces the autoregressive drafter with a lightweight block-diffusion drafter. Instead of generating tokens one by one, the DFlash drafter predicts a block of masked future tokens in a single forward pass.
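
A simple cost model shows why the number of drafter passes matters. The Python sketch below uses the standard expected-acceptance formula for speculative decoding, assuming each draft token is accepted independently with probability `alpha` and that verifying a block costs about one target forward pass. `ALPHA` and `C` are assumed values chosen for illustration, not measurements. Real acceptance rates and drafter costs differ between drafters.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k drafted tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Modeled speedup over plain autoregressive decoding.

    draft_cost is the total drafting cost per step, in units of one
    target-model forward pass. Verification is counted as one such pass.
    """
    return expected_tokens(alpha, k) / (draft_cost + 1.0)


ALPHA = 0.8  # assumed per-token acceptance rate
C = 0.05     # assumed cost of one drafter pass relative to one target pass

for k in (4, 8, 16):
    sequential = speedup(ALPHA, k, draft_cost=k * C)  # k drafter passes per step
    block = speedup(ALPHA, k, draft_cost=C)           # one drafter pass per block
    print(f"k={k:2d}  sequential={sequential:.2f}x  block={block:.2f}x")
```

Under these assumptions, the sequential drafter's modeled speedup flattens and then falls as `k` grows because its cost scales with `k`, while the single-pass block drafter keeps benefiting from longer drafts.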

DFlash combines three key techniques:

Block-diffusion drafting: The drafter predicts multiple future tokens in parallel.
Target hidden-state conditioning: The drafter uses context features extracted from the target model.
KV injection: Target context features are injected into the draft model’s key-value projections across layers, helping maintain high acceptance rates.

This design enables the drafter to be both fast and effective. The target model still performs verification, so DFlash preserves the target model’s output distribution while accelerating generation.

Get started boosting inference performance with DFlash

The research community continues to develop new inference optimizations on NVIDIA GPUs, and DFlash is a strong example of how the NVIDIA ecosystem can make these ideas available to developers quickly.

Ready to get started? DFlash is now available on NVIDIA GPUs across open model checkpoints and is supported in SGLang, vLLM, and TensorRT-LLM.

About the Authors

About Amr Elmeleegy
Amr Elmeleegy is a principal product marketing manager for accelerated computing in the data center, focused on the NVIDIA AI inference platform. Previously, he held business development and product marketing roles at AWS and SAP. He holds an MBA from the UC Berkeley Haas School of Business and a bachelor’s degree in electrical engineering from Cairo University.

About Benjamin Chislett
Benjamin Chislett is a senior software engineer at NVIDIA and a maintainer of the vLLM inference engine. He works on speculative decoding algorithms and performance optimization for LLM inference.

About Fernando Xiong
Fernando Xiong is a senior architect in the Compute Architecture group at NVIDIA, focusing on speculative decoding, performance optimization for LLM inference, and AI agent systems for software engineering. Fernando received his master’s degree in Computer Science from Renmin University of China.

About Michael Iovine
Michael Iovine is a senior software engineer at NVIDIA. He currently works on inference optimization for TensorRT-LLM and leads the development of the framework’s speculative decoding module. He holds a bachelor’s degree in Computer Science from the California Institute of Technology.

About Omri Almog
Omri Almog is a senior product manager in the AI Platform Software group at NVIDIA, responsible for managing products that optimize models for inference. Omri earned his bachelor’s degree from Oregon State University and his master’s degree from the University of California, Santa Barbara.

About Hao Zhang
Hao Zhang is an assistant professor at the Halıcıoğlu Data Science Institute and an affiliate faculty member in the Department of Computer Science and Engineering at the University of California, San Diego. He leads the Hao AI Lab at UC San Diego. In 2023, he co-founded LMNet.ai, which joined forces with Snowflake in November 2023. From 2016 to 2021, Hao worked at the ML platform startup Petuum Inc.

About Zhijian Liu
Zhijian Liu is an assistant professor at the University of California, San Diego, where he leads Z Lab (z-lab.ai). His research focuses on efficient machine learning and inference for large language, vision, and agentic models. He is also a co-founder of Inco AI. He received his PhD from MIT, where he was advised by Song Han.
