Accelerating Diffusion Models with an Open, Plug-and-Play Offering

NVIDIA Research has built a new open source library called NVIDIA FastGen that unifies state-of-the-art video diffusion distillation techniques.

Recent advances in large-scale diffusion models have revolutionized generative AI across multiple domains, from image synthesis to audio generation, 3D asset creation, molecular design, and beyond. These models have demonstrated unprecedented capabilities in producing high-quality, diverse outputs across various conditional generation tasks.

Despite these successes, sampling inefficiency remains a fundamental bottleneck. Standard diffusion models require tens to hundreds of iterative denoising steps, leading to high inference latency and substantial computational cost. This limits practical deployment in interactive applications, edge devices, and large-scale production systems.

Video generation faces an especially critical challenge. Open source models such as NVIDIA Cosmos, along with commercial text-to-video (T2V) systems, have shown remarkable generation capabilities. However, video diffusion models are orders of magnitude more computationally demanding than their image counterparts due to the temporal dimension. Generating a single video can take minutes to hours, making real-time video generation, interactive editing, and world modeling for agent training challenging.

Accelerating diffusion sampling without sacrificing quality and diversity has emerged as a key open challenge, with video generation being one of the most demanding and impactful applications to solve.

This blog introduces NVIDIA FastGen, an open source library that unifies state-of-the-art diffusion distillation techniques for accelerating many-step diffusion models into one-step or few-step generators. We review trajectory-based and distribution-based distillation approaches, present reproducible benchmarks showing 10x to 100x sampling speedups while maintaining quality, and showcase FastGen's scalability to video models of up to 14B parameters. We also highlight applications to interactive world modeling, where causal distillation enables real-time video generation.

What are the key approaches to acceleration?

A growing body of research has explored diffusion distillation, which aims to compress long denoising trajectories into a small number of inference steps. Existing approaches broadly fall into two categories:

  1. Trajectory-based distillation—including progressive distillation and consistency models such as OpenAI's iCT and sCM, and MIT and CMU's MeanFlow—directly regresses the teacher's denoising trajectories.
  2. Distribution-based distillation—such as Stability AI's LADD, and MIT and Adobe's DMD—aligns the student and teacher distributions using adversarial or variational objectives (a simplified sketch of both objective families follows this list).
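To make the two families concrete, the sketch below contrasts a consistency-style trajectory loss with a DMD-style distribution-matching loss. It is deliberately simplified: `teacher.ode_step`, the `(x, t)` module signatures, and the noising helper are illustrative assumptions, and real methods add EMA targets, time-step weighting, and other stabilizers.

```python
import torch
import torch.nn.functional as F

def add_noise(x0, t, sigma=1.0):
    # Hypothetical forward-noising helper (variance-exploding style).
    return x0 + sigma * t * torch.randn_like(x0)

def trajectory_loss(student, teacher, x_t, t, s):
    # Trajectory-based (consistency-style): predictions from a noisier point t
    # and a cleaner point s on the same teacher ODE trajectory must agree, so
    # every point on the trajectory maps to the same clean sample.
    with torch.no_grad():
        x_s = teacher.ode_step(x_t, t, s)   # one teacher ODE step from t to s
        target = student(x_s, s)            # prediction from the cleaner point
    return F.mse_loss(student(x_t, t), target)

def distribution_loss(student, real_score, fake_score, z, t):
    # Distribution-based (DMD-style): nudge the student's output distribution
    # toward the teacher's by following the difference between a score of the
    # data distribution (real_score, from the teacher) and a score of the
    # student's own generated distribution (fake_score, trained alongside).
    x0 = student(z, torch.ones_like(t))     # one-step generation from noise
    x_t = add_noise(x0, t)
    with torch.no_grad():
        grad = fake_score(x_t, t) - real_score(x_t, t)
    # Surrogate loss whose gradient w.r.t. the student's output equals `grad`.
    return (x0 * grad).mean()
```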

These methods have successfully reduced diffusion sampling to one or two steps in the image domain. However, each family comes with notable tradeoffs. Trajectory-based methods often suffer from training instability, slow convergence, and scalability challenges, while distribution-based methods tend to be memory-intensive, sensitive to initialization, and prone to mode collapse. Moreover, none of these approaches alone consistently achieves one-step generation with high fidelity for complex data such as real-world videos.

This motivates the need for a unified and extensible framework that can integrate, compare, and evolve diffusion distillation methods toward stable training, high-quality generation, and scalability to large models and complex data.

What FastGen offers

FastGen is a new, open source, versatile library that brings together state-of-the-art diffusion distillation methods under a generic, plug-and-play interface.

Unified and flexible interface

FastGen provides a unified abstraction for accelerating diffusion models across diverse tasks. Users provide their diffusion model (and, optionally, training data) and select a suitable distillation method. FastGen then handles the training and inference pipeline, converting the original model into a one-step or few-step generator with minimal engineering overhead.
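FastGen's exact public interface is best taken from its repository; as a minimal illustration of the pattern such a library targets, here is a PyTorch sketch, assuming a teacher `nn.Module` with an `(x, t)` signature:

```python
import copy
import torch
import torch.nn as nn

class OneStepGenerator(nn.Module):
    # The student shares the teacher's architecture and is initialized from
    # its weights, then fine-tuned by a distillation objective to map noise
    # to a clean sample in a single forward pass.
    def __init__(self, teacher: nn.Module):
        super().__init__()
        self.net = copy.deepcopy(teacher)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Evaluate at one fixed time step instead of iterating a sampler.
        t = torch.ones(z.shape[0], device=z.device)
        return self.net(z, t)
```

Initializing the student from the teacher is what makes "minimal engineering overhead" realistic: the distillation objective only has to reshape an already-capable network rather than train one from scratch.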

Figure 1. The FastGen distillation pipeline: a user's multi-step diffusion model is converted into a one-step generator while preserving generation quality.

Reproducible benchmarks and fair comparisons

FastGen reproduces all supported distillation methods on standard image generation benchmarks. Historically, diffusion distillation methods have been proposed and evaluated in isolated codebases with different training recipes, making fair comparisons difficult. By unifying implementations and hyperparameter choices, FastGen enables transparent benchmarking and serves as a common evaluation platform for the few-step diffusion community.

Table 1 below presents a comprehensive comparison of distillation method performance on CIFAR-10 and ImageNet-64 benchmarks, demonstrating FastGen’s reproducibility. The table shows one-step image generation quality achieved by FastGen’s unified implementations alongside the original results reported in their respective papers (shown in parentheses). Each method is categorized by its distillation approach: trajectory-based methods that optimize along the diffusion trajectory (ECT, TCM, sCT, sCD, MeanFlow) and distribution-based methods that directly match generated distributions (LADD, DMD2, f-distill).

| Distillation category | Method | CIFAR-10 FID: FastGen (reported) | ImageNet-64 FID: FastGen (reported) |
|---|---|---|---|
| Trajectory-based | ECT (Geng et al., 2024) | 2.92 (3.60) | 4.05 (4.05) |
| Trajectory-based | TCM (Lee et al., 2025) | 2.70 (2.46) | 2.23 (2.20) |
| Trajectory-based | sCT (Lu et al., 2025) | 3.23 (2.85) | |
| Trajectory-based | sCD (Lu et al., 2025) | 3.23 (3.66) | |
| Trajectory-based | MeanFlow (Geng et al., 2025) | 2.82 (2.92) | |
| Distribution-based | LADD (Sauer et al., 2024) | | |
| Distribution-based | DMD2 (Yin et al., 2024) | 1.99 (2.13)* | 1.12 (1.28) |
| Distribution-based | f-distill (Xu et al., 2025) | 1.85 (1.92)* | 1.11 (1.16) |

Table 1. One-step image generation quality measured by Fréchet inception distance (FID; lower is better), reproduced on standard image generation benchmarks with the FastGen library. The FID numbers in parentheses are from the original papers (* indicates conditional CIFAR-10); blank cells indicate results not reported in this comparison.
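For readers reproducing numbers like these, FID compares Inception-feature statistics of generated and reference images. It can be computed with, for example, torchmetrics (FastGen's own evaluation harness may differ):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # 2048-dim Inception-v3 pool features

# torchmetrics expects uint8 images in [0, 255] with shape (N, 3, H, W) by
# default; the random tensors here stand in for real and generated batches.
real = torch.randint(0, 256, (64, 3, 32, 32), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 32, 32), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())  # lower is better; Table 1 reports this metric
```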

Beyond vision tasks

While we demonstrate FastGen on vision tasks in this blog, the library is generic enough to accelerate any diffusion model across different domains. One area of particular interest is AI-for-science applications, where sample quality is often as important as sample diversity. 

By decoupling distillation methods from network definitions, FastGen makes adding new models a plug-and-play exercise. For example, we have successfully distilled the NVIDIA weather downscaling model, Corrector Diffusion (CorrDiff), in NVIDIA PhysicsNeMo using ECT for one-step km-scale atmospheric downscaling.

As visualized in Figure 2 below, the distilled model matches the predictions of CorrDiff (in terms of skill and spread) while running 23x faster at inference.

Figure 2. Observations of the eastward wind at 10 meters during Typhoon Chanthu (top left), four predictions from the distilled one-step ECT model (top right), and from the 18-step CorrDiff teacher (bottom right); the one-step predictions are visually almost indistinguishable from the teacher's.
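The speedup itself is easy to sanity-check, since few-step inference cost is roughly proportional to the number of network evaluations. A minimal timing harness, with a small convolutional stand-in for the denoiser:

```python
import time
import torch
import torch.nn as nn

# Stand-in denoiser: any network whose cost dominates a sampling step.
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
x = torch.randn(1, 3, 256, 256)

def sample(steps):
    # Each denoising step is one network evaluation.
    y = x
    for _ in range(steps):
        y = net(y)
    return y

with torch.no_grad():
    for steps in (18, 1):  # 18-step teacher vs. one-step student
        t0 = time.perf_counter()
        sample(steps)
        print(f"{steps:>2} step(s): {time.perf_counter() - t0:.3f} s")
```

With identical per-step cost this measures roughly steps-proportional savings; reported end-to-end factors such as 23x also reflect pipeline overheads outside the raw network calls that the one-step model avoids.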

Scalable and efficient infrastructure

FastGen also provides a highly optimized training infrastructure for scaling diffusion distillation to large models. Supported techniques include:

  • Fully Sharded Data Parallel v2 (FSDP2)
  • Automatic mixed precision (AMP)
  • Context parallelism (CP)
  • FlexAttention
  • Efficient KV cache management
  • Adaptive finite-difference JVP estimation (a simplified sketch follows this list)
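The last item deserves a note. Continuous-time consistency objectives (such as sCT or MeanFlow-style losses) require Jacobian-vector products (JVPs) of the network, and forward-mode autodiff is not always supported or affordable for large models. A finite-difference estimate sidesteps this; the adaptive step choice below is a simplified stand-in for whatever scheme FastGen actually uses:

```python
import torch

def fd_jvp(f, x, v, eps_scale=1e-3):
    # Estimate the Jacobian-vector product J_f(x) @ v by central differences.
    # The step is scaled to the magnitudes of x and v ("adaptive") so it is
    # neither dominated by floating-point rounding error nor too coarse.
    eps = eps_scale * (1.0 + x.norm()) / (1.0 + v.norm())
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

# Sanity check against exact forward-mode autodiff:
W = torch.randn(4, 4)
f = lambda x: torch.tanh(x @ W)
x, v = torch.randn(8, 4), torch.randn(8, 4)
exact = torch.func.jvp(f, (x,), (v,))[1]
print((fd_jvp(f, x, v) - exact).abs().max())  # small approximation error
```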

With these optimizations, FastGen can distill large-scale models efficiently. For example, we successfully distilled a 14B Wan2.1 T2V model into a few-step generator using DMD2, achieving convergence in 16 hours on 64 NVIDIA H100 GPUs.
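Schematically, the FSDP2-plus-AMP portion of such a run looks like the following sketch. It assumes a process group already initialized (for example via torchrun); note that FSDP2's import path has moved across PyTorch releases, and FastGen's actual wrapping logic is more involved:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy  # PyTorch >= 2.6

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=8,
)

# Keep parameters in bf16 for compute and reduce gradients in fp32, a common
# mixed-precision recipe for large distillation runs.
policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)

# Shard each transformer block, then the root module, so only one block's
# parameters are all-gathered at a time during forward and backward.
for layer in model.layers:
    fully_shard(layer, mp_policy=policy)
fully_shard(model, mp_policy=policy)
```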

Figure 3 shows a visual comparison of the 50-step teacher and the two-step distilled student, using the improved DMD2 method to distill Wan2.1-T2V-14B. Although the student samples 50x faster than the teacher, its generation quality closely matches the teacher's.

Figure 3. Visual comparison of the 50-step teacher with CFG=6 (NFE=100, left) and the two-step distilled student (NFE=2, right), using the improved DMD2 to distill Wan2.1-T2V-14B. NFE denotes the number of function evaluations during generation.

FastGen for interactive world modeling

Interactive world models aim to simulate environment dynamics and respond coherently to user actions or agent interventions in real time. They require:

  • High sampling efficiency
  • Long-horizon temporal consistency
  • Action-conditioned controllability

Video diffusion models provide a strong foundation for world modeling due to their ability to capture rich visual dynamics, but their multi-step sampling process and passive formulation prevent real-time interaction.

To address this, recent work has explored causal distillation, which transforms a bidirectional video diffusion model into a few-step, block-wise autoregressive model. This autoregressive structure enables real-time interaction and has become a promising foundation for interactive world models.

FastGen implements both training and inference recipes for multiple causal distillation methods, including CausVid and Self-Forcing, whose default formulations are primarily distribution-based.
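Schematically, inference with such a causally distilled student proceeds block by block. In this sketch, `student.timesteps` and `student.denoise` are hypothetical stand-ins for the student's interface, and the latent shape is arbitrary:

```python
import torch

@torch.no_grad()
def generate_video(student, prompt_emb, num_blocks=10, block_frames=4, steps=2):
    # Block-wise autoregressive rollout: each block of frames is denoised in a
    # few steps while attending causally to earlier blocks through a KV cache,
    # so new conditioning (e.g., a user action) can be injected between blocks.
    blocks, kv_cache = [], None
    for _ in range(num_blocks):
        x = torch.randn(1, block_frames, 16, 32, 32)  # latent noise for one block
        for t in student.timesteps(steps):            # few-step denoising schedule
            x, kv_cache = student.denoise(x, t, prompt_emb, kv_cache)
        blocks.append(x)
    return torch.cat(blocks, dim=1)                   # concatenate along time
```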

Trajectory-based distillation has not yet been widely applied in causal distillation due to performance degradation and trajectory misalignment between bidirectional teacher models and autoregressive students. FastGen addresses these challenges in two ways:

  1. Warm-starting causal distillation: Trajectory-based methods can be used to initialize student models before applying distribution-based objectives.
  2. Causal SFT via diffusion forcing: FastGen provides a causal supervised fine-tuning (SFT) recipe that first trains a many-step block-wise autoregressive model, which then serves as a new teacher for trajectory-based distillation.

These components enable hybrid distillation pipelines that combine the stability of trajectory-based methods with the flexibility of distribution-based objectives.
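Reusing the two loss sketches from earlier in this post, such a hybrid pipeline is, schematically, just two training phases run back to back (the batch formats, optimizer, and auxiliary score networks here are assumptions carried over from that sketch):

```python
def hybrid_distill(student, teacher, real_score, fake_score, opt,
                   trajectory_batches, noise_batches):
    # Phase 1: trajectory-based warm start. Regression onto the teacher's
    # trajectories is stable and gives the student a strong initialization.
    for x_t, t, s in trajectory_batches:
        loss = trajectory_loss(student, teacher, x_t, t, s)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Phase 2: distribution matching. The adversarial/variational objective
    # sharpens sample quality, and its sensitivity to initialization is
    # defused by the warm start from phase 1.
    for z, t in noise_batches:
        loss = distribution_loss(student, real_score, fake_score, z, t)
        opt.zero_grad()
        loss.backward()
        opt.step()
```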

On the application side, FastGen supports a wide range of open source video diffusion models, including Wan2.1, Wan2.2, and NVIDIA Cosmos-Predict2.5, and provides end-to-end acceleration for multiple video synthesis scenarios:

  • Text-to-video (T2V)
  • Image-to-video (I2V)
  • Video-to-video (V2V)

Users can flexibly customize causal distillation pipelines, for example by scaling from 2B to 14B models, adding first-frame conditioning for I2V, or incorporating structural priors such as depth guidance for driving videos in V2V tasks.

Therefore, FastGen provides the essential infrastructure for advancing interactive world models—enabling the fast, controllable, and temporally consistent generation needed to transform diffusion models from passive synthesizers into real-time interactive systems.

Get started

FastGen is designed to be more than a collection of distillation techniques—it is a unified research and engineering platform for accelerating diffusion models. By bringing together trajectory-based and distribution-based methods under a scalable and reproducible framework, FastGen lowers the barrier to experimenting with few-step diffusion models and enables fair benchmarking across approaches.

Try out FastGen today—plug in your own diffusion model, choose a distillation approach, and watch a multi-step generator transform into a one-step performer. Whether you aim to accelerate visual synthesis, speed up scientific discovery, or power interactive world models, FastGen offers the flexibility and reproducibility to move from idea to implementation in record time.
