Reasoning models are growing rapidly in size and are increasingly being integrated into agentic AI workflows that interact with other models and external tools. Deploying these models and workflows in production environments requires distributing them across multiple GPU nodes, which demands careful orchestration and coordination across GPUs.
NVIDIA Dynamo 1.0—available now—addresses these challenges by accelerating generative AI and reasoning models in large-scale distributed environments. The AI framework delivers low-latency, high-throughput distributed inference for production-grade, multi-node AI deployments.
Dynamo supports leading open source inference engines, including SGLang, NVIDIA TensorRT LLM, and vLLM. It has also delivered strong results in trusted third-party benchmarks such as MLPerf and SemiAnalysis InferenceX, reinforcing its position as a production-grade inference platform. Dynamo can boost the number of requests served by up to 7x on NVIDIA Blackwell, as demonstrated in the recent SemiAnalysis InferenceX benchmark.

SemiAnalysis InferenceX, updated March 3, 2026. Results for DeepSeek R1-0528, FP4, 1k/1k, interactivity: ~50 tok/sec/user.
This blog details how early adopters have integrated Dynamo into real-world inference workflows, the system-level performance improvements achieved, and the latest features and optimizations added to the framework.
Early adopters and real-world impact
At last year’s GTC event, NVIDIA introduced NVIDIA Dynamo, a low-latency, high-throughput distributed inference framework built for multi-node AI deployments. Since then, NVIDIA has worked collaboratively with the open source ecosystem to harden Dynamo for production-grade performance and large-scale workloads. Over this period, Dynamo has achieved significant milestones:
- Successfully deployed in production workflows: AstraZeneca, Baseten, ByteDance, CoreWeave, Crusoe, DigitalOcean, Gcore, GMI Cloud, Nebius, Meituan, Pinterest, Prime Intellect, Rednote, SoftBank Corp., Tencent Cloud, Together AI, Vultr, and many more have deployed Dynamo in production to scale multi-node inference, optimize throughput, and improve latency. Watch Dynamo Day recordings to hear directly from organizations deploying Dynamo.
- Integrated into managed Kubernetes environments: Alibaba Cloud, Amazon Web Services (AWS), Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure (OCI) have built integrations showing how Dynamo can be seamlessly deployed into their managed Kubernetes environments, scaling inference to meet the growing demand for AI.
- Adopted by major open source frameworks: Modular Dynamo components such as NIXL have been widely adopted by inference engines including llm-d, NVIDIA TensorRT LLM, SGLang, and vLLM to accelerate KV cache transfers between GPUs. LMCache has integrated its KV caching directly into storage solutions in Dynamo, SGLang has integrated its HiCache solution into Dynamo’s Router, and LangChain has built an integration that injects agentic hints for Dynamo’s Router, validating its composable architecture.
- Inspired contributions from across the AI ecosystem: Developers across the AI community have contributed to Dynamo and broadened its capabilities. Mooncake and Alibaba extended the Dynamo AIConfigurator with SGLang support; Microsoft tested and hardened Dynamo on Azure Kubernetes Service (AKS), contributing fixes, deployment guides, public demos, and Planner/AIConfigurator enhancements; Prime Intellect co‑designed and integrated LoRA adapter support; and Baseten validated early Dynamo features in production‑like environments, then upstreamed bug fixes and hardening patches.
- Enabled integration with storage solutions: Cloudian, DDN, Dell, Everpure (previously Pure Storage), HPE, IBM, NetApp, VAST, and WEKA have integrated Dynamo into their AI solutions. This allows inference workloads to scale beyond GPU memory constraints, using storage to support very large context lengths.
Dynamo 1.0 builds on these milestones while marking the framework’s maturity and production readiness. Keep reading for more highlights about the update.
Accelerating agentic inference by 4x with Dynamo and NVIDIA NeMo Agent Toolkit
Today’s inference runtimes treat every request and KV cache block the same—a system prompt reused across many turns has the same eviction priority as a one-off chain-of-thought. Multi-turn agents, however, reuse prefixes and follow predictable patterns. An evicted multi-turn KV block will need to be recomputed, resulting in wasted compute and higher inference costs. Dynamo addresses this gap with new agentic inference optimizations:
- Dynamo frontend API: Accepts agent hints (per-request metadata such as latency sensitivity, expected output length, and cache control) and passes them to the router and KV cache manager (see the sketch after this list).
- Dynamo KV-aware router: Uses priority and latency agentic hints to control queue ordering so user-facing turns run before background work. It can also take the expected output sequence length (OSL) hint into account to improve load-balancing accuracy.
- Dynamo KV cache manager: Supports experimental cache pinning. Pinned nodes resist eviction for the specified duration, and are moved to host memory rather than being deleted.
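As a rough illustration of how these hints might travel with a request, the sketch below attaches per-request metadata to an OpenAI-compatible call. The endpoint and the hint field names (nvext, priority, expected_osl, cache_pin_seconds) are illustrative assumptions rather than the exact Dynamo frontend schema; consult the Dynamo documentation for the real field names.

```python
import requests

# Illustrative sketch only: the hint fields below are assumptions, not the
# exact Dynamo frontend API schema.
payload = {
    "model": "llama-3.1-70b-instruct",
    "messages": [
        {"role": "system", "content": "You are a planning agent."},
        {"role": "user", "content": "Summarize the latest tool results."},
    ],
    # Hypothetical per-request metadata the frontend could forward to the
    # KV-aware router and KV cache manager:
    "nvext": {
        "priority": "interactive",    # user-facing turn: schedule ahead of background work
        "expected_osl": 256,          # expected output sequence length, for load balancing
        "cache_pin_seconds": 600,     # keep the shared system-prompt prefix warm
    },
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```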
The community has built on these optimizations to create custom routing and integrate agent hints into popular frameworks such as LangChain’s ChatNVIDIADynamo and the NVIDIA NeMo Agent Toolkit.
Running Dynamo with the NeMo Agent Toolkit delivered up to 4x lower time to first token (TTFT) and 1.5x higher throughput for the Llama 3.1 model on NVIDIA Hopper.

Advancing multimodal inference optimization
Dynamo 1.0 introduces three new features designed to accelerate multimodal inference in image-heavy workloads—where image encoding can be a bottleneck:
- Disaggregated encode/prefill/decode (E/P/D): Instead of running encode, prefill, and decode on the same GPU, Dynamo separates them into distinct stages. Running the encode phase on dedicated workers lets each stage scale independently, which improves batching, memory efficiency, and overall throughput.
- Multimodal embedding cache: A CPU-backed least recently used (LRU) cache stores computed image embeddings off-GPU so repeated images skip encoding entirely (a minimal sketch follows this list). This applies to both disaggregated and aggregated setups.
- Multimodal KV routing: Extends Dynamo’s KV-aware router to account for image content. A dedicated multimodal router downloads images and then selects the backend worker with the highest cache overlap, including overlap on blocks containing images.
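To make the embedding-cache idea concrete, here is a minimal, self-contained sketch of a CPU-backed LRU keyed by a content hash of the image bytes. It illustrates the caching pattern only and is not Dynamo’s actual implementation.

```python
import hashlib
from collections import OrderedDict


class EmbeddingLRUCache:
    """Minimal sketch of a CPU-backed LRU cache for image embeddings.

    Keys are content hashes of the raw image bytes, so repeated images skip
    the vision encoder entirely. Illustrative only; not Dynamo's internals.
    """

    def __init__(self, max_entries: int = 4096):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, image_bytes: bytes, encode_fn):
        key = self._key(image_bytes)
        if key in self._cache:
            self._cache.move_to_end(key)      # cache hit: mark as most recently used
            return self._cache[key]
        embedding = encode_fn(image_bytes)    # cache miss: run the vision encoder
        self._cache[key] = embedding
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)   # evict the least recently used entry
        return embedding
```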
Running the Qwen3-VL-30B-A3B-Instruct-FP8 multimodal model on NVIDIA GB200, Dynamo’s embedding cache improved TTFT by up to 30% and throughput by up to 25% on image requests.

Adding native support for video generation
New video-generation models are setting a new bar for cinematic quality and motion realism. But serving them efficiently is non-trivial: Their inference workloads are compute- and memory-intensive, especially at high resolutions.
Dynamo 1.0 adds native support for video-generation models, with integrations for leading open source inference frameworks such as FastVideo, SGLang Diffusion, TensorRT LLM Diffusion, and vLLM-Omni. This brings Dynamo’s modular stack—including its low-overhead front end, streaming capabilities, and high-efficiency scheduling engine—to modern video workloads.
This integration demonstrates that state‑of‑the‑art video generation can be delivered efficiently on Dynamo. For a step‑by‑step walkthrough of how to deploy video generation models with Dynamo, check out this how‑to guide.
Accelerating inference startup by 7x with Dynamo ModelExpress
Modern inference clusters are constantly spinning new replicas up and down in response to traffic. Each new process has to repeat the same heavy startup pipeline:
- Downloading model checkpoints
- Loading weights from remote or shared storage
- Applying model optimizations
- Compiling kernels
- Building NVIDIA CUDA graphs
To solve that challenge, Dynamo ensures that the expensive parts of worker startup are done once and reused many times through two new ModelExpress capabilities:
- Checkpoint restore: Instead of treating every replica as a fresh boot, Dynamo runs the full initialization sequence a single time, captures the “ready‑to‑serve” state to persistent storage, and then brings new replicas online by restoring from that checkpoint rather than rebuilding everything from scratch.
- Model weight streaming: Rather than having each new worker independently download model weights, write them to local or shared storage, and then load them into GPU memory, ModelExpress loads the model once on an initial worker and streams the weights to additional workers over high-bandwidth interconnects using NVIDIA Inference Xfer Library (NIXL) and NVIDIA NVLink, eliminating reliance on storage bandwidth (a minimal sketch follows below).
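The sketch below captures the weight-streaming decision in plain Python, assuming caller-supplied load_from_storage and stream_from_peer functions; it is a conceptual illustration, not ModelExpress or NIXL code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WeightSource:
    """Illustrative registry of workers that already hold ready-to-serve weights.

    Conceptual sketch only: ModelExpress coordinates this with NIXL/NVLink
    transfers, not with a Python object like this one.
    """
    ready_workers: List[str] = field(default_factory=list)

    def bring_up(
        self,
        worker_id: str,
        load_from_storage: Callable[[], Dict],    # hypothetical: full cold-start load
        stream_from_peer: Callable[[str], Dict],  # hypothetical: peer-to-peer weight copy
    ) -> Dict:
        if not self.ready_workers:
            # The first replica pays the full cost: download and load from storage once.
            weights = load_from_storage()
        else:
            # Later replicas stream weights from an already-loaded peer over a
            # high-bandwidth interconnect, avoiding storage bandwidth entirely.
            weights = stream_from_peer(self.ready_workers[0])
        self.ready_workers.append(worker_id)
        return weights
```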

For fleets that scale aggressively, model weight streaming can accelerate model loading time by up to 7x for large MoE models like DeepSeek v3 on NVIDIA H200.
Scaling Kubernetes on NVIDIA GB300 NVL72
NVIDIA Grove, an open source API that’s part of Dynamo, simplifies deploying hierarchical, gang-scheduled, topology‑aware AI workloads on Kubernetes. In Dynamo 1.0, Grove adds setup automation for the NVIDIA NVLink fabric on rack‑scale systems such as NVIDIA GB300 NVL72. This allows users to define placement policies across every layer of infrastructure—from cloud regions and availability zones down to data centers, network blocks, racks, hosts, and even non-uniform memory access (NUMA) nodes.

Traditionally, using the NVIDIA GB300 NVL72 NVLink fabric required users to manually define and manage compute domains. This release introduces a unified topology API that enables developers to seamlessly colocate prefill and decode on the same NVIDIA NVL72 rack to optimize KV cache transfers, confine an inference stack to a single data center for latency needs, and place frontend services on nearby CPU‑only nodes for efficient request handling. Grove integrates with advanced AI schedulers, such as KAI Scheduler, to ensure these constraints are enforced.
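As a loose illustration of the kinds of constraints described above, the snippet below expresses them as a plain Python mapping; the field names are hypothetical and do not reflect the actual Grove topology API.

```python
# Hypothetical rendering of the placement intents described above.
# Field names are illustrative only, not the Grove API.
placement_policy = {
    "prefill":  {"colocate_with": "decode", "level": "rack"},   # keep P/D on one NVL72 rack for fast KV transfers
    "decode":   {"constraint": {"level": "rack"}},
    "frontend": {"constraint": {"level": "host"}, "node_class": "cpu-only"},  # request handling on nearby CPU nodes
    "workload": {"constraint": {"level": "datacenter"}},        # keep the whole stack in one data center for latency
}
```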
Integration with the Kubernetes Inference Gateway
A previous Dynamo release introduced a plugin that allows users to combine the routing capabilities of the Kubernetes-native Inference Gateway extension with Dynamo’s KV-aware router.

In a typical Dynamo setup, routing is handled by Dynamo’s KV-aware router. The router evaluates worker queue depth and relevant KV cache information on each worker, then makes a probabilistic decision using a weighted combination of these factors.
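Here is a minimal sketch of that decision, assuming a simple weighted score over prefix-cache overlap and queue depth followed by softmax sampling; the weights and scoring function are illustrative, not Dynamo’s exact formula.

```python
import math
import random


def pick_worker(workers, overlap_weight=1.0, load_weight=1.0, temperature=1.0):
    """Illustrative KV-aware routing sketch: favor workers with high prefix-cache
    overlap and low queue depth, then sample probabilistically.

    `workers` is a list of dicts like {"id": ..., "cache_overlap": 0.0-1.0,
    "queue_depth": int}. The scoring below is an assumption, not Dynamo's formula.
    """
    scores = [
        overlap_weight * w["cache_overlap"] - load_weight * w["queue_depth"]
        for w in workers
    ]
    # Softmax turns the scores into a probability distribution over workers.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(workers, weights=probs, k=1)[0]


workers = [
    {"id": "worker-0", "cache_overlap": 0.8, "queue_depth": 3},
    {"id": "worker-1", "cache_overlap": 0.1, "queue_depth": 1},
]
print(pick_worker(workers)["id"])
```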
Dynamo’s KV-aware router can run inside the Inference Gateway to benefit from integration with routing plugins, filters, and other gateway capabilities in Kubernetes-based environments.
Deploying fast, latency-aware inference with zero configuration
Deploying large models requires deep expertise that balances latency, throughput, and cost targets through complex scaling and configuration steps. Dynamo’s new Dynamo Graph Deployment Request (DGDR) removes that friction by providing a simple, one‑step path from service‑level objectives (SLOs) to optimized inference deployments.
DGDR combines the intelligence of the planner and AIConfigurator into a unified, Kubernetes‑native deployment flow. Instead of navigating multiple tools, scripts, and guesswork, developers can now specify a model, target hardware, and traffic goals in a YAML file—and soon, through an intuitive web UI—and Dynamo handles the rest.
Behind the scenes, the AIConfigurator runs rapid, simulation‑based recommendations for quick iteration, while the planner engages deeper on‑cluster profiling for precise, production‑grade optimization. Both routes deliver an auto-deployable Dynamo Graph Deployment (DGD) that meets the user’s desired balance of cost, performance, and scalability, without hand-tuned deployment configurations.
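For a sense of what such an SLO-driven spec might contain, here is a hypothetical example rendered from Python with PyYAML; the field names (slo, ttft_ms, itl_ms, and so on) are assumptions, not the actual DGDR schema.

```python
import yaml  # requires PyYAML

# Hypothetical DGDR-style spec; field names are illustrative assumptions,
# not the actual custom resource schema.
dgdr_spec = {
    "model": "deepseek-ai/DeepSeek-R1",
    "hardware": {"gpu": "GB200", "maxGpus": 72},
    "slo": {
        "ttft_ms": 300,              # time-to-first-token target
        "itl_ms": 20,                # inter-token latency target
        "requests_per_second": 50,   # sustained traffic goal
    },
}
print(yaml.safe_dump(dgdr_spec, sort_keys=False))
```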
Increasing resiliency with fault detection and request migration
A key design principle in Dynamo is to be resilient by default so applications keep running even when individual workers fail or hang. The updated Dynamo fault tolerance combines two pillars:
- Early fault detection: Dynamo adds a framework-independent “canary health check” that probes workers on a configurable schedule (a minimal sketch follows this list). If these checks do not receive a valid response, the worker is marked unhealthy and removed from routing. The Dynamo frontend also performs active detection using network-level signals: if establishing a new stream to a worker fails, or an existing stream ends unexpectedly mid-request, that worker is immediately removed from the set of active workers (for about five seconds) so no new requests are sent to it.
- Request cancellation and migration: Request cancellation support is enabled out-of-the-box, allowing in-flight work to be terminated when it no longer makes sense to continue. When a worker becomes unavailable, Dynamo can migrate affected requests to another worker and resume processing, preserving the request itself rather than forcing the client to resubmit from scratch. This ensures failures do not automatically translate into user-visible errors.
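Here is a minimal sketch of a canary-style sweep, assuming caller-supplied probe and route_to callables; it illustrates the detection loop only and is not Dynamo’s implementation.

```python
import time


def health_sweep(workers, probe, timeout_s=2.0):
    """One pass of an illustrative canary-style health check (not Dynamo's code).

    `probe(worker, timeout_s)` is caller-supplied and should return True when the
    worker answers a lightweight request in time. Workers that fail the probe are
    excluded from routing until a later sweep sees them respond again.
    """
    healthy, unhealthy = [], []
    for worker in workers:
        try:
            ok = probe(worker, timeout_s)
        except Exception:
            ok = False
        (healthy if ok else unhealthy).append(worker)
    return healthy, unhealthy


def canary_loop(workers, probe, route_to, interval_s=5.0):
    """Run the sweep on a configurable schedule and route only to healthy workers."""
    while True:
        healthy, _ = health_sweep(workers, probe)
        route_to(healthy)        # hand the router the current healthy set
        time.sleep(interval_s)
```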
With this layered health detection, combined with request cancellation and migration, Dynamo aims to keep LLM applications responsive even when individual workers fail.

Advancing KV caching to storage
In Dynamo 1.0, KV Block Manager (KVBM) introduces several features that enhance flexibility, visibility, and deployment options:
- Object storage support: KVBM now works with Amazon Simple Storage Service (S3) and the Azure-style blob APIs used by major storage vendors and cloud providers. This allows model operators to integrate KVBM with existing file systems, S3, or other cloud object stores without building separate KV offload pipelines for each backend.
- Global KV event emission: KVBM emits events whenever KV blocks move between storage tiers (GPU memory, CPU memory, local SSD, and remote storage) or are evicted (a minimal sketch follows this list). The KV router’s indexer consumes these events to maintain a consistent, cluster-wide view of KV block locations, enabling smarter routing and improved cache reuse across multiple model replicas and inference engines.
- Pip-installable module: KVBM can now be installed directly into inference engines like vLLM or TensorRT LLM without requiring the complete Dynamo stack. Teams using different inference frameworks can share a common KV offload tool rather than re-implementing eviction policies and storage integrations.
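The sketch below shows the kind of indexer such events enable: a mapping from block hashes to the workers and tiers that currently hold them. The event shape and field names are assumptions, not the actual KVBM event schema.

```python
from collections import defaultdict


class KVBlockIndex:
    """Illustrative indexer that consumes KV block events to track block
    locations across workers and storage tiers. The event fields below are
    assumptions, not the actual KVBM event schema.
    """

    def __init__(self):
        # block_hash -> {(worker_id, tier), ...}
        self._locations = defaultdict(set)

    def apply(self, event: dict):
        key = event["block_hash"]
        loc = (event["worker_id"], event["tier"])  # tier: e.g. "gpu", "cpu", "ssd", "remote"
        if event["type"] == "stored":
            self._locations[key].add(loc)
        elif event["type"] == "evicted":
            self._locations[key].discard(loc)
            if not self._locations[key]:
                del self._locations[key]

    def lookup(self, block_hash: str):
        """Return every (worker, tier) currently believed to hold this block."""
        return set(self._locations.get(block_hash, ()))
```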

Looking ahead
Looking forward, the Dynamo product roadmap will focus on expanding multimodal capabilities to support richer and more context-aware interactions, advancing diffusion-based models to unlock real-time, higher-quality video generation, and scaling agentic workloads and reinforcement learning. Dynamo is being built in the open with the community. To get involved, explore the code and issues in the NVIDIA GitHub repository, drop into the biweekly Dynamo office hours, and dive into the existing technical blogs.
Acknowledgments
Akshatha Kamath, Anish Maddipoti, Anna Tchernych, Ben Hamm, Biswa Ranjan Panda, Dhruv Nandakumar, Ekin Karabulut, Ganesh Kudleppanavar, Hannah Simmons, Hannah Zhang, Harry Kim, Hongkuan Zhou, Hyunjae Woo, Ishan Dhanani, Itay Neeman, Jacky Hui, Jakub Kosek, John Kim, Kavin Krishnan, Kyle Kranen, Maksim Khadkevich, Michael Demoret, Moein Khazraee, Neal Vaidya, Neelay Shah, Qi Wang, Ryan McCormick, Sanjay Chatterjee, Schwinn Saereesitthipitak, Suman Tatiraju, Vikram Sharma Mailthody, Vishwanath Venkatesan, and many others contributed to this post.