The emergence of several new frontier open source models in recent weeks, including OpenAI’s gpt-oss and Moonshot AI’s Kimi K2, signals a wave of rapid LLM innovation. Dynamo 0.4, available today, delivers new capabilities for deploying these models at scale and at low cost, with a focus on performance, observability, and autoscaling based on service-level objectives (SLOs).
Key Dynamo 0.4 highlights include:
- 4x faster performance with disaggregation on NVIDIA Blackwell
- Large-scale expert parallel deployment guides on GB200 NVL72 and Hopper
- New prefill-decode (PD) configurator tool to simplify disaggregated setups
- SLO-based PD autoscaling with Kubernetes integration
- Built-in observability metrics for real-time performance monitoring
- Enhanced resiliency with inflight request re-routing and early failure detection
Read on for more information about these updates.
How Dynamo 0.4 delivers 4x faster inference performance with disaggregation
The Dynamo 0.4 release brings significant disaggregated serving performance gains on NVIDIA Blackwell. Running the new OpenAI gpt-oss-120b model using Dynamo and TensorRT-LLM on NVIDIA B200 achieved up to 4x faster interactivity (tokens/second/user) for very long input sequence lengths (common in agentic workflows, code generation, and summarization), without any throughput tradeoffs.
Additionally, running the DeepSeek-R1 671B model on NVIDIA GB200 NVL72 and TensorRT-LLM with Dynamo achieved 2.5x higher throughput (tokens/second/GPU) without any increase in inference costs.

These performance boosts were made possible by disaggregated serving in Dynamo, which decouples the prefill and decode phases of model inference across separate GPUs. By separating these stages, Dynamo enables flexible allocation of GPU resources and model parallelism to each phase based on its specific requirements, significantly improving overall efficiency.
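To make the split concrete, here is a minimal conceptual sketch (hypothetical names, not Dynamo’s API) of a request flowing through separately sized prefill and decode pools, assuming the KV cache produced by prefill can be handed to any decode worker:

```python
# Conceptual sketch of disaggregated serving (hypothetical names, not Dynamo's API).
# A prefill pool processes the full prompt once and emits a KV cache; a decode pool
# streams tokens from that cache. Each pool can be sized and parallelized independently.
from dataclasses import dataclass

@dataclass
class KVCache:
    request_id: str
    num_tokens: int          # prompt tokens whose attention state is already computed

def prefill(request_id: str, prompt_tokens: list[int]) -> KVCache:
    # Compute-bound: one forward pass over the whole prompt.
    return KVCache(request_id=request_id, num_tokens=len(prompt_tokens))

def decode(cache: KVCache, max_new_tokens: int):
    # Memory-bandwidth-bound: one token per step, reusing (and growing) the KV cache.
    for step in range(max_new_tokens):
        yield f"token_{cache.num_tokens + step}"   # placeholder for the sampled token

# A request flows prefill -> KV-cache transfer -> decode, so the two pools can use
# different GPU counts and parallelism settings to match their distinct bottlenecks.
cache = prefill("req-1", prompt_tokens=list(range(8192)))
print(list(decode(cache, max_new_tokens=4)))
```

Because prefill is compute-bound and decode is memory-bandwidth-bound, keeping the two pools separate lets each one use the GPU count and parallelism that matches its own bottleneck.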
Today, we’re excited to release the scripts that empower the community to reproduce these results and fully leverage the cost efficiency of disaggregated serving architectures. See the GitHub links here:
- Deploy OpenAI gpt-oss-120b using Dynamo and TensorRT-LLM on B200 (8xGPUs)
- Deploy DeepSeek-R1 671B using Dynamo and TensorRT-LLM on GB200 NVL72 (64xGPUs)
To enable researchers, engineers, and organizations to explore the benefits of MoE model serving with disaggregated serving, we’re also providing comprehensive, step-by-step deployment guides that walk you through setting up DeepSeek-R1 with SGLang and Llama4 Maverick with TensorRT-LLM in multi-node environments using Dynamo. See the GitHub links here:
- Deploy DeepSeek-R1 using Dynamo and SGLang on GB200 NVL72 (56xGPUs)
- Deploy DeepSeek-R1 using Dynamo and SGLang on H100 (104xGPUs)
- Deploy Llama4 Maverick using Dynamo and TRT-LLM on GB200 NVL72 (16xGPUs)
How to remove the guesswork from setting up your disaggregated serving cluster
One of the key challenges we’ve consistently heard from inference teams adopting disaggregated serving is the difficulty in estimating the expected throughput benefits and determining the right configuration for their specific deployments. Specifically, users told us they struggle with choosing how many GPUs to allocate to the prefill and decode stages, and what type of model parallelism to use to meet their target SLOs.
To address this, we’re introducing AIConfigurator, a new tool that recommends an optimal PD disaggregation configuration and model parallelism strategy for a given model and GPU budget while meeting SLOs.

By leveraging a rich set of pre-measured performance data across different layers of the model (including attention, FFN, communication, and memory), and by modeling different scheduling techniques (static batching, inflight batching, and disaggregated serving), AIConfigurator suggests PD configurations that satisfy user-defined SLOs within the given GPU budget while maximizing throughput per GPU. It then automatically generates backend configurations that can be deployed seamlessly in Dynamo.
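Conceptually, the search looks something like the simplified sketch below: enumerate ways of splitting a GPU budget between prefill and decode, estimate TTFT, ITL, and throughput for each, and keep the best candidate that meets the SLOs. The numbers and names here are made up for illustration; this is not AIConfigurator’s actual cost model or code.

```python
# Simplified sketch of an SLO-constrained PD-configuration search (hypothetical data,
# not AIConfigurator's internals). Each candidate splits a fixed GPU budget between
# prefill and decode and is scored with rough latency/throughput estimates.
from dataclasses import dataclass

@dataclass
class Candidate:
    prefill_gpus: int
    decode_gpus: int
    tp: int                      # tensor-parallel degree for decode workers
    ttft_ms: float               # estimated time to first token
    itl_ms: float                # estimated inter-token latency
    tokens_per_sec_per_gpu: float

def estimate(prefill_gpus: int, decode_gpus: int, tp: int) -> Candidate:
    # Placeholder cost model; in practice these numbers come from pre-measured
    # per-layer timings (attention, FFN, communication, memory) for the target model/GPU.
    total_gpus = prefill_gpus + decode_gpus
    ttft_ms = 800.0 / prefill_gpus
    itl_ms = 30.0 / tp + 2.0
    tokens_per_sec = 150.0 * decode_gpus          # decode capacity scales with decode GPUs
    return Candidate(prefill_gpus, decode_gpus, tp, ttft_ms, itl_ms,
                     tokens_per_sec / total_gpus)

def best_config(gpu_budget: int, ttft_slo_ms: float, itl_slo_ms: float):
    candidates = [
        estimate(p, gpu_budget - p, tp)
        for p in range(1, gpu_budget)
        for tp in (1, 2, 4)
        if tp <= gpu_budget - p
    ]
    feasible = [c for c in candidates
                if c.ttft_ms <= ttft_slo_ms and c.itl_ms <= itl_slo_ms]
    # Among SLO-compliant candidates, pick the one with the best throughput per GPU.
    return max(feasible, key=lambda c: c.tokens_per_sec_per_gpu, default=None)

print(best_config(gpu_budget=8, ttft_slo_ms=300.0, itl_slo_ms=10.0))
```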
We’re launching AIConfigurator with both CLI and web interfaces, and initial support for TensorRT-LLM on NVIDIA Hopper. Additional inference frameworks and NVIDIA hardware will be supported in upcoming releases.
How to consistently meet inference SLOs without over- or under-provisioning GPUs
In our May 0.2 release, we introduced the first version of Planner, a GPU autoscaling engine purpose-built for generative AI inference and PD disaggregation. By monitoring prefill queue and decode memory usage, Planner intelligently scaled inference workers up or down to maximize GPU utilization and minimize inference costs.
With the 0.4 release, we’re taking Planner a step further. We’re introducing SLO-based autoscaling, enabling inference teams to not only optimize for cost, but also to reliably meet strict performance targets like Time to First Token (TTFT) and Inter-Token Latency (ITL).
Unlike traditional, reactive scaling systems, the new SLO-based Planner takes a forward-looking approach:
- It leverages pre-deployment profiling to understand how your deployment behaves under different model parallel and batching configurations.
- It suggests the most cost-effective engine configurations based on your SLOs.
- It predicts future traffic patterns using advanced time-series models such as ARIMA or Prophet.
- It calculates the minimum number of PD workers required to meet SLA targets under predicted demand.
- It continuously assesses traffic patterns and dynamically re-adjusts PD workers to maintain target SLAs.
What sets Planner apart is its ability to forecast the impact of changes in input/output sequence lengths and proactively scale resources before bottlenecks occur.
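As a rough illustration of that loop (forecast demand, translate it into worker counts, then scale), here is a minimal sketch using statsmodels’ ARIMA and hypothetical per-worker capacity numbers; it is not the Planner implementation.

```python
# Minimal sketch of SLO-based autoscaling (hypothetical capacity numbers, not Planner's code):
# forecast near-term request rate, convert it into prefill/decode worker counts derived from
# profiling, and hand the result to whatever actually scales the deployment.
import math
from statsmodels.tsa.arima.model import ARIMA

# Requests/second observed over the last N intervals (would come from live metrics in practice).
request_rate_history = [40, 42, 55, 61, 58, 70, 75, 80, 78, 90, 95, 100]

def forecast_peak_rate(history, steps_ahead=3) -> float:
    model = ARIMA(history, order=(1, 1, 1)).fit()
    return max(model.forecast(steps=steps_ahead))

def required_workers(peak_rps: float,
                     prefill_rps_per_worker: float = 25.0,   # from pre-deployment profiling
                     decode_rps_per_worker: float = 40.0):   # at the chosen parallelism
    return (math.ceil(peak_rps / prefill_rps_per_worker),
            math.ceil(peak_rps / decode_rps_per_worker))

peak = forecast_peak_rate(request_rate_history)
prefill_workers, decode_workers = required_workers(peak)
print(f"forecast peak ~{peak:.0f} req/s -> {prefill_workers} prefill / {decode_workers} decode workers")
```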
SLO-based Planner allows inference teams to:
- Stay in control of user experience and infrastructure spend
- Maintain SLA performance without over- or under-provisioning resources
- Optimize GPU usage without manual tuning
Watch the demo video below to see Planner in action:
Planner integrates natively with Kubernetes, making it easy for organizations that have standardized on containerized infrastructure to deploy Dynamo and use Planner to scale their AI workloads. This release includes Planner support for vLLM, with support for additional inference frameworks coming in future updates.
How to track real-time inference observability metrics
Observability is critical in large-scale distributed inference environments, where latency, throughput, and GPU utilization must be continuously optimized in real time. It enables engineering teams to monitor system health, diagnose performance bottlenecks, and meet strict SLOs.

With this release, Dynamo workers and components across the event, control, and data planes now emit key observability metrics, including:
- Average requests per second and request duration
- Average time to first token (TTFT) and inter-token latency (ITL)
- Average input and output sequence length
- GPU utilization and power usage
These metrics are collected using the open source Prometheus toolkit and can be easily consumed in open source monitoring and observability tools like Grafana without custom development efforts.
This release also includes an API for engineering teams and solution architects to define and emit custom metrics tailored to their serving environments, providing further flexibility and extensibility.
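As an example of the kind of signal a team might add alongside the built-in metrics, here is a small stand-alone sketch using the open source prometheus_client Python library. The metric names are hypothetical and this is not Dynamo’s custom-metrics API; it simply shows how such metrics end up on a Prometheus-scrapable endpoint that Grafana can chart.

```python
# Stand-alone example of emitting custom Prometheus metrics with prometheus_client
# (illustrative only; metric names are hypothetical and this is not Dynamo's metrics API).
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

TTFT_SECONDS = Histogram("demo_time_to_first_token_seconds",
                         "Time to first token per request")
INPUT_TOKENS = Histogram("demo_input_sequence_length_tokens",
                         "Input sequence length per request")
ACTIVE_REQUESTS = Gauge("demo_active_requests",
                        "Requests currently being served")

def handle_request(prompt_tokens: int) -> None:
    ACTIVE_REQUESTS.inc()
    INPUT_TOKENS.observe(prompt_tokens)
    with TTFT_SECONDS.time():                   # records the elapsed time as a TTFT sample
        time.sleep(random.uniform(0.05, 0.2))   # stand-in for prefill + first token
    ACTIVE_REQUESTS.dec()

if __name__ == "__main__":
    start_http_server(9400)                     # metrics scrapeable at http://localhost:9400/metrics
    while True:
        handle_request(prompt_tokens=random.randint(128, 8192))
```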
This observability foundation in Dynamo 0.4 sets the stage for upcoming releases, which will introduce more granular, use case–specific metrics, including for PD disaggregation.
How Dynamo 0.4 enhances resiliency and early fault detection
Deploying frontier reasoning MoE models at scale requires multinode environments that can span hundreds of GPUs. In these setups, a failure in any software or hardware component, no matter how brief, can interrupt the operation of the entire system and result in delays and failed user requests—disrupting business operations and hurting customer experience.
The Dynamo 0.4 release introduces fault tolerance and resilience features, including inflight request re-routing. In previous versions, requests sent to offline GPUs would fail and bounce back to higher layers of the inference stack or to the end user. This triggered retries that repeated pre-processing steps such as tokenization and embeddings, wasting compute and increasing latency. With this update, Dynamo re-routes requests inflight, preserving intermediate computations and forwarding them directly to online GPUs, eliminating redundant work.
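The sketch below (hypothetical names, not Dynamo’s router code) illustrates the idea: tokenization happens once, and when a worker turns out to be offline, the same tokenized request is forwarded to the next online worker instead of failing back to the client.

```python
# Conceptual sketch of inflight re-routing (hypothetical names, not Dynamo's router code).
# Tokenization happens once; if a worker goes offline mid-dispatch, the same tokenized
# request is forwarded to another online worker instead of failing back to the client.

class WorkerOffline(Exception):
    pass

def tokenize(prompt: str) -> list[int]:
    return [ord(c) for c in prompt]          # stand-in for real tokenization

def dispatch(tokens: list[int], workers: list) -> str:
    last_error = None
    for worker in workers:                   # ordered by the router's scoring
        try:
            return worker(tokens)            # may raise if the worker has gone offline
        except WorkerOffline as err:
            last_error = err                 # keep the tokenized request, try the next worker
    raise RuntimeError("no online workers available") from last_error

def flaky_worker(tokens):   raise WorkerOffline("GPU node lost")
def healthy_worker(tokens): return f"generated reply from {len(tokens)} prompt tokens"

tokens = tokenize("Why is the sky blue?")    # pre-processing done exactly once
print(dispatch(tokens, [flaky_worker, healthy_worker]))
```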

Additionally, this release introduces faster failure detection. In prior versions, etcd—a key component of Dynamo’s control plane—was responsible for detecting offline workers and broadcasting that state across the system. However, this added several seconds of delay during which requests could still be routed to offline workers. The new release introduces early failure detection within the Dynamo smart router, allowing it to bypass etcd and react to critical health signals. This reduces the detection-to-recovery window and significantly cuts down on failed requests.
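A simplified way to picture the change: the router keeps its own record of each worker’s most recent health signal and skips anything stale right away, rather than waiting for the control plane to broadcast the failure. The sketch below is a hypothetical illustration, not the smart router’s implementation.

```python
# Simplified illustration of router-local failure detection (hypothetical, not Dynamo's code).
# Each worker's last health signal is tracked locally; anything stale is skipped immediately,
# without waiting for the control plane to broadcast the failure.
import time

HEALTH_TIMEOUT_S = 2.0
last_heartbeat: dict[str, float] = {}        # worker id -> time of last health signal

def record_heartbeat(worker_id: str) -> None:
    last_heartbeat[worker_id] = time.monotonic()

def online_workers() -> list[str]:
    now = time.monotonic()
    return [w for w, ts in last_heartbeat.items() if now - ts < HEALTH_TIMEOUT_S]

record_heartbeat("decode-0")
record_heartbeat("decode-1")
last_heartbeat["decode-1"] -= 5.0            # simulate a worker that stopped reporting
print(online_workers())                      # -> ['decode-0']; requests skip decode-1 right away
```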
Back to the basics: What happens during inference when you ask an LLM a question?
If you want to revisit the fundamentals of disaggregated serving in NVIDIA Dynamo, it helps to start with what happens when you ask an LLM a question—a process called inference, which spans everything from prefill to decode and token prediction.
In this video, we break down exactly how it works, how it’s evolving and how NVIDIA Dynamo accelerates each stage. Learn how disaggregated serving splits these steps across multiple GPUs for faster, more efficient AI responses.
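For readers who prefer code to video, here is a toy sketch of those two phases: a single prefill pass over the prompt, followed by the decode loop that predicts one token at a time. Everything here is made up for illustration; there is no real model behind it.

```python
# Toy illustration of the two inference phases (no real model): prefill processes the whole
# prompt in one pass, then decode predicts one token at a time, feeding each new token back
# in until the response is complete.
import random

def prefill(prompt_tokens: list[int]) -> list[int]:
    # One pass over all prompt tokens; its attention state (KV cache) is reused by decode.
    return list(prompt_tokens)                 # stand-in for the cached context

def predict_next_token(context: list[int]) -> int:
    return random.randint(0, 31999)            # stand-in for sampling from model logits

prompt = [101, 2054, 2003, 1996, 3712, 102]    # made-up token IDs for a short question
context = prefill(prompt)
generated = []
for _ in range(5):                             # decode: one new token per iteration
    token = predict_next_token(context)
    context.append(token)                      # the new token becomes part of the context
    generated.append(token)
print(generated)
```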
How to get involved
We’re excited to keep advancing Dynamo with the help of our developer community. Catch up on past Office Hours recordings and tune in to our upcoming Office Hours to get your questions answered directly by the team.
Join our Discord community to connect with other developers, share feedback, and get support in real time. And if you’re excited about where we’re headed, check out the open source repo. We welcome contributions, issues, and ideas from the community.