The emergence of several new frontier open source models in recent weeks, including OpenAI’s gpt-oss and Moonshot AI’s Kimi K2, signals a wave of rapid LLM innovation. Dynamo 0.4, available today, delivers new capabilities for deploying these models at scale and at low cost, with a focus on performance, observability, and autoscaling based on service-level objectives (SLOs).
Key Dynamo 0.4 highlights include:
- 4x faster performance with disaggregation on NVIDIA Blackwell
- Large-scale expert parallel deployment guides on GB200 NVL72 and Hopper
- New prefill-decode (PD) configurator tool to simplify disaggregated setups
- SLO-based PD autoscaling with Kubernetes integration
- Built-in observability metrics for real-time performance monitoring
- Enhanced resiliency with inflight request re-routing and early failure detection
Read on for more information about these updates.
How Dynamo 0.4 delivers 4x faster inference performance with disaggregation
The Dynamo 0.4 release brings significant disaggregated serving performance gains on NVIDIA Blackwell. Running the new OpenAI gpt-oss-120b model using Dynamo and TensorRT-LLM on NVIDIA B200 achieved up to 4x faster interactivity (tokens/second/user) for very long input sequence lengths (common in agentic workflows, code generation, and summarization), without any throughput tradeoffs.
Additionally, running the DeepSeek-R1 671B model on NVIDIA GB200 NVL72 and TensorRT-LLM with Dynamo achieved 2.5x higher throughput (tokens/second/GPU) without any increase in inference costs.

These performance boosts were made possible by disaggregated serving in Dynamo, which decouples the prefill and decode phases of model inference across separate GPUs. By separating these stages, Dynamo enables flexible allocation of GPU resources and model parallelism to each phase based on its specific requirements, significantly improving overall efficiency.
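To make the split concrete, here is a minimal conceptual sketch (hypothetical names, not Dynamo’s API) of a request flowing through separately sized prefill and decode pools, assuming the KV cache produced by prefill can be handed to any decode worker:

```python
# Conceptual sketch of disaggregated serving (hypothetical names, not Dynamo's API).
# A prefill pool processes the full prompt once and emits a KV cache; a decode pool
# streams tokens from that cache. Each pool can be sized and parallelized independently.
from dataclasses import dataclass

@dataclass
class KVCache:
    request_id: str
    num_tokens: int          # prompt tokens whose attention state is already computed

def prefill(request_id: str, prompt_tokens: list[int]) -> KVCache:
    # Compute-bound: one forward pass over the whole prompt.
    return KVCache(request_id=request_id, num_tokens=len(prompt_tokens))

def decode(cache: KVCache, max_new_tokens: int):
    # Memory-bandwidth-bound: one token per step, reusing (and growing) the KV cache.
    for step in range(max_new_tokens):
        yield f"token_{cache.num_tokens + step}"   # placeholder for the sampled token

# A request flows prefill -> KV-cache transfer -> decode, so the two pools can use
# different GPU counts and parallelism settings to match their distinct bottlenecks.
cache = prefill("req-1", prompt_tokens=list(range(8192)))
print(list(decode(cache, max_new_tokens=4)))
```

Because prefill is compute-bound and decode is memory-bandwidth-bound, keeping the two pools separate lets each one use the GPU count and parallelism that matches its own bottleneck.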
Today, we’re excited to release the scripts that empower the community to reproduce these results and fully leverage the cost efficiency of disaggregated serving architectures. See the GitHub links here:
- Deploy OpenAI gpt-oss-120b using Dynamo and TensorRT-LLM on B200 (8xGPUs)
- Deploy DeepSeek-R1 671B using Dynamo and TensorRT-LLM on GB200 NVL72 (64xGPUs)
To enable researchers, engineers, and organizations to explore the benefits of MoE model serving with disaggregated serving, we’re also providing comprehensive, step-by-step deployment guides that walk you through setting up DeepSeek-R1 with SGLang and Llama4 Maverick with TensorRT-LLM in multi-node environments using Dynamo. See the GitHub links here:
- Deploy DeepSeek-R1 using Dynamo and SGLang on GB200 NVL72 (56xGPUs)
- Deploy DeepSeek-R1 using Dynamo and SGLang on H100 (104xGPUs)
- Deploy Llama4 Maverick using Dynamo and TRT-LLM on GB200 NVL72 (16xGPUs)
How to remove the guesswork from setting up your disaggregated serving cluster
One of the key challenges we’ve consistently heard from inference teams adopting disaggregated serving is the difficulty in estimating the expected throughput benefits and determining the right configuration for their specific deployments. Specifically, users told us they struggle with choosing how many GPUs to allocate to the prefill and decode stages, and what type of model parallelism to use to meet their target SLOs.
To address this, we’re introducing AIConfigurator, a new tool that recommends an optimal PD disaggregation configuration and model parallelism strategy for a given model and GPU budget while meeting SLOs.

By leveraging a rich set of pre-measured performance data across different layers of the model (including attention, FFN, communication, and memory), and by modeling different scheduling techniques (static batching, inflight batching, and disaggregated serving), AIConfigurator suggests PD configurations that satisfy user-defined SLOs within the given GPU budget while maximizing throughput per GPU. It then automatically generates backend configurations that can be deployed seamlessly in Dynamo.
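Conceptually, the search looks something like the simplified sketch below: enumerate ways of splitting a GPU budget between prefill and decode, estimate TTFT, ITL, and throughput for each, and keep the best candidate that meets the SLOs. The numbers and names here are made up for illustration; this is not AIConfigurator’s actual cost model or code.

```python
# Simplified sketch of an SLO-constrained PD-configuration search (hypothetical data,
# not AIConfigurator's internals). Each candidate splits a fixed GPU budget between
# prefill and decode and is scored with rough latency/throughput estimates.
from dataclasses import dataclass

@dataclass
class Candidate:
    prefill_gpus: int
    decode_gpus: int
    tp: int                      # tensor-parallel degree for decode workers
    ttft_ms: float               # estimated time to first token
    itl_ms: float                # estimated inter-token latency
    tokens_per_sec_per_gpu: float

def estimate(prefill_gpus: int, decode_gpus: int, tp: int) -> Candidate:
    # Placeholder cost model; in practice these numbers come from pre-measured
    # per-layer timings (attention, FFN, communication, memory) for the target model/GPU.
    total_gpus = prefill_gpus + decode_gpus
    ttft_ms = 800.0 / prefill_gpus
    itl_ms = 30.0 / tp + 2.0
    tokens_per_sec = 150.0 * decode_gpus          # decode capacity scales with decode GPUs
    return Candidate(prefill_gpus, decode_gpus, tp, ttft_ms, itl_ms,
                     tokens_per_sec / total_gpus)

def best_config(gpu_budget: int, ttft_slo_ms: float, itl_slo_ms: float):
    candidates = [
        estimate(p, gpu_budget - p, tp)
        for p in range(1, gpu_budget)
        for tp in (1, 2, 4)
        if tp <= gpu_budget - p
    ]
    feasible = [c for c in candidates
                if c.ttft_ms <= ttft_slo_ms and c.itl_ms <= itl_slo_ms]
    # Among SLO-compliant candidates, pick the one with the best throughput per GPU.
    return max(feasible, key=lambda c: c.tokens_per_sec_per_gpu, default=None)

print(best_config(gpu_budget=8, ttft_slo_ms=300.0, itl_slo_ms=10.0))
```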
We’re launching AIConfigurator with both CLI and web interfaces, and initial support for TensorRT-LLM on NVIDIA Hopper. Additional inference frameworks and NVIDIA hardware will be supported in upcoming releases.
How to consistently meet inference SLOs without over- or under-provisioning GPUs
In our May 0.2 release, we introduced the first version of Planner, a GPU autoscaling engine purpose-built for generative AI inference and PD disaggregation. By monitoring prefill queue and decode memory usage, Planner intelligently scaled inference workers up or down to maximize GPU utilization and minimize inference costs.
With the 0.4 release, we’re taking Planner a step further. We’re introducing SLO-based autoscaling, enabling inference teams to not only optimize for cost, but also to reliably meet strict performance targets like Time to First Token (TTFT) and Inter-Token Latency (ITL).
Unlike traditional, reactive scaling systems, the new SLO-based Planner takes a forward-looking approach:
- It leverages pre-deployment profiling to understand how your deployment behaves under different model parallel and batching configurations.
- It suggests the most cost-effective engine configurations based on your SLOs.
- It predicts future traffic patterns using advanced time-series models such as ARIMA or Prophet.
- It calculates the minimum number of PD workers required to meet SLA targets under predicted demand.
- It continuously assesses traffic patterns and dynamically re-adjusts PD workers to maintain target SLAs.
What sets Planner apart is its ability to forecast the impact of changes in input/output sequence lengths and proactively scale resources before bottlenecks occur.
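As a rough illustration of that loop (forecast demand, translate it into worker counts, then scale), here is a minimal sketch using statsmodels’ ARIMA and hypothetical per-worker capacity numbers; it is not the Planner implementation.

```python
# Minimal sketch of SLO-based autoscaling (hypothetical capacity numbers, not Planner's code):
# forecast near-term request rate, convert it into prefill/decode worker counts derived from
# profiling, and hand the result to whatever actually scales the deployment.
import math
from statsmodels.tsa.arima.model import ARIMA

# Requests/second observed over the last N intervals (would come from live metrics in practice).
request_rate_history = [40, 42, 55, 61, 58, 70, 75, 80, 78, 90, 95, 100]

def forecast_peak_rate(history, steps_ahead=3) -> float:
    model = ARIMA(history, order=(1, 1, 1)).fit()
    return max(model.forecast(steps=steps_ahead))

def required_workers(peak_rps: float,
                     prefill_rps_per_worker: float = 25.0,   # from pre-deployment profiling
                     decode_rps_per_worker: float = 40.0):   # at the chosen parallelism
    return (math.ceil(peak_rps / prefill_rps_per_worker),
            math.ceil(peak_rps / decode_rps_per_worker))

peak = forecast_peak_rate(request_rate_history)
prefill_workers, decode_workers = required_workers(peak)
print(f"forecast peak ~{peak:.0f} req/s -> {prefill_workers} prefill / {decode_workers} decode workers")
```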
SLO-based Planner allows inference teams to:
- Stay in control of user experience and infrastructure spend
- Maintain SLA performance without over- or under-provisioning resources
- Optimize GPU usage without manual tuning
Watch the demo video below to see Planner in action:
Planner integrates natively with Kubernetes, making it easy for organizations that have standardized on containerized infrastructure to deploy Dynamo and use Planner to scale their AI workloads. This release includes Planner support for vLLM, with support for additional inference frameworks coming in future updates.
How to track real-time inference observability metrics
Observability is critical in large-scale distributed inference environments, where latency, throughput, and GPU utilization must be continuously optimized in real time. It enables engineering teams to monitor system health, diagnose performance bottlenecks, and meet strict SLOs.

With this release, Dynamo workers and components across the event, control, and data planes now emit key observability metrics, including:
- Average requests per second and request duration
- Average time to first token (TTFT) and inter-token latency (ITL)
- Average input and output sequence length
- GPU utilization and power usage
These metrics are collected using the open source Prometheus toolkit and can be easily consumed in open source monitoring and observability tools like Grafana without custom development efforts.
This release also includes an API for engineering teams and solution architects to define and emit custom metrics tailored to their serving environments, providing further flexibility and extensibility.
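As an example of the kind of signal a team might add alongside the built-in metrics, here is a small stand-alone sketch using the open source prometheus_client Python library. The metric names are hypothetical and this is not Dynamo’s custom-metrics API; it simply shows how such metrics end up on a Prometheus-scrapable endpoint that Grafana can chart.

```python
# Stand-alone example of emitting custom Prometheus metrics with prometheus_client
# (illustrative only; metric names are hypothetical and this is not Dynamo's metrics API).
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

TTFT_SECONDS = Histogram("demo_time_to_first_token_seconds",
                         "Time to first token per request")
INPUT_TOKENS = Histogram("demo_input_sequence_length_tokens",
                         "Input sequence length per request")
ACTIVE_REQUESTS = Gauge("demo_active_requests",
                        "Requests currently being served")

def handle_request(prompt_tokens: int) -> None:
    ACTIVE_REQUESTS.inc()
    INPUT_TOKENS.observe(prompt_tokens)
    with TTFT_SECONDS.time():                   # records the elapsed time as a TTFT sample
        time.sleep(random.uniform(0.05, 0.2))   # stand-in for prefill + first token
    ACTIVE_REQUESTS.dec()

if __name__ == "__main__":
    start_http_server(9400)                     # metrics scrapeable at http://localhost:9400/metrics
    while True:
        handle_request(prompt_tokens=random.randint(128, 8192))
```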
This observability foundation in Dynamo 0.4 sets the stage for upcoming releases, which will introduce more granular, use case–specific metrics, including for PD disaggregation.
How Dynamo 0.4 enhances resiliency and early fault detection
Deploying frontier reasoning MoE models at scale requires multinode environments that can span hundreds of GPUs. In these setups, a failure in any software or hardware component, no matter how brief, can interrupt the operation of the entire system and result in delays and failed user requests—disrupting business operations and hurting customer experience.
The Dynamo 0.4 release introduces fault tolerance and resilience features, including inflight request re-routing. In previous versions, requests sent to offline GPUs would fail and bounce back to higher layers of the inference stack or to the end user. This triggered retries that repeated pre-processing steps such as tokenization and embeddings, wasting compute and increasing latency. With this update, Dynamo re-routes requests inflight, preserving intermediate computations and forwarding them directly to online GPUs, eliminating redundant work.
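The sketch below (hypothetical names, not Dynamo’s router code) illustrates the idea: tokenization happens once, and when a worker turns out to be offline, the same tokenized request is forwarded to the next online worker instead of failing back to the client.

```python
# Conceptual sketch of inflight re-routing (hypothetical names, not Dynamo's router code).
# Tokenization happens once; if a worker goes offline mid-dispatch, the same tokenized
# request is forwarded to another online worker instead of failing back to the client.

class WorkerOffline(Exception):
    pass

def tokenize(prompt: str) -> list[int]:
    return [ord(c) for c in prompt]          # stand-in for real tokenization

def dispatch(tokens: list[int], workers: list) -> str:
    last_error = None
    for worker in workers:                   # ordered by the router's scoring
        try:
            return worker(tokens)            # may raise if the worker has gone offline
        except WorkerOffline as err:
            last_error = err                 # keep the tokenized request, try the next worker
    raise RuntimeError("no online workers available") from last_error

def flaky_worker(tokens):   raise WorkerOffline("GPU node lost")
def healthy_worker(tokens): return f"generated reply from {len(tokens)} prompt tokens"

tokens = tokenize("Why is the sky blue?")    # pre-processing done exactly once
print(dispatch(tokens, [flaky_worker, healthy_worker]))
```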

Additionally, this release introduces faster failure detection. In prior versions, etcd—a key component of Dynamo’s control plane—was responsible for detecting offline workers and broadcasting that state across the system. However, this added several seconds of delay during which requests could still be routed to offline workers. The new release introduces early failure detection within the Dynamo smart router, allowing it to bypass etcd and react to critical health signals. This reduces the detection-to-recovery window and significantly cuts down on failed requests.
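A simplified way to picture the change: the router keeps its own record of each worker’s most recent health signal and skips anything stale right away, rather than waiting for the control plane to broadcast the failure. The sketch below is a hypothetical illustration, not the smart router’s implementation.

```python
# Simplified illustration of router-local failure detection (hypothetical, not Dynamo's code).
# Each worker's last health signal is tracked locally; anything stale is skipped immediately,
# without waiting for the control plane to broadcast the failure.
import time

HEALTH_TIMEOUT_S = 2.0
last_heartbeat: dict[str, float] = {}        # worker id -> time of last health signal

def record_heartbeat(worker_id: str) -> None:
    last_heartbeat[worker_id] = time.monotonic()

def online_workers() -> list[str]:
    now = time.monotonic()
    return [w for w, ts in last_heartbeat.items() if now - ts < HEALTH_TIMEOUT_S]

record_heartbeat("decode-0")
record_heartbeat("decode-1")
last_heartbeat["decode-1"] -= 5.0            # simulate a worker that stopped reporting
print(online_workers())                      # -> ['decode-0']; requests skip decode-1 right away
```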
Back to the basics: What happens during inference when you ask an LLM a question?
If you want to revisit the fundamentals of disaggregated serving in NVIDIA Dynamo, it helps to start with what happens when you ask an LLM a question—a process called inference, which spans everything from prefill to decode and token prediction.
In this video, we break down exactly how it works, how it’s evolving and how NVIDIA Dynamo accelerates each stage. Learn how disaggregated serving splits these steps across multiple GPUs for faster, more efficient AI responses.
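For readers who prefer code to video, here is a toy sketch of those two phases: a single prefill pass over the prompt, followed by the decode loop that predicts one token at a time. Everything here is made up for illustration; there is no real model behind it.

```python
# Toy illustration of the two inference phases (no real model): prefill processes the whole
# prompt in one pass, then decode predicts one token at a time, feeding each new token back
# in until the response is complete.
import random

def prefill(prompt_tokens: list[int]) -> list[int]:
    # One pass over all prompt tokens; its attention state (KV cache) is reused by decode.
    return list(prompt_tokens)                 # stand-in for the cached context

def predict_next_token(context: list[int]) -> int:
    return random.randint(0, 31999)            # stand-in for sampling from model logits

prompt = [101, 2054, 2003, 1996, 3712, 102]    # made-up token IDs for a short question
context = prefill(prompt)
generated = []
for _ in range(5):                             # decode: one new token per iteration
    token = predict_next_token(context)
    context.append(token)                      # the new token becomes part of the context
    generated.append(token)
print(generated)
```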
How to get involved
We’re excited to keep advancing Dynamo with the help of our developer community. Catch up on past Office Hours recordings and tune in to our upcoming Office Hours to get your questions answered directly by the team.
Join our Discord community to connect with other developers, share feedback, and get support in real time. And if you’re excited about where we’re headed, check out the open source repo. We welcome contributions, issues, and ideas from the community.