NVIDIA Dynamo Platform

The NVIDIA Dynamo Platform is a high-performance, low-latency inference platform designed to serve all AI models across any framework, architecture, or deployment scale. Whether you're running image recognition on a single entry-level GPU or deploying billion-parameter reasoning large language models (LLMs) across hundreds of thousands of data center GPUs, the NVIDIA Dynamo Platform delivers scalable, efficient AI inference.

The NVIDIA Dynamo Platform includes two open-source inference serving frameworks:

NVIDIA Dynamo

NVIDIA Dynamo is an open-source, low-latency, modular inference framework for serving generative AI models in distributed environments. It enables seamless scaling of inference workloads across large GPU fleets through intelligent resource scheduling and request routing, optimized memory management, and accelerated data transfer. NVIDIA Dynamo supports all major AI inference backends and features large language model (LLM)-specific optimizations, such as disaggregated serving. When serving the open-source DeepSeek-R1 671B reasoning model on NVIDIA GB200 NVL72, NVIDIA Dynamo increased throughput, measured in tokens per second per GPU, by up to 30X. Serving the Llama 70B model on NVIDIA Hopper™, it increased throughput by more than 2X. NVIDIA Dynamo is the ideal solution for developers looking to accelerate and scale generative AI models with the highest efficiency at the lowest cost.

Get Started | Documentation

NVIDIA Dynamo-Triton

NVIDIA Dynamo-Triton, formerly NVIDIA Triton Inference Server™, standardizes AI model deployment and execution across every workload. It supports high-performance inference on both NVIDIA GPUs and x86 and Arm CPUs, and it can deploy models from all popular frameworks, including NVIDIA TensorRT™-LLM, vLLM, TensorFlow, PyTorch, Python, ONNX, RAPIDS cuML, XGBoost, scikit-learn, Random Forest, OpenVINO, custom C++, and more. Dynamo-Triton optimizes inference for multiple query types (real-time, batch, and streaming) and supports ensemble models to seamlessly connect multiple models into AI pipelines. Compatible with all major cloud and on-premises AI and MLOps platforms, NVIDIA Dynamo-Triton is ideal for developers looking to quickly build new AI-powered applications.
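
As a quick illustration of how an application talks to a running server, the sketch below sends a single inference request using the Triton Python HTTP client that Dynamo-Triton inherits. It assumes a server already listening on localhost:8000 and a hypothetical model named my_model with one FP32 input INPUT0 and one output OUTPUT0; substitute your own model's names, shapes, and data types.

```python
# Minimal sketch of a Dynamo-Triton (Triton Inference Server) client request.
# Assumes a server on localhost:8000 serving a hypothetical model "my_model"
# with one FP32 input "INPUT0" of shape [1, 16] and one output "OUTPUT0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request payload from a NumPy array.
data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", [1, 16], "FP32")
infer_input.set_data_from_numpy(data)

# Request the output tensor and run inference.
infer_output = httpclient.InferRequestedOutput("OUTPUT0")
response = client.infer(model_name="my_model",
                        inputs=[infer_input],
                        outputs=[infer_output])
print(response.as_numpy("OUTPUT0"))
```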

Get Started | Documentation


See NVIDIA Dynamo in Action

Distributed Inference 101 Video Series

Getting Started With NVIDIA Dynamo

Watch Video

KV Cache-Aware Smart Router With NVIDIA Dynamo

Watch Video

Disaggregated Serving With NVIDIA Dynamo

Watch Video

Monitoring Data Center Performance and Metrics

Watch Video

Managing KV Cache to Reduce Inference Latency

Watch Video

How NVIDIA Dynamo Works

Models are becoming larger and more integrated into AI workflows that require interaction with multiple models. Deploying these models at scale involves distributing them across multiple nodes, requiring careful coordination across GPUs. The complexity increases with inference optimization methods like disaggregated serving, which splits the prefill and decode phases of a response across different GPUs, adding challenges in coordination and data transfer.

NVIDIA Dynamo addresses the challenges of distributed and disaggregated inference serving. It includes four key components:

  • GPU Resource Planner: A planning and scheduling engine that monitors capacity and prefill activity in multi-node deployments to adjust GPU resources and allocate them across prefill and decode.

  • Smart Router: A KV-cache-aware routing engine that efficiently directs incoming traffic across large GPU fleets in multi-node deployments to minimize costly re-computations (see the illustrative sketch after this list).

  • Low Latency Communication Library: A state-of-the-art inference data transfer library that accelerates the transfer of KV cache between GPUs and across heterogeneous memory and storage types.

  • KV Cache Manager: A cost-aware KV cache offloading engine designed to transfer KV cache across various memory hierarchies, freeing up valuable GPU memory while maintaining user experience.
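
To make the Smart Router idea concrete, here is a purely illustrative Python sketch, not NVIDIA Dynamo's actual API: it scores each worker by how long a prefix of the incoming prompt it already holds in KV cache and by its current load, then routes the request to the best match so that less prefill work has to be recomputed.

```python
# Illustrative sketch of KV-cache-aware routing (conceptual only; this is not
# NVIDIA Dynamo's API). Each worker tracks which token prefixes it has cached;
# the router favors workers with the longest cached prefix and the lowest load.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: set[tuple[int, ...]] = field(default_factory=set)

def cached_prefix_len(worker: Worker, tokens: list[int]) -> int:
    """Length of the longest prompt prefix this worker already holds in KV cache."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in worker.cached_prefixes:
            return n
    return 0

def route(workers: list[Worker], tokens: list[int]) -> Worker:
    # More cache overlap means less prefill to recompute; lower load means less
    # queueing delay. The weighting below is arbitrary, for illustration only.
    def score(w: Worker) -> float:
        return cached_prefix_len(w, tokens) - 0.5 * w.active_requests
    chosen = max(workers, key=score)
    chosen.active_requests += 1
    chosen.cached_prefixes.add(tuple(tokens))  # worker now has this prefix cached
    return chosen

workers = [Worker("gpu-0"), Worker("gpu-1")]
print(route(workers, [1, 2, 3, 4]).name)      # first request: load-based choice
print(route(workers, [1, 2, 3, 4, 5]).name)   # prefix overlap favors the same worker
```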

A flowchart of how NVIDIA Dynamo works

Watch the recording to learn about NVIDIA Dynamo’s key components and architecture and how they enable seamless scaling and optimized inference in distributed environments.

Quick-Start Guide

Learn the basics for getting started with NVIDIA Dynamo, including how to deploy a model in a disaggregated server setup and how to launch the smart router.

Get Started

Introductory Blog

Read about how NVIDIA Dynamo helps simplify AI inference in production, the tools that help with deployments, and ecosystem integrations.

Read Blog

Deploy LLM Inference With NVIDIA Dynamo and vLLM

NVIDIA Dynamo supports all major backends, including vLLM. Check out the tutorial to learn how to deploy with vLLM.
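
Once a Dynamo deployment with a vLLM backend is running, requests typically go through an OpenAI-compatible HTTP frontend. The sketch below sends one chat completion request; the host, port, endpoint path, and model name are assumptions for illustration, so use the values from the tutorial and your own deployment.

```python
# Minimal sketch of querying a Dynamo + vLLM deployment through an
# OpenAI-compatible HTTP frontend. Host, port, path, and model name below are
# illustrative assumptions; substitute the values your deployment exposes.
import requests

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",  # hypothetical served model
    "messages": [
        {"role": "user", "content": "Summarize disaggregated serving in one sentence."}
    ],
    "max_tokens": 128,
    "stream": False,
}

resp = requests.post("http://localhost:8000/v1/chat/completions",
                     json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```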

Read Docs

Get Started With NVIDIA Dynamo

Find the right license to deploy, run, and scale AI inference for any application on any platform.

Download Code for Development

NVIDIA Dynamo and NVIDIA Dynamo-Triton are available as open-source software on GitHub with end-to-end examples.

Purchase NVIDIA AI Enterprise

NVIDIA Dynamo-Triton is available with enterprise-grade support, security, stability, and manageability with NVIDIA AI Enterprise. NVIDIA Dynamo will be included in NVIDIA AI Enterprise for production inference in a future release.

Get a free license to try NVIDIA AI Enterprise in production for 90 days using your existing infrastructure.


Starter Kits

Access technical content on inference topics like prefill optimizations, decode optimizations, and multi-GPU inference.

Multi-GPU Inference

Many models have grown so large that they can no longer fit on a single GPU. Deploying these models involves distributing them across multiple GPUs and nodes. This kit shares key optimization techniques for multi-GPU inference.

Prefill Optimizations

When a user submits a request to a large language model, the model processes the full prompt in the prefill phase, building a key-value (KV) cache that encodes its contextual understanding of the request. This process is computationally intensive and requires specialized optimizations. This kit presents essential KV cache optimization techniques for inference.

Decode Optimizations

Once the LLM generates the KV cache and the first token, it moves into the decode phase, where it autoregressively generates the remaining output tokens. This kit highlights key optimization techniques for the decoding process.
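
As a rough illustration of how the two phases relate, the NumPy sketch below implements a toy single-head attention with a KV cache (dimensions and weights are arbitrary, and this is not any particular framework's implementation): prefill computes keys and values for the whole prompt once, and each decode step appends just one new key/value pair instead of reprocessing the entire sequence.

```python
# Toy single-head attention with a KV cache (illustration only; weights and
# dimensions are random, not a real model). Prefill builds K/V for the whole
# prompt once; each decode step appends a single new K/V pair.
import numpy as np

d = 16                                      # hidden size of the toy model
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)           # (1, seq_len)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # (1, d)

# Prefill: process the whole prompt once and cache its keys and values.
prompt = rng.standard_normal((8, d))        # 8 prompt "tokens"
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: generate tokens one at a time, reusing and extending the cache.
x = prompt[-1:]                             # stand-in for the latest token
for _ in range(4):
    q = x @ Wq
    K_cache = np.vstack([K_cache, x @ Wk])  # append one new key
    V_cache = np.vstack([V_cache, x @ Wv])  # append one new value
    x = attend(q, K_cache, V_cache)         # next "token" representation
print("cached keys:", K_cache.shape)        # grows by one row per decode step
```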


More Resources

Explore Developer Discord

Get Training and Certification

Accelerate Your Startup

Sign Up for Inference-Related Developer News

Read NVIDIA Dynamo FAQ

Join the NVIDIA Developer Program


Ethical AI

NVIDIA believes trustworthy AI is a shared responsibility, and we have established policies and practices to support the development of AI across a wide array of applications. When downloading or using this model in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI concerns here.

Get Started With NVIDIA Dynamo Today

Download Now