NVIDIA Grove

NVIDIA Grove is an open-source Kubernetes API that defines the structure and lifecycle of single- and multi-node AI inference workloads, such as those deployed with NVIDIA Dynamo, while enabling them to scale efficiently in Kubernetes-based environments. 

Purpose-built for orchestrating large-scale AI workloads with complex requirements in GPU clusters, Grove lets developers describe multi-component workloads—including specific roles, dependencies, multi-level scaling rules, and startup order—within a single custom resource. Grove is a modular component of NVIDIA Dynamo, but it can also be deployed as a standalone solution or integrated into other high-performance inference frameworks.

Get Started

Documentation


How NVIDIA Grove Works

High-performance inference frameworks use Grove's hierarchical APIs to express role-specific logic and multi-level scaling, enabling consistent, optimized deployment across diverse cluster environments. Grove achieves this by orchestrating multi-component AI workloads using three hierarchical custom resources in its workload API.

  • PodCliques represent groups of Kubernetes pods with specific roles, such as prefill worker, decode leader, or frontend service, each with independent configuration and scaling logic.

  • PodCliqueScalingGroups bundle tightly coupled PodCliques that must scale together, like the prefill leader and prefill workers that need coordinated scaling behavior.

  • PodCliqueSets define the entire multi-component workload, specifying startup order, scaling policies, and gang-scheduling constraints that ensure all components start together or fail together. When scaling for additional capacity, Grove creates complete replicas of the entire PodCliqueSet and defines spread constraints that distribute these replicas across the cluster for high availability, while keeping each replica's components network-packed for optimal performance. A sketch of how these three resources compose follows this list.
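To make the hierarchy concrete, here is a minimal, illustrative PodCliqueSet manifest for a disaggregated inference stack. It is a sketch only: the grove.io/v1alpha1 API group and field names such as cliques, startsAfter, and podCliqueScalingGroupConfigs are assumptions drawn from the concepts above, so consult the API reference for the authoritative schema.

```yaml
# Illustrative sketch only: field names are assumptions based on the
# concepts described above, not the authoritative Grove schema.
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: inference-stack
spec:
  replicas: 1                       # complete replicas of the whole stack
  template:
    cliques:
      - name: frontend              # PodClique: routing/frontend role
        spec:
          replicas: 1
          podSpec: {}               # a standard Kubernetes pod spec goes here
      - name: prefill-leader
        spec:
          replicas: 1
          startsAfter: [frontend]   # startup ordering between cliques
          podSpec: {}
      - name: prefill-worker
        spec:
          replicas: 4
          startsAfter: [prefill-leader]
          podSpec: {}
      - name: decode
        spec:
          replicas: 2
          startsAfter: [frontend]
          podSpec: {}
    podCliqueScalingGroupConfigs:
      - name: prefill               # PodCliqueScalingGroup: leader and workers
        cliqueNames: [prefill-leader, prefill-worker]   # scale as one unit
```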


A Grove-enabled Kubernetes cluster requires the Grove operator and a scheduler that understands PodGang resources, such as the KAI Scheduler.

When a PodCliqueSet resource is created, Grove's operator validates the specification and automatically generates the necessary Kubernetes resources, including the constituent PodCliques, PodCliqueScalingGroups, and associated services, secrets, and autoscaling policies. The Grove operator then creates PodGang resources that translate workload requirements into scheduling constraints for the scheduler. Each PodGang contains PodGroups with minimum replica guarantees, network topology packing requirements for performance, and spread constraints for availability, achieving topology-aware placement and efficient resource utilization across the cluster.
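For intuition, an operator-generated PodGang might look roughly like the following. This is a hedged sketch: the scheduler.grove.io API group and field names are assumptions used to illustrate the scheduling contract, not the exact resource definition.

```yaml
# Hedged sketch of an operator-generated PodGang; the API group and
# field names are assumptions, not the authoritative schema.
apiVersion: scheduler.grove.io/v1alpha1
kind: PodGang
metadata:
  name: inference-stack-0            # one gang per PodCliqueSet replica
spec:
  podGroups:
    - name: prefill-worker
      minReplicas: 4                 # gang guarantee: all four pods or none
    - name: decode
      minReplicas: 2                 # scheduled together with the rest of the gang
  # Network topology packing (keep a replica's pods close together) and
  # spread constraints (distribute replicas for availability) also live here.
```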

The scheduler watches for these PodGang resources and applies gang-scheduling logic, ensuring all required components are scheduled together or not at all, while optimizing placement based on GPU cluster topology. This process results in coordinated deployment of multi-component AI stacks where prefill services, decode workers, and routing components start in the correct order with optimal network placement, preventing resource deadlocks and partial deployments that waste resources in the cluster.

Quick-Start Guide

Deploy your first AI inference workload using PodCliqueSets, PodCliques, and PodCliqueScalingGroups, going from installation to running disaggregated inference on Kubernetes in minutes.

Why Grove: The Orchestration and Scaling Problem

Read how Grove transforms complex AI inference workloads from dozens of YAML files and manual coordination into single, declarative custom resource definitions (CRDs) with built-in intelligence.

Discover More About Grove

Read the complete API reference, advanced configuration options, and detailed guides for deploying Grove in production environments.


Get Started With NVIDIA Grove

Install Grove on Kubernetes and run your first multi-component AI workload.


Get Grove Running on Your Cluster

Installing Grove deploys the Grove operator and creates the CRDs for PodCliqueSets, PodCliques, and PodCliqueScalingGroups, along with the controllers that manage workloads and generate scheduling constraints.

Install via Helm Charts

Install With Make Targets
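As a rough example of the Helm path, an install might look like the following. The chart reference and namespace are placeholders, not the real values; use the commands from the Grove repository's installation docs.

```bash
# Placeholder chart reference: substitute the chart location documented
# in the Grove repository.
helm install grove <grove-chart-reference> \
  --namespace grove-system \
  --create-namespace

# Verify the operator is running and the Grove CRDs are registered.
kubectl get pods --namespace grove-system
kubectl get crds | grep -i grove
```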

NVIDIA Grove Deep Dive

Take a Deep Dive Into NVIDIA Grove

Learn what Grove solves, its key capabilities, and how it enables declarative workload definition with easy-to-use, high-level APIs for scheduler-level optimizations. 

Watch Video

NVIDIA Grove Starter Kits

Disaggregated Serving

Disaggregated inference separates model serving into specialized components (prefill, decode, routing) based on their different needs. This kit explores the architectural patterns and orchestration challenges of disaggregated serving.

Scheduling AI Workloads

Workload scheduling plays a critical role throughout the entire AI lifecycle, from initial model training to inference. This kit covers advanced scheduling concepts essential for high-performance AI workloads in the context of the KAI Scheduler.

Inference Optimization

Maximizing AI inference performance requires understanding and applying advanced optimization techniques across hardware and software. This kit covers different approaches to achieve optimal throughput and latency in production environments.


NVIDIA Grove Learning Library

Tech Blog

Advanced AI Workload Scheduling at Scale With KAI Scheduler

NVIDIA KAI Scheduler

Get a technical overview of the KAI Scheduler, its value for machine learning teams, and its scheduling cycle and actions.

Video

Introduction to NVIDIA Dynamo

NVIDIA Dynamo

Learn about NVIDIA Dynamo’s key components and architecture and how they enable seamless scaling and optimized inference in distributed environments.

Documentation

Deploy Dynamo Workloads Across Multiple Nodes With Grove

NVIDIA Dynamo

Learn how to deploy multi-node NVIDIA Dynamo workloads using Grove's API for topology-optimized inference at scale.

Tech Blog

Understanding Gang-Scheduling in Practice

NVIDIA KAI Scheduler

Learn core gang-scheduling concepts through Ray workloads, including queue creation, job submission, and priority-based preemption.

Video

KV Cache-Aware Smart Router With NVIDIA Dynamo

NVIDIA Dynamo 

Explore how NVIDIA Dynamo can reduce time to first token and request latency with key-value (KV) cache-aware smart routing.


More Resources


Join the Discord Community


Get Training and Certification


Join the NVIDIA Developer Program


Ethical AI

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety and Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Get started with NVIDIA Grove today.

Get Started