NVIDIA Grove is an open-source Kubernetes API that defines the structure and lifecycle of single- and multi-node AI inference workloads, such as those deployed with NVIDIA Dynamo, while enabling them to scale efficiently in Kubernetes-based environments. Purpose-built for orchestrating large-scale AI workloads with complex requirements in GPU clusters, Grove lets developers describe multi-component workloads, including specific roles, dependencies, multi-level scaling rules, and startup order, within a single custom resource. Grove is a modular component of NVIDIA Dynamo, but it can also be deployed as a standalone solution or integrated into other high-performance inference frameworks.

How NVIDIA Grove Works

High-performance inference frameworks use Grove's hierarchical APIs to express role-specific logic and multi-level scaling, enabling consistent, optimized deployment across diverse cluster environments. Grove achieves this by orchestrating multi-component AI workloads using three hierarchical custom resources in its workload API.

PodCliques represent groups of Kubernetes pods with specific roles, such as prefill worker, decode leader, or frontend service, each with independent configuration and scaling logic.

PodCliqueScalingGroups bundle tightly coupled PodCliques that must scale together, like the prefill leader and prefill workers that need coordinated scaling behavior.

PodCliqueSets define the entire multi-component workload, specifying startup order, scaling policies, and gang-scheduling constraints that ensure all components start together or fail together. When scaling for additional capacity, Grove creates complete replicas of the entire PodCliqueSet and defines spread constraints that distribute these replicas across the cluster for high availability, while keeping each replica's components network-packed for optimal performance.
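
To make the three-level hierarchy concrete, the following Python sketch models it with plain dataclasses. It is a conceptual illustration only, not Grove's actual custom resource schema: the class fields (`replicas`, `starts_after`, `clique_names`) and the helper methods are hypothetical names chosen for this example. It shows how scaling at each level multiplies the pod count, and how a startup order can be derived from declared dependencies.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class PodClique:
    """Illustrative stand-in for a role-specific group of pods."""
    name: str
    replicas: int
    starts_after: list[str] = field(default_factory=list)  # hypothetical field


@dataclass
class PodCliqueScalingGroup:
    """Illustrative stand-in for cliques that scale together as one unit."""
    name: str
    clique_names: list[str]
    replicas: int = 1


@dataclass
class PodCliqueSet:
    """Illustrative stand-in for the whole multi-component workload."""
    name: str
    cliques: list[PodClique]
    scaling_groups: list[PodCliqueScalingGroup] = field(default_factory=list)
    replicas: int = 1

    def pods_per_replica(self) -> int:
        # Cliques inside a scaling group are multiplied by the group's replicas.
        multiplier = {}
        for group in self.scaling_groups:
            for clique_name in group.clique_names:
                multiplier[clique_name] = group.replicas
        return sum(c.replicas * multiplier.get(c.name, 1) for c in self.cliques)

    def total_pods(self) -> int:
        # Scaling the set replicates the entire workload.
        return self.replicas * self.pods_per_replica()

    def startup_order(self) -> list[str]:
        # Each clique maps to the set of cliques that must start before it.
        graph = {c.name: set(c.starts_after) for c in self.cliques}
        return list(TopologicalSorter(graph).static_order())


workload = PodCliqueSet(
    name="inference",
    cliques=[
        PodClique("prefill-leader", replicas=1),
        PodClique("prefill-worker", replicas=3, starts_after=["prefill-leader"]),
        PodClique("frontend", replicas=1, starts_after=["prefill-worker"]),
    ],
    scaling_groups=[
        PodCliqueScalingGroup("prefill", ["prefill-leader", "prefill-worker"], replicas=2)
    ],
    replicas=2,
)

print(workload.pods_per_replica())  # 9
print(workload.total_pods())        # 18
print(workload.startup_order())     # ['prefill-leader', 'prefill-worker', 'frontend']
```

In this example the prefill scaling group doubles both the leader and its workers together, and the set-level replica count then doubles the whole workload, which mirrors the multi-level scaling described above.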

A Grove-enabled Kubernetes cluster requires the Grove operator and a scheduler that understands PodGang resources, such as the KAI Scheduler.



When a PodCliqueSet resource is created, Grove's operator validates the specification and automatically generates the necessary Kubernetes resources, including the constituent PodCliques, PodCliqueScalingGroups, and associated services, secrets, and autoscaling policies. The Grove operator then creates PodGang resources that translate workload requirements into scheduling constraints for the scheduler. Each PodGang contains PodGroups with minimum replica guarantees, network topology packing requirements for performance, and spread constraints for availability, achieving topology-aware placement and efficient resource utilization across the cluster.
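
The translation from workload requirements to scheduling constraints can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `PodGroup` and `PodGang` classes, the `pack_within` and `spread_across` fields, and the `build_pod_gangs` helper are hypothetical and do not reproduce Grove's real resource definitions. The sketch also assumes one PodGang per workload replica, which simplifies what the operator does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodGroup:
    """Hypothetical model: pods of one role plus their minimum replica guarantee."""
    name: str
    min_replicas: int


@dataclass(frozen=True)
class PodGang:
    """Hypothetical model: the unit the scheduler places all-or-nothing."""
    name: str
    groups: tuple[PodGroup, ...]
    pack_within: str    # assumed topology domain for packing, e.g. "rack"
    spread_across: str  # assumed topology domain for spreading, e.g. "zone"


def build_pod_gangs(workload_name, clique_min_replicas, replicas,
                    pack_within="rack", spread_across="zone"):
    """Emit one PodGang per workload replica (a simplifying assumption)."""
    gangs = []
    for index in range(replicas):
        groups = tuple(
            PodGroup(f"{workload_name}-{index}-{role}", minimum)
            for role, minimum in clique_min_replicas.items()
        )
        gangs.append(
            PodGang(f"{workload_name}-{index}", groups, pack_within, spread_across)
        )
    return gangs


gangs = build_pod_gangs("inference", {"prefill": 2, "decode": 4}, replicas=2)
for gang in gangs:
    required = sum(group.min_replicas for group in gang.groups)
    print(gang.name, required)
# inference-0 6
# inference-1 6
```

Each gang carries its own packing and spreading hints, so a scheduler could keep one replica's pods close together while placing different replicas in different failure domains.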



The scheduler watches for these PodGang resources and applies gang-scheduling logic, ensuring all required components are scheduled together or not at all, while optimizing placement based on GPU cluster topology. This process results in coordinated deployment of multi-component AI stacks where prefill services, decode workers, and routing components start in the correct order with optimal network placement, preventing resource deadlocks and partial deployments that waste resources in the cluster.
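
The all-or-nothing behavior of gang scheduling can be sketched in a few lines of Python. This is a toy first-fit placement over GPU counts, not the algorithm used by any real scheduler, and it ignores network topology entirely; the function name `schedule_gang` and the node and pod names are hypothetical. The point it illustrates is that placement is computed on a tentative copy of cluster state and committed only if every pod in the gang fits.

```python
def schedule_gang(free_gpus, pod_demands):
    """Place every pod in the gang, or place none of them.

    free_gpus:   dict of node name -> free GPU count (mutated only on success)
    pod_demands: dict of pod name -> GPUs required
    Returns a dict of pod name -> node name, or None if the gang does not fit.
    """
    tentative = dict(free_gpus)  # work on a copy so failure leaves no trace
    placement = {}
    # Place the largest requests first; a simple first-fit-decreasing heuristic.
    for pod, gpus in sorted(pod_demands.items(), key=lambda item: -item[1]):
        node = next((n for n in sorted(tentative) if tentative[n] >= gpus), None)
        if node is None:
            return None  # one pod cannot be placed, so the whole gang is rejected
        tentative[node] -= gpus
        placement[pod] = node
    free_gpus.update(tentative)  # commit only after every pod has a node
    return placement


cluster = {"node-a": 8, "node-b": 4}
print(schedule_gang(cluster, {"prefill": 4, "decode": 8}))
# {'decode': 'node-a', 'prefill': 'node-b'}
print(cluster)
# {'node-a': 0, 'node-b': 0}

cluster = {"node-a": 8, "node-b": 4}
print(schedule_gang(cluster, {"prefill": 8, "decode": 8}))
# None
print(cluster)
# {'node-a': 8, 'node-b': 4}
```

In the second call the first pod would fit on its own, but because the gang as a whole does not, nothing is reserved. This is how gang scheduling avoids the partial deployments and resource deadlocks mentioned above.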

