
Introducing NVIDIA Fleet Intelligence for Real-Time GPU Fleet Visibility and Optimization

The compute capability of large GPU fleets presents unprecedented opportunities to innovate and deliver value to customers in record time. Yet these advances come with a variety of challenges. At scale, teams are juggling heterogeneous hardware, fast-moving software stacks, tight power envelopes, and spiky, multitenant workloads. A single hotspot, misconfigured driver, or subtle hardware fault can ripple outward, causing throttled jobs, missed SLAs, and wasted spend.

Moreover, the sheer number and complexity of components in large-scale clusters can be daunting, so it's essential to maintain visibility into day-to-day operations and understand the operational state at any given time. Monitoring GPU utilization and identifying bottlenecks during job execution becomes harder at this scale. Finding pockets of low utilization and migrating workloads to them is one of the best ways to maximize return on investment.

For these reasons, GPU-aware monitoring is essential at scale. Teams need visibility beyond whether a node is up: they need to know that, at any given moment, every accelerator is performing as expected, safely and consistently.

This post introduces NVIDIA Fleet Intelligence, an agent-based managed service for continuous monitoring of NVIDIA data center GPUs, now generally available.

What are the key focus areas of GPU monitoring?

Important areas of GPU monitoring include power, temperature, performance, health, and uniform configuration.

  • Power: Track power utilization and throttling to stay within data center budgets while maximizing performance per watt.
  • Temperature: Detect hotspots and airflow issues early to avoid thermal throttling and premature component aging.
  • Performance: Watch utilization, memory bandwidth, interconnect health, and throttling reasons to spot regressions and imbalance across the fleet.
  • Health: Surface ECC and XID errors, retired pages, HBM/NVLink/PCIe anomalies, and other RAS signals to catch failing parts before they fail.
  • Uniform configuration and integrity: As part of GPU inventory validation, check for consistent drivers, firmware, and BIOS settings to ensure reproducible results and safe operation, as well as verify firmware integrity.
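The five focus areas above map directly onto fields you can already query with `nvidia-smi`. As a minimal sketch (the query field names are real `nvidia-smi` fields, but the particular selection here is an illustrative choice, not what Fleet Intelligence itself collects):

```python
import subprocess

# One representative nvidia-smi query field per focus area.
FIELDS = [
    "power.draw",                              # power (W)
    "temperature.gpu",                         # temperature (C)
    "utilization.gpu",                         # performance (%)
    "ecc.errors.uncorrected.volatile.total",   # health
    "driver_version",                          # uniform configuration
]

def parse_gpu_csv(csv_text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows

def query_gpus() -> list[dict]:
    """Run nvidia-smi on a host that has NVIDIA GPUs and a driver installed."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Sample output so the parser can be exercised without a GPU present:
sample = "312.4, 67, 98, 0, 550.54.15\n298.1, 64, 95, 0, 550.54.15"
print(parse_gpu_csv(sample)[0]["temperature.gpu"])  # prints 67
```

A fleet agent would collect signals like these continuously rather than on demand, which is exactly the gap Fleet Intelligence addresses.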

What is NVIDIA Fleet Intelligence?

NVIDIA Fleet Intelligence is a low-level, deployment-agnostic managed service that works regardless of software stack or scheduler choice. Initially, the service supports data center GPU and CPU customers managing their own infrastructure, as well as engineers who need deeper insight into GPU and CPU behavior.

The service leverages technology and IP from across the NVIDIA product portfolio, along with learnings from running the NVIDIA fleet of hundreds of thousands of GPUs across NVIDIA DGX Cloud.

Fleet Intelligence uses a low-footprint, host-based agent to stream GPU telemetry back to the fully managed Fleet Intelligence cloud service. NVIDIA is releasing the Fleet Intelligence agent as an open source project for auditability. The agent leverages other NVIDIA open source solutions such as GPUd, NVIDIA Data Center GPU Manager (DCGM), and the NVIDIA Attestation SDK. To learn more, visit NVIDIA/fleet-intelligence-agent on GitHub. Fleet Intelligence has been developed with feedback from early access (EA) customers, including the NVIDIA Cloud Partners (NCPs) Lambda and IREN.

This GA release focuses on three main areas: 

  • Inventory and visualization
  • Reporting, alerts, and health checks
  • Integrity and attestation 

Inventory and visualization

Fleet Intelligence offers rich capabilities for visualizing global fleet inventory across data centers and clouds. A minimal-footprint agent is installed on the GPU worker nodes through Linux package managers or a Helm chart.
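The two installation paths might look something like the following. The package and chart names below are placeholders, not the published artifact names; check the Fleet Intelligence documentation and the NVIDIA/fleet-intelligence-agent repository for the real ones.

```shell
# Bare-metal or VM nodes: install through the distribution package manager.
# (hypothetical package name)
sudo apt-get install fleet-intelligence-agent

# Kubernetes clusters: deploy the agent on every GPU worker node with Helm.
# (hypothetical chart reference and namespace)
helm install fleet-intelligence-agent <chart-repo>/fleet-intelligence-agent \
  --namespace fleet-intelligence --create-namespace
```

On Kubernetes, an agent of this kind is typically packaged as a DaemonSet so one instance runs on each GPU node.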

Once enrolled, the agent captures node-level information, which is displayed in the Health portal hosted on NVIDIA NGC. As a user, you can view your GPU fleet utilization globally or by compute zone, a grouping of nodes enrolled in the same physical or cloud location.
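The zone roll-up shown in the portal amounts to grouping enrolled nodes by compute zone and aggregating their metrics. A minimal sketch, with assumed field names (`zone`, `gpu_utilization`):

```python
from collections import defaultdict

def utilization_by_zone(nodes: list[dict]) -> dict[str, float]:
    """Average GPU utilization per compute zone across enrolled nodes."""
    zones = defaultdict(list)
    for node in nodes:
        zones[node["zone"]].append(node["gpu_utilization"])
    return {zone: sum(v) / len(v) for zone, v in zones.items()}
```

The same grouping extends naturally to other metrics (power, temperature, error counts) and to percentile rather than mean aggregation.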

At any level of the infrastructure, anomalies are immediately surfaced, whether from errors or from power-consumption or temperature thresholds being crossed, with direct access to detailed information about what triggered the alert.
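Conceptually, surfacing a threshold anomaly is a comparison of each telemetry sample against a limit. The thresholds and field names below are illustrative assumptions, not the rules Fleet Intelligence ships:

```python
from dataclasses import dataclass

# Illustrative limits only; real rules would be per-SKU and configurable.
THRESHOLDS = {"power_watts": 700.0, "temperature_c": 85.0}

@dataclass
class Anomaly:
    node: str
    metric: str
    value: float
    limit: float

def surface_anomalies(samples: list[dict]) -> list[Anomaly]:
    """Flag any sample whose power or temperature crosses its threshold."""
    anomalies = []
    for s in samples:
        for metric, limit in THRESHOLDS.items():
            if s[metric] > limit:
                anomalies.append(Anomaly(s["node"], metric, s[metric], limit))
    return anomalies
```

Keeping the triggering sample alongside the limit is what makes the "what triggered this alert" drill-down possible.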

Reporting, alerts, and health checks

The Fleet Intelligence agent leverages technology from GPUd and DCGM. Metrics provided by both tools are analyzed and communicated back to the Health Service for review. The agent enables Fleet Intelligence to monitor the health of the fleet in near real time and to execute periodic health checks. It collects telemetry on the host, GPUs, NVLink, and networking to provide a holistic picture of overall system health.

As signals are collected, the service analyzes errors in the context of current state and history to provide recommendations on remediation actions. The agent is read-only, will not make modifications to host configuration, and only collects machine telemetry and state data. To verify the data collected, you can write sample output locally or review the source code from the public repo.  
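The "write sample output locally" verification step amounts to dumping a telemetry snapshot to a file on the host so operators can inspect exactly what would leave the machine. A sketch of that idea (the agent's actual local-output format and mechanism are defined in its repository, not here):

```python
import json
import os
import time

def write_sample_snapshot(telemetry: dict, directory: str) -> str:
    """Write one telemetry snapshot to a local JSON file for inspection.

    Read-only with respect to host configuration: the only write is the
    agent's own output file, mirroring the agent's read-only guarantee.
    """
    path = os.path.join(directory, f"snapshot-{int(time.time())}.json")
    with open(path, "w") as f:
        json.dump(telemetry, f, indent=2, sort_keys=True)
    return path
```

Comparing such a local snapshot against the data shown in the portal is a straightforward way to confirm what is being collected.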

You can also opt in to receive alert messages in the event of an error or failure through email, Slack, and other channels, and configure custom alerts for low utilization thresholds or other areas of interest. You can configure reports to view inventory and historical graphs of power consumption, temperature trends, errors, and downtime.

The Fleet Intelligence agent employs passive health checks as well as periodic checks. These build on checks already available through DCGM and GPUd, and new checks derived from learnings operating NVIDIA's own fleets are added as they become available. Fleet Intelligence continuously gathers anonymized signals and other metadata around faults and errors across the install base. This approach yields higher-fidelity data for the predictive failure categorization models planned for future releases.
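A representative passive check is scanning kernel logs for NVIDIA Xid events, the driver's error codes for GPU faults. The regular expression below matches the `NVRM: Xid (...)` lines the NVIDIA driver emits to the kernel log; how the agent actually consumes and classifies these events is up to GPUd/DCGM, so treat this as a sketch of the concept:

```python
import re

# Matches driver log lines like:
#   NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def extract_xids(log_lines: list[str]) -> list[tuple[str, int]]:
    """Return (PCI address, Xid code) pairs found in kernel log lines."""
    events = []
    for line in log_lines:
        m = XID_RE.search(line)
        if m:
            events.append((m.group(1), int(m.group(2))))
    return events
```

Passive checks like this catch faults as they occur; the periodic checks complement them by actively probing for problems that never surface in a log.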

Integrity and attestation 

Leveraging technology from NVIDIA Confidential Computing solutions, Fleet Intelligence cryptographically verifies GPU integrity, ensuring the authenticity and trustworthiness of your system. The Fleet Intelligence agent uses the Attestation SDK to obtain measurements from the GPU (or “evidence”) at run time. These measurements are then digitally signed using on-device certificates based on NVIDIA root of trust. 

The evidence is then sent to NVIDIA Remote Attestation Service (NRAS) over a secure channel for verification. The NRAS service leverages NVIDIA Reference Integrity Manifests (RIMs), which are structures generated as part of vBIOS builds. The NRAS service validates that the evidence matches the expected values and returns a pass/fail to the Fleet Intelligence service.  
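Stripped of cryptography, the pass/fail decision NRAS makes is a comparison of the measured evidence against the golden values in the RIM. The toy sketch below shows only that comparison; real verification also validates the signature chain on the evidence back to the NVIDIA root of trust, which this deliberately omits:

```python
def verify_evidence(evidence: dict[str, str], rim: dict[str, str]) -> bool:
    """Pass only if every measurement the RIM expects is present and matches.

    `evidence` maps measurement names to observed values; `rim` maps the
    same names to the golden values generated at vBIOS build time.
    A missing or mismatched measurement fails the whole check.
    """
    return all(evidence.get(name) == golden for name, golden in rim.items())
```

The all-or-nothing shape matters: a single drifted measurement is enough to flag the GPU, which is what makes the daily integrity checks a meaningful fleet-wide signal.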

You can then view your inventory dashboards and see the resulting integrity checks that are run daily or on demand. These integrity checks ensure every GPU in the fleet has known-good configuration that is untampered with and up to date. You can also create Fleet Intelligence reports that detail GPU Fleet information with the current integrity status. These can be downloaded and used with other reporting tools.

According to Chuan Li, Chief Scientific Officer at Lambda, “NVIDIA Fleet Intelligence gave Lambda’s research team end-to-end visibility across our NVIDIA Blackwell/Hopper GPU fleet with minimal setup. Its alerts catch both active failures and early warning signs. Its reports turn fleet-wide health into actionable insights.” 

Get started with NVIDIA Fleet Intelligence 

The NVIDIA Fleet Intelligence service provides comprehensive insight into the power, temperature, performance, health, and configuration of NVIDIA GPU and CPU fleets, ensuring that every chip operates with optimal efficiency and reliability. Low-footprint agents for real-time telemetry, combined with robust visualization and alert mechanisms, empower enterprises to maximize ROI and maintain optimal operational standards.

The open source Fleet Intelligence agent and the incorporation of cutting-edge integrity and attestation technologies underscore the NVIDIA commitment to transparency and security. As businesses continue to scale GPU and CPU deployments, Fleet Intelligence provides essential tools for navigating the complexities of modern data centers, ensuring sustainable and predictable performance across diverse environments. 

Request access to NVIDIA Fleet Intelligence and experience firsthand how it can improve the availability and integrity of your GPU fleet. It is now generally available and offered at no cost to NVIDIA data center GPU owners, operators, and cloud tenants. Fleet Intelligence supports NVIDIA data center-class GPU architectures Vera Rubin, Blackwell, and Hopper. Attestation is only supported on Vera Rubin and Blackwell.
