As AI data centers rapidly evolve into AI factories, traditional network monitoring methods are no longer sufficient. Workloads continue to grow in complexity and infrastructures scale rapidly, making real-time, high-frequency insights critical. The need for effective system monitoring has never been greater.
This post explores how high-frequency sampling and advanced telemetry techniques can address these challenges and optimize AI workloads. The approach improves AI workload performance on the NVIDIA Spectrum-X Ethernet fabric and gives users proactive incident management capabilities.
What is AI fabric telemetry?
In the context of an AI factory, telemetry is the collection, transmission, and analysis of data related to system performance, resource usage, and other operational metrics. This real-time data is critical for managing and optimizing AI workloads.
Telemetry plays a critical role in understanding the performance and health of AI infrastructure. AI models, especially LLMs training at scale, depend heavily on high-performance computing resources. In turn, these resources rely on seamless data transfer between components such as GPUs, CPUs, and storage systems.
Why does streaming network telemetry matter for AI?
Traditional network monitoring relies on polling-based techniques, querying devices at fixed intervals (every few seconds or minutes, for example). This approach often misses short-lived anomalies or transient network issues, which can severely impact the performance of AI workloads.
As a result, traditional monitoring systems offer significantly reduced granularity and visibility. Because these transient issues occur within milliseconds and disappear just as quickly, they often go undetected. They can disrupt AI workloads, LLM operations, and inference traffic without leaving any trace, wasting valuable GPU cycles, increasing processing time, and reducing overall system efficiency.
In contrast, modern telemetry streams data continuously at high frequencies, providing granular, real-time visibility into network performance. This enables proactive incident management rather than reactive troubleshooting, and scales to handle data generated by hundreds or even thousands of nodes and GPUs.
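The difference between polling and streaming can be illustrated with a minimal sketch. The function and data below are hypothetical, not a NetQ or DTS API: a 3 ms microburst is visible in a 1 ms telemetry stream but invisible to a fixed-interval poll.

```python
def detect_transients(samples, threshold):
    """Return indices of samples whose value exceeds the threshold.

    `samples` is a list of (timestamp_ms, queue_depth) pairs, e.g. from a
    high-frequency telemetry stream. Hypothetical helper for illustration.
    """
    return [i for i, (_, depth) in enumerate(samples) if depth > threshold]

# A microburst lasting ~3 ms: a poll at t=0 and t=99 sees nothing,
# while a 1 ms stream captures every point of the spike.
stream = [(t, 5000 if 42 <= t <= 44 else 10) for t in range(0, 100)]
polled = [stream[0], stream[-1]]

print(detect_transients(stream, threshold=1000))  # [42, 43, 44]
print(detect_transients(polled, threshold=1000))  # [] -- polling misses it
```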
This is particularly critical in AI workloads, where milliseconds matter and synchronization across GPUs is key to performance.
Moreover, examining issues from a network perspective alone may fail to reveal such anomalies. An AI-focused, holistic monitoring approach yields the best results in these intricate environments.
Providing visibility into RDMA networks
One of the most crucial needs for AI system telemetry is visibility into Remote Direct Memory Access (RDMA) networks. RDMA technology accelerates data transfers by allowing direct memory access between systems without involving the CPU. This greatly improves throughput and reduces latency, enabling AI workloads to run faster and better utilize fabric capacity.
For RDMA to perform at its best, it relies on a truly lossless network. However, AI workloads are sensitive to network issues, and even small inefficiencies can cascade into overall performance degradation. Recent research shows that RDMA is particularly sensitive to network issues, which directly impact GPU training efficiency. For more details, see RDMA over Ethernet for Distributed AI Training at Meta Scale and RoCE Networks for Distributed AI Training at Scale.
AI workloads—especially those using the NVIDIA Collective Communications Library (NCCL)—rely on low-latency, high-throughput, and lossless communication between nodes. Issues like jitter, congestion, or packet loss can significantly degrade model training or inference performance.
High-frequency telemetry is essential because it enables operators to:
- Detect issues such as packet loss or hardware faults in real time
- Maintain SLA guarantees through proactive troubleshooting
- Optimize network utilization, resource allocation, and load balancing
- Ensure scalability across large clusters with data-driven decision-making
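A basic building block behind several of these capabilities is converting monotonically increasing device counters into rates. The sketch below is an illustrative helper (not a NetQ or DTS API) that also handles counter resets:

```python
def counter_rate(samples):
    """Convert cumulative counter samples [(t_seconds, count), ...] into
    per-interval rates, handling counter resets (count decreasing).
    Hypothetical helper for illustration.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        # On a reset, assume the counter restarted from zero.
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append((t1, delta / (t1 - t0)))
    return rates

samples = [(0, 0), (1, 1200), (2, 2400), (3, 100)]  # reset between t=2 and t=3
print(counter_rate(samples))  # [(1, 1200.0), (2, 1200.0), (3, 100.0)]
```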
Without proper telemetry, detecting these hidden RDMA-related problems—such as congestion, microsecond-level latency spikes, and packet loss—is not possible. Telemetry systems, therefore, are essential for identifying and diagnosing these issues in real time, ensuring that AI models can be trained and deployed without unnecessary slowdowns or bottlenecks.
In addition, performance profiling of AI workloads can be done with unprecedented granularity, enabling analysis of communication patterns, production performance, and more.
NVIDIA Spectrum-X Ethernet AI fabric with built-in telemetry
NVIDIA Spectrum-X Ethernet is an Ethernet-based AI fabric purpose-built for high-performance AI workloads. It unifies the data center fabric with the complex AI workloads running across large-scale AI factories.
Spectrum-X Ethernet tightly integrates:
- NVIDIA Spectrum SN5000 Series Ethernet switches
- NVIDIA BlueField-3 SuperNICs and NVIDIA ConnectX-8 SuperNICs
- NVIDIA Hopper, Blackwell, and Rubin GPUs
- NVIDIA NetQ telemetry and analytics platform
- Cumulus Linux network operating system

Each component contributes its own telemetry data, collectively providing a holistic view of the fabric’s health and performance. To provide a few examples:
- DOCA Telemetry Service (DTS) SuperNIC telemetry may show the number of packets marked for adaptive routing.
- The switch telemetry may show the actual routing decisions made.
- NetQ collects, correlates, and visualizes this data through a single-pane-of-glass interface, presenting it to the user informed by NVIDIA AI expertise. In effect, it converts raw telemetry data from various sources into actionable AI insights.
With this insight-oriented telemetry system, identifying and resolving network issues becomes intuitive and accessible.
Open standards for extensibility
To be effective, a telemetry system must be open and vendor-neutral. Spectrum-X Ethernet supports:
- OpenTelemetry interface
- gRPC Network Management Interface (gNMI)
This enables:
- Interoperability with a variety of third-party tools
- Extensibility across heterogeneous environments
- Comprehensive metric coverage for deep root cause analysis
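To make the OpenTelemetry interface concrete, the sketch below hand-builds a minimal gauge payload in the OTLP/JSON shape (resource attributes, scope metrics, data points). The metric and attribute names are illustrative assumptions, and this is a hand-rolled sketch rather than the official OpenTelemetry SDK:

```python
def make_otlp_gauge(name, value, unit, attrs):
    """Build a minimal OTLP/JSON-style gauge payload for one data point.

    Field names follow the OpenTelemetry protocol's JSON encoding; metric
    and attribute names are hypothetical.
    """
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [
                {"key": k, "value": {"stringValue": v}} for k, v in attrs.items()
            ]},
            "scopeMetrics": [{
                "metrics": [{
                    "name": name,
                    "unit": unit,
                    "gauge": {"dataPoints": [{"asDouble": value}]},
                }],
            }],
        }],
    }

payload = make_otlp_gauge("interface.rx.bandwidth", 387.5, "Gbit/s",
                          {"switch": "leaf01", "port": "swp1"})
```

Because the payload follows a vendor-neutral encoding, any OTLP-aware collector or third-party tool can consume it, which is the practical meaning of interoperability here.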
Example 1: Troubleshooting an LLM workload
To refine Spectrum-X Ethernet telemetry, NVIDIA engineers continuously run stress tests, using real AI workloads, on the NVIDIA Israel-1 supercomputer.
Figures 2-5 show a Grafana dashboard built on PromQL queries. Grafana connects to NetQ and queries its time-series databases using PromQL. NetQ collects real-time OTLP telemetry metrics from Spectrum Ethernet switches, GPUs, NICs, hosts (through DTS), and SLURM AI workloads. Grafana then acts as the visualization layer, connecting to the NetQ PromQL API to create dashboards and graphs that display these metrics.
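A query in this style can be sketched as follows. The endpoint path follows the standard Prometheus HTTP API that PromQL-compatible backends expose; the base URL and the metric name are placeholder assumptions, not documented NetQ identifiers:

```python
from urllib.parse import urlencode

def promql_range_query(base_url, query, start, end, step):
    """Build a Prometheus-compatible range-query URL, similar to what a
    Grafana panel issues against a PromQL API. Base URL is a placeholder.
    """
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"

# Per-port receive bandwidth in bits/s over 1-minute windows
# (hostname and metric name are illustrative):
url = promql_range_query(
    "https://netq.example.com",
    'rate(interface_rx_bytes_total{hostname="leaf01"}[1m]) * 8',
    start=1700000000, end=1700003600, step="30s",
)
print(url)
```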
A long-running LLM workload was expected to utilize maximum bandwidth. However, telemetry showed sharp, irregular drops in effective bandwidth (Figure 2).

On further inspection, BlueField-3 DTS (through NetQ) showed high roce_adp_retrans counters, indicating RoCE packet retransmissions (Figure 3).
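Flagging this kind of anomaly amounts to thresholding the growth rate of a cumulative counter. The sketch below is illustrative: the sample values and the threshold are made up, and only the counter name comes from the telemetry shown above.

```python
def flag_retrans_spikes(samples, max_per_sec=100):
    """Flag intervals where a cumulative retransmission counter (such as
    roce_adp_retrans) grows faster than max_per_sec.

    `samples` are (t_seconds, cumulative_count) pairs; the threshold and
    data are illustrative values, not a DTS default.
    """
    spikes = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rate = (c1 - c0) / (t1 - t0)
        if rate > max_per_sec:
            spikes.append((t1, rate))
    return spikes

samples = [(0, 0), (1, 10), (2, 5000), (3, 5010)]
print(flag_retrans_spikes(samples))  # [(2, 4990.0)] -- spike at t=2
```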

Switch telemetry pinpointed symbol errors on a specific port (swp11s1) on a spine switch (Figure 4).
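Localizing the fault is a simple scan once per-port counters are available: on a healthy link, the symbol-error counter should not advance at all. The counter deltas below are invented for illustration; only the port naming style matches the example above.

```python
def find_faulty_ports(port_errors, min_errors=1):
    """Given {port: symbol_error_delta} over a monitoring window, return
    the ports accumulating symbol errors. Data is illustrative.
    """
    return sorted(p for p, e in port_errors.items() if e >= min_errors)

deltas = {"swp11s0": 0, "swp11s1": 734, "swp11s2": 0, "swp11s3": 0}
print(find_faulty_ports(deltas))  # ['swp11s1']
```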

After disabling the faulty port, bandwidth usage fully recovered, confirming the root cause (Figure 5).

As this example shows, the Spectrum-X telemetry solution collects raw telemetry data from the AI fabric through multiple sources and transforms it into actionable insights.
Example 2: Detecting misconfigurations in the fabric
Another use case for the Spectrum-X Ethernet telemetry solution is constant, granular monitoring of fabric traffic to detect misconfigurations. In a healthy Spectrum-X Ethernet fabric, RoCE traffic is evenly balanced across spine/leaf links, and this can be observed at a granular level in NetQ thanks to Spectrum-X Ethernet real-time telemetry. If this even distribution is not seen on the fabric, we can conclude that there are configuration differences between leaf switches, as this is the only likely cause of such an imbalance.
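One simple way to quantify such an imbalance is the coefficient of variation of per-uplink traffic volumes. The sketch below is illustrative: the link names, byte counts, and threshold are assumptions, not values from the fabric described above.

```python
import statistics

def uplink_imbalance(link_bytes, cv_threshold=0.05):
    """Return (cv, imbalanced) for per-uplink traffic volumes.

    With adaptive routing, traffic across leaf-to-spine links should be
    near-uniform, so a large coefficient of variation (stdev / mean)
    suggests a configuration difference. Threshold is illustrative.
    """
    vals = list(link_bytes.values())
    cv = statistics.pstdev(vals) / statistics.mean(vals)
    return cv, cv > cv_threshold

balanced = {"spine1": 100, "spine2": 101, "spine3": 99, "spine4": 100}
skewed   = {"spine1": 160, "spine2": 80, "spine3": 80, "spine4": 80}
print(uplink_imbalance(balanced))  # low cv, not imbalanced
print(uplink_imbalance(skewed))    # high cv, imbalanced
```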

Get started with NVIDIA Spectrum-X Ethernet
AI workloads demand extremely high performance, minimal latency, and maximum reliability. Only a holistic telemetry approach spanning across the application, GPUs, SuperNICs, and switch fabric can provide the real-time insights needed to meet those demands.
With Spectrum-X Ethernet, NVIDIA delivers an integrated solution that ensures AI infrastructure is efficient, predictable, and resilient. Learn more about the NVIDIA Spectrum-X Ethernet networking platform.