
North–South Networks: The Key to Faster Enterprise AI Workloads


In AI infrastructure, data fuels the compute engine. With evolving agentic AI systems, where multiple models and services interact, fetch external context, and make decisions in real time, enterprises face the growing challenge of moving massive amounts of data quickly, intelligently, and reliably. Whether it is loading a model from persistent storage, retrieving knowledge to support a query, or orchestrating agentic tool use, data movement is central to AI performance.

GPU-to-GPU (east-west) communication has long been a focus of optimization. However, equally critical are the north-south networks—handling model loading, storage I/O, and inference queries—where performance bottlenecks can directly impact the responsiveness of AI systems.

NVIDIA Enterprise Reference Architectures (Enterprise RAs) guide organizations in deploying AI factories that make effective use of north-south networks. They are design recipes for building scalable, secure, and high-performing AI factories. By providing clear, validated pathways for deploying complex AI infrastructure, Enterprise RAs distill NVIDIA’s extensive experience into actionable recommendations, from server and network configurations to software stacks and operational best practices.

Among the many components of Enterprise RAs, NVIDIA Spectrum-X Ethernet deserves particular attention for its role in accelerating north-south data flows, especially when paired with NVIDIA BlueField-3 DPUs (data processing units) for data-intensive AI use cases.

Legacy Ethernet storage networks, not built for the scale, data flows, and sensitivity of accelerated AI and HPC workloads, often introduce latency and congestion that degrade performance. Every time an AI model checkpoints its progress mid-training, it moves massive amounts of data across north-south pathways to persistent storage (learn how the NVIDIA-Certified Storage Program complements the Enterprise RA program). These checkpoint files, which can span several terabytes for today’s billion-parameter models, ensure that progress isn’t lost when systems go down. 
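As a concrete illustration of that checkpoint traffic, the following is a minimal sketch of a training checkpoint written to networked persistent storage. It assumes a PyTorch training loop and a hypothetical mount point (/mnt/ai-factory-storage) backed by the storage system reached over the north-south fabric; names and paths are illustrative, not prescriptive.

```python
import os
import torch

# Hypothetical mount point for networked persistent storage reached over the north-south fabric.
CHECKPOINT_DIR = "/mnt/ai-factory-storage/checkpoints"

def save_checkpoint(model, optimizer, step):
    """Write a full training checkpoint to networked persistent storage."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"step_{step:08d}.pt")
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )
    return path
```

In a real training loop this would run every N steps, and the model and optimizer state for today’s largest models can reach terabytes per checkpoint, which is why the bandwidth of the storage path matters so much.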

Inference workloads rely just as heavily on north-south efficiency. When an AI agent retrieves data, whether it’s embeddings from a retrieval-augmented generation (RAG) vector database or external context from a tool or database for a customer query, it depends on fast, low-latency north-south connectivity. As enterprises shift from static, one-shot inference to dynamic, multi-turn, multi-agent inference, north-south networking demands grow by another order of magnitude, because agents continuously ingest, process, and update data as they interact with users, external sources, and cloud services.

With NVIDIA Spectrum-X Ethernet providing accelerated data movement in Enterprise RAs, these networks become lossless AI data storage and movement fabrics—purpose-built for the performance demands of modern AI workloads. This enterprise-ready architecture enables the creation of AI factories optimized for predictable, high-throughput, low-latency data access, unlocking the full potential of modern AI workflows.

Converged networking: a simplified foundation for enterprise AI workloads

Enterprise AI factories are often built to address a defined set of use cases, with networks typically starting in the range of 4 to 16 server nodes. At this scale, a converged design that consolidates east-west traffic (such as compute) and north-south traffic (such as storage and external services) into a unified switch fabric helps streamline operations. This design reduces complexity by minimizing cabling and hardware sprawl while maintaining consistent, high-throughput performance across training, inference, and retrieval workloads. But a converged east-west/north-south network requires a fabric that can deliver sufficient bandwidth and quality of service (QoS) to support both types of traffic.

Spectrum-X Ethernet, which sits at the heart of Enterprise RAs, plays a key role. While originally optimized for east-west GPU-to-GPU and node-to-node communication, it delivers bandwidth and performance benefits to north-south networks and the storage data path by using adaptive routing and telemetry to prevent congestion, increase throughput, and reduce latency during AI runtime and retrieval-heavy workloads. 

Equally important are Spectrum-X Ethernet capabilities like virtual routing and forwarding (VRF) service separation and quality of service (QoS) traffic prioritization. VRFs logically segment east-west communication from north-south traffic, such as user ingress or storage access, without requiring physical network segmentation. QoS marks Ethernet frame or IP packet headers so that specific traffic is prioritized depending on the use case (for example, storage traffic over HTTPS user traffic). These mechanisms are further reinforced by advanced features such as noise isolation, which ensure consistent performance when multiple AI agents or workloads are running concurrently on shared infrastructure.
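To make the QoS idea concrete, here is a minimal sketch, assuming a Linux host and an illustrative DSCP value, of how an application-side socket can mark its IP packets so the fabric can map storage traffic to a higher-priority queue. In practice, marking and trust policies are usually applied by the switches and DPUs rather than by individual applications, and the endpoint below is hypothetical.

```python
import socket

# DSCP occupies the upper six bits of the IP TOS/Traffic Class byte.
DSCP_STORAGE = 26            # illustrative class; real values are defined by network policy
TOS_VALUE = DSCP_STORAGE << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Mark outgoing IP packets on this socket so switches can map them to a priority queue.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)
sock.connect(("storage.example.internal", 2049))  # hypothetical storage endpoint
```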

It’s important to note that while convergence is well-suited for enterprise-scale AI factories, it isn’t a one-size-fits-all approach. In large-scale, multi-tenant environments, such as those operated by NVIDIA Cloud Partners (NCPs), a disaggregated model with physically separate networks may be preferred to ensure the highest effective bandwidth and enable stricter isolation between tenants and traffic types.

Converged networking is a deliberate design choice that aligns with the enterprise-scale use case, performance, and manageability needs of dedicated AI infrastructure. Enterprise RAs break down the complex task of determining the optimal network architecture for a specific use case, offering guidance that spans small foundation clusters up to larger deployments scaling to 1K GPUs.

Understanding the role of NVIDIA Ethernet SuperNICs and BlueField-3 DPUs

To understand how networking is orchestrated in an AI factory, it’s helpful to distinguish between the roles of NVIDIA Ethernet SuperNICs and DPUs. NVIDIA SuperNICs are built specifically to handle the east-west traffic that dominates GPU-to-GPU communication. Designed for hyperscale AI environments, they deliver up to 800 Gb/s of bandwidth per GPU, ensuring ultra-fast data connectivity during distributed training and inference. 

Complementing this, BlueField-3 DPUs take charge of north-south traffic. BlueField-3 offloads, accelerates, and separates tasks such as storage management, telemetry, and network security from the host CPU, freeing up valuable compute resources for core AI processing. In effect, it acts as a specialized cloud infrastructure processor that ensures data moves efficiently between the AI factory and its external ecosystem, including networked storage. 

Together, SuperNICs and BlueField-3 DPUs form a powerful symphony of AI networking. SuperNICs fuel and route the internal computations of the AI factory, while BlueField-3 DPUs ensure that external data feeds arrive smoothly and at scale. This dual approach enables enterprises to optimize performance across all layers of their AI infrastructure.

The enterprise impact: vector databases and real-time retrieval

A relatable example of north-south networking is found in the growing adoption of agentic AI and RAG systems. Architectures such as the NVIDIA RAG 2.0 Blueprint extend the capabilities of large language models (LLMs) by integrating external knowledge such as documents, images, logs, and videos. The RAG Blueprint uses NVIDIA NeMo Retriever and NVIDIA NIM microservices to embed, index, and retrieve this content using vector databases, providing more accurate and contextually relevant responses.

When a user submits a query, it is converted into a vector embedding, which is used to rapidly search a vector database such as Milvus, residing on external storage, for the most relevant embedded context. This interaction hinges on fast, low-latency north-south data flow. The sooner the system retrieves and integrates this external knowledge, the faster and more precise its response. A converged Spectrum-X Ethernet network optimizes this data path, ensuring minimal latency and maximum throughput as models fetch embeddings in real time.
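As a hedged illustration of that retrieval step, the sketch below queries a Milvus collection over the network with the pymilvus client. The endpoint, collection name, and vector dimension are assumptions, and the query embedding would normally come from an embedding model such as NeMo Retriever rather than the placeholder used here.

```python
from pymilvus import MilvusClient

# Hypothetical Milvus endpoint reachable over the north-south fabric.
client = MilvusClient(uri="http://milvus.example.internal:19530")

# In practice, this vector is produced by an embedding model (e.g., NeMo Retriever);
# a zero vector of an assumed dimension stands in to keep the sketch self-contained.
query_embedding = [0.0] * 1024

results = client.search(
    collection_name="enterprise_docs",   # assumed collection of embedded documents
    data=[query_embedding],
    limit=5,                             # top-5 most relevant chunks
    output_fields=["text"],
)

for hit in results[0]:
    print(hit["distance"], hit["entity"]["text"])
```

Every call like this crosses the north-south path from the GPU server to external storage and back, so its latency lands directly in the user-visible response time.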

Figure 1. Step-by-step flow of a RAG-enhanced LLM user query through the NVIDIA Spectrum-X Ethernet networking platform

Let’s examine the north-to-south user-compute-storage flow:

  1. User query ingress (user to Internet to leaf): A user prompt or task flows into the AI factory through an ingress gateway, hits the leaf switch, and descends into the cluster. Enterprise RAs streamline this path with Spectrum-X Ethernet, reducing the time to first token (TTFT) for applications relying on external data and avoiding manual network configuration tuning.
  2. Request routed to GPU server (leaf to GPU via DPU): The request is directed to a GPU node by the leaf switch, where a BlueField-3 DPU handles packet parsing, offloads the networking stack, and routes the query to the correct inference engine (e.g., NVIDIA NIM). The request flows across the leaf-spine Spectrum-X Ethernet fabric using adaptive routing to avoid congestion. Spectrum-X Ethernet uses real-time switch state, such as queue occupancy, to dynamically keep traffic flowing efficiently, similar to how a map app reroutes you around traffic jams.
  3. External context fetch (server to leaf to spine to leaf to storage): For context queries (e.g., vector databases), the request flows through the leaf-spine fabric via RoCE (RDMA over Converged Ethernet) to an NVMe-based storage system. Spectrum-X Ethernet features seamless interoperability and optimized performance for AI workloads accessing data on partner platforms such as DDN, VAST Data, and WEKA, delivering up to 1.6x faster storage performance.
  4. Data returned to GPU (storage to leaf to spine to leaf to server): The relevant vectors and embedded content are returned over the same converged fabric via RoCE. Spectrum-X Ethernet enables this path to be congestion-aware, with the DPU handling packet reordering to keep the GPU fed efficiently. Here, QoS markings can ensure that latency-sensitive storage data is prioritized, especially when many AI agents are querying multiple tools over north-south traffic.
  5. LLM inference and final response (GPU to leaf to user): With both the original prompt and relevant external context in memory, the GPU completes inference (a minimal sketch of this request follows the list). The final response is routed upward and exits the infrastructure back to the user application. VRF-based network isolation keeps storage, inference, and user traffic logically independent, ensuring stable performance at scale.
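To tie the steps together, here is a minimal sketch of the final inference request from step 5. It assumes a NIM microservice exposing an OpenAI-compatible chat completions endpoint at a hypothetical internal URL, with an example model name; substitute the details of your own deployment.

```python
import requests

# Hypothetical NIM endpoint inside the AI factory; adjust host and model to your deployment.
NIM_URL = "http://nim.example.internal:8000/v1/chat/completions"

def answer_with_context(question, retrieved_chunks):
    """Send the user prompt plus retrieved context to an OpenAI-compatible NIM endpoint."""
    context = "\n\n".join(retrieved_chunks)
    payload = {
        "model": "meta/llama-3.1-8b-instruct",  # example model; substitute the one you deploy
        "messages": [
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
        "max_tokens": 512,
    }
    resp = requests.post(NIM_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

The retrieved_chunks argument is the context returned in step 4, so the quality and speed of the final answer depend directly on how quickly the north-south path delivered that data.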

In environments where several AI agents operate concurrently—collaborating to solve complex tasks or serving multiple user queries—efficient north-south networking prevents bottlenecks and maintains a fluid, responsive system. By streamlining these retrieval processes, enterprises unlock faster decision-making and improved user experiences. Whether in customer support chatbots, financial advisory tools, or internal knowledge management platforms, agentic AI and RAG architectures powered by efficient north-south networks deliver tangible business value.

In conclusion, AI workloads are no longer confined to massive training clusters tucked away in isolated environments. Increasingly, they are embedded into the fabric of everyday enterprise operations, requiring seamless interaction with data lakes, external services, and user-facing applications. In this new paradigm, north-south networks are making a comeback as the heroes of AI factories. With the combined strengths of NVIDIA Spectrum-X Ethernet, NVIDIA BlueField, and thoughtful NVIDIA Enterprise RA-based designs, organizations can ensure their AI factories are resilient, performant, and ready to scale as AI workloads evolve.

For additional information about solutions based on NVIDIA Enterprise RAs, please consult your NVIDIA-Certified partner for tailored deployment strategies.
