Enhancing AI Cloud Data Centers and NVIDIA Spectrum-X with NVIDIA DOCA 2.7

The NVIDIA DOCA acceleration framework empowers developers with extensive libraries, drivers, and APIs to create high-performance applications and services for NVIDIA BlueField DPUs and SuperNICs. DOCA 2.7 is a comprehensive, feature-rich release that extends the scope and value of the DOCA software framework. It offers several new libraries, turn-key applications, and fully featured services.

DOCA 2.7 extends the role of BlueField DPUs in offloading, accelerating, and isolating network, storage, security, and management infrastructure within the data center. The release also enhances AI cloud data centers and accelerates the NVIDIA Spectrum-X networking platform, delivering superior performance for AI workloads. 

Highlights of the release detailed in this post include:

  • Support for Spectrum-X RA 1.0.1 with BlueField-3 SuperNIC
  • DOCA PCC, DOCA Flow and OVS DOCA enhancements 
  • Updated AI cloud traffic encryption – IPsec GA, PSP support (Beta)
  • New DOCA libraries
  • DOCA services enhancements
  • New DOCA Management Service (DMS)

NVIDIA Spectrum-X RA 1.0.1 with BlueField-3 SuperNIC

DOCA 2.7 enables the NVIDIA Spectrum-X 1.0.1 (SPC-X 1.0.1) reference architecture for Ethernet AI Cloud deployments. This architecture has been rigorously tested and optimizes the functionality of BlueField SuperNICs and Spectrum SN5600 switches for accelerating and managing east-west (E-W) Ethernet traffic in AI clusters.

New features in DOCA 2.7 for BlueField-3 SuperNICs include:

  • Lossless RoCE using adaptive routing and DOCA programmable congestion control (PCC)
  • DOCA PCC optimized for AI workloads running on SPC-X 1.0.1
  • BlueField SuperNICs shipped in NIC mode as default 

This architecture is currently used in the NVIDIA Israel-1 Supercomputer, with wider adoption underway by several AI cloud CSPs.

DOCA PCC

The DOCA PCC library provides a high-level programming interface that enables you to implement your own customized congestion control (CC) algorithm. This library uses the NVIDIA BlueField-3 SuperNIC acceleration for CC management and provides an API that simplifies hardware complexity. This frees you to focus on the functionality of your CC algorithm.

DOCA PCC also provides the flexibility to develop an optimal solution to handle and avoid network congestion in clusters. Every network is different and not all can use standard off-the-shelf congestion control solutions. Customized congestion control is essential for AI workflows: it enables performance isolation, improves fairness, and keeps latency consistently low, while preventing packet drop on lossy networks.

DOCA 2.7 offers a range of features designed to optimize congestion control. These help to monitor network performance, diagnose issues, and collect telemetry data. For example, Notification Point (NP) programmability can be used to trigger alerts or actions when congestion-related events occur. DOCA 2.7 also supports multiple probe packets, which are packets sent for monitoring and telemetry, to enhance network visibility.

While these features aren’t unique to congestion control, they help to diagnose congestion-related issues and improve overall network health. In addition, other telemetry information gained from monitoring Spectrum switch port speed capacity reduces the likelihood of port oversubscription. Tracking transmitted/received (Tx/Rx) bytes at the NIC (end point) ports can help to reveal congestion patterns.
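As a rough illustration of how Tx byte counters can reveal congestion patterns, the following sketch computes link utilization between successive counter readings and flags the intervals that run close to line rate. It is not DOCA code: the `CounterSample` type, the function names, and the 90% threshold are assumptions made for this example, and counter wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a cumulative Tx byte counter (hypothetical type)."""
    timestamp_s: float  # time the counter was read, in seconds
    tx_bytes: int       # cumulative bytes transmitted up to that time


def tx_utilization(prev, curr, link_speed_bps):
    """Fraction of link capacity used between two counter samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    bits_sent = (curr.tx_bytes - prev.tx_bytes) * 8
    return bits_sent / (elapsed * link_speed_bps)


def congested_intervals(samples, link_speed_bps, threshold=0.9):
    """Indices i where utilization between samples[i-1] and samples[i]
    is at or above the threshold (an assumed value, tune per network)."""
    return [
        i
        for i in range(1, len(samples))
        if tx_utilization(samples[i - 1], samples[i], link_speed_bps) >= threshold
    ]


if __name__ == "__main__":
    link_speed = 400_000_000_000  # 400 Gb/s, chosen only for the example
    samples = [
        CounterSample(0.0, 0),
        CounterSample(1.0, 5_000_000_000),
        CounterSample(2.0, 52_500_000_000),
    ]
    print(congested_intervals(samples, link_speed))
```

In practice the samples would come from whatever telemetry source exposes the port counters. The same calculation applies to Rx bytes.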

VirtIO-net devices

DOCA 2.7 now supports up to 2K functional VirtIO-net devices for BlueField-3 DPUs. This is ideal for situations that require the availability of many low-capacity and low-usage active devices (multiple endpoints that need webpage access, for example). CSPs and organizations employing public/private clouds can use this feature to help scale multi-tenant environments.

DOCA Flow

DOCA Flow provides the building blocks to simplify the development of network applications for software-defined networking and software-defined security, and to offload, accelerate, and isolate these functions on the BlueField-3 DPU. As a steering library for the offload and acceleration of network steering pipelines, DOCA Flow enables fast pipeline programmability of SDN services.

DOCA Flow features new in DOCA 2.7 include:

  • DOCA-Connection Tracking (CT) to improve pipeline performance, efficiency, and flexibility
  • DOCA Flow Pipeline Visualization for debugging (Alpha)
  • LPM pipe enhancement to support VLAN-based traffic

DOCA Flow is central to the development of DOCA. Changes in this release focus on improving feature performance and user experience, yielding higher scale and better performance for DOCA Flow apps, and providing debugging and performance tools for DOCA Flow developers.
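For readers unfamiliar with longest-prefix match (LPM), the following sketch shows the lookup rule in plain Python: among all prefixes that contain the destination address, the most specific one wins. It does not use the DOCA Flow API. The `VlanLpmTable` class, the action strings, and the use of the VLAN ID as an extra exact-match key are assumptions for illustration, not a description of how the DOCA Flow LPM pipe is implemented.

```python
import ipaddress


class VlanLpmTable:
    """Toy LPM table keyed by VLAN ID (hypothetical, for illustration only)."""

    def __init__(self):
        self._routes = {}  # vlan_id -> list of (network, action)

    def add(self, vlan_id, prefix, action):
        network = ipaddress.ip_network(prefix)
        self._routes.setdefault(vlan_id, []).append((network, action))

    def lookup(self, vlan_id, address):
        """Return the action of the longest matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        best_net, best_action = None, None
        for net, action in self._routes.get(vlan_id, []):
            if net.version == addr.version and addr in net:
                if best_net is None or net.prefixlen > best_net.prefixlen:
                    best_net, best_action = net, action
        return best_action


if __name__ == "__main__":
    table = VlanLpmTable()
    table.add(100, "10.0.0.0/8", "to_uplink")
    table.add(100, "10.1.0.0/16", "to_tenant_a")
    table.add(200, "10.1.0.0/16", "to_tenant_b")
    print(table.lookup(100, "10.1.2.3"))  # the /16 is more specific than the /8
    print(table.lookup(200, "10.1.2.3"))  # same address, different VLAN
```

A linear scan keeps the sketch short. A hardware pipeline or a production software table would use a trie or similar structure instead.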

OVS DOCA

OVS DOCA is a highly optimized virtual switch for NVIDIA network services. It’s an extremely efficient design that promotes next-generation performance and scale using an NVIDIA NIC or DPU. Based on Open vSwitch, OVS DOCA offers the same northbound API, OpenFlow, CLI, and data interface, ensuring a drop-in replacement alternative to OVS. 

OVS DOCA enables faster implementation of future NVIDIA networking features. As a customizable service with source code available, OVS DOCA powers host-based networking (HBN) and other NVIDIA services suitable for Ethernet switching.

DOCA 2.7 includes several enhancements to further optimize OVS DOCA. For example, you can now unify representors for multiple ports. A single representor handling multiple ports reduces overhead, simplifies configuration, and improves resource utilization, which is crucial for scalability in large-scale deployments.

In addition, the inclusion of Hairpin Offload optimizes traffic flow between virtual machines (VMs) or containers on the same host. This eliminates the need to route traffic outside the physical host, reduces latency, and promotes faster data exchange, ultimately improving overall system performance.

Another new feature in DOCA 2.7, Slow Path Metering, enables monitoring and control of non-accelerated traffic. This improves security and resource optimization, and gives administrators fine-grained control to set policies for specific types of traffic, tailoring network behavior.
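Metering is commonly built on a token bucket, and the following sketch shows that general idea in plain Python. It is not the OVS DOCA implementation and makes no claim about how Slow Path Metering works internally. The `TokenBucketMeter` class and its parameters are hypothetical names chosen for this example.

```python
class TokenBucketMeter:
    """Minimal token-bucket meter (illustrative sketch only)."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s      # sustained rate allowed
        self.burst = burst_bytes          # maximum bucket depth
        self.tokens = float(burst_bytes)  # bucket starts full
        self.last_s = 0.0                 # time of the previous packet

    def allow(self, packet_bytes, now_s):
        """Return True if the packet conforms to the configured rate."""
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now_s - self.last_s)
        self.last_s = max(self.last_s, now_s)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # out of profile: a policy could drop or mark it


if __name__ == "__main__":
    meter = TokenBucketMeter(rate_bytes_per_s=1000, burst_bytes=1500)
    for size, t in [(1500, 0.0), (100, 0.05), (100, 0.2)]:
        print(t, size, meter.allow(size, t))
```

An administrator-facing policy would typically attach one such meter per traffic class, so that a burst of one type of slow-path traffic cannot starve the others.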

DOCA host-based networking (HBN)

HBN is a DOCA service that enables the network architect to design a network purely on L3 protocols. This enables routing to run on the server side of the network (instead of on switches) by using the DPU as a BGP router. The EVPN extension of BGP, supported by HBN, extends the L3 underlay network to multi-tenant environments with overlay L2 and L3 isolated networks. 

The HBN solution packages a set of network functions inside a container, which itself is packaged as a service pod that runs on the DPU. This is beneficial to bare-metal CSPs, telcos, and enterprise customers.

HBN features new in DOCA 2.7 include:

  • Single-port BlueField-3 SuperNIC support
  • GA-level support for local VRF route leaking 
  • EVPN downstream VNI (DVNI) for symmetric EVPN route leaking
  • Layer-3 VLAN subinterfaces with VRF-Lite
  • Network-to-network hairpin routing support on BlueField uplinks
  • GA-level support for Stateful ACLs over L2 VXLAN
  • Initial support for VLAN trunk on host-facing interfaces 

This update offers several immediate benefits by enabling GPU east-west (E-W) fabric use cases with single-port BlueField SuperNICs.

The DOCA 2.7 features not only improve the scalability and efficiency of shared services and Internet access for isolated tenants, but also enable BlueField DPUs to be used as EVPN overlay gateways. This provides external connectivity for multi-tenant clouds.

DOCA SNAP encryption at rest with zero-copy

The DOCA SNAP v4 service on BlueField-3 adds inline AES-XTS offloads. AES-XTS is the de facto cryptographic algorithm for protecting the confidentiality of data-at-rest on storage devices. SNAP now accelerates AES-XTS encryption in hardware, which speeds up encryption while reducing CPU overhead.

SNAP encryption of data at rest, based on AES-XTS, is now available through the SPDK API and SNAP RPC with zero-copy, meaning the stored data can be encrypted and decrypted without making an extra copy of it in memory. Typical customers include those seeking protected performance improvements using the latest generations of DDR, LPDDR, GDDR, and HBM memory interfaces.

DOCA SNAP features new in DOCA 2.7 include:

  • Availability with BlueField-3 with SNAP v4 service
  • Different encryption key per namespace using SPDK API 
  • Supported with NVMe-oF RDMA/RoCE
  • Integration with other standard and nonstandard protocols as look-aside

DOCA Firefly 

The DOCA Firefly service provides precision time synchronization services leveraging the hardware acceleration of NVIDIA DPUs. DOCA Firefly now includes industry-specific profiles to improve the user experience and simplify deployment. In addition to the existing media profile, DOCA 2.7 now offers a Telco Profile, including industry-specific functions and customized performance parameters.

This service has been adopted by customers in numerous industries, including telco, media and entertainment (M&E), and financial services (FSI). It’s currently being used to drive the strict timing requirements at MSG-Sphere.
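To give a sense of what precision time synchronization computes, the following sketch works through the standard two-way timestamp exchange used by PTP-style protocols: four timestamps yield the clock offset and the one-way path delay, assuming the path is symmetric. It is a worked example of the arithmetic only and is not part of DOCA Firefly. The function name and the timestamp values are assumptions.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and path delay from a two-way exchange.

    t1: time server sends Sync            (server clock)
    t2: client receives Sync              (client clock)
    t3: client sends Delay_Req            (client clock)
    t4: time server receives Delay_Req    (server clock)

    Assumes the forward and reverse path delays are equal.
    Returns (offset, delay): offset is client clock minus server clock.
    """
    forward = t2 - t1   # path delay + offset
    reverse = t4 - t3   # path delay - offset
    offset = (forward - reverse) / 2
    delay = (forward + reverse) / 2
    return offset, delay


if __name__ == "__main__":
    # Example in nanoseconds: client runs 500 ns ahead, path delay is 100 ns.
    offset, delay = ptp_offset_and_delay(t1=1000, t2=1600, t3=2000, t4=1600)
    print(offset, delay)
```

The accuracy of the result depends on how precisely the four timestamps are captured, which is where hardware timestamping helps.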

AI cloud traffic encryption and decryption 

DOCA 2.7 includes revisions to DOCA IPsec (now GA) and the introduction of DOCA PSP. 

DOCA, running on BlueField DPUs, can be used in several ways to improve the IPsec process while simultaneously accelerating the encryption and decryption of network traffic. New to this release, DOCA Flow now supports all IPsec modes and options, while offering full acceleration of the IPsec protocol.

Additional features include:

  • IPsec GA
  • Multi-threading support
  • Improved insertion rate 
  • API update that removes the DOCA IPsec library and merges its functionality into DOCA Flow

PSP is a new networking security protocol published by Google. This release is the first version of DOCA to support PSP, offering (in tech preview) full acceleration of the PSP protocol with DOCA Flow, plus inline PSP encapsulation and encryption/decryption in hardware. In contrast to IPsec, PSP is particularly suited for use in larger-scale AI clouds.

Example use cases for these features include: 

  • North-South AI cloud network encryption 
  • East-West AI cloud GPU-to-GPU traffic 
  • Non-AI cloud node-to-node encryption

DOCA UROM and DOCA DevEmu  

The new DOCA UROM library and service enable offloading of high-performance computing (HPC) and AI workloads. Specifically, HPC compute is performed by the host while HPC communication is accelerated and offloaded to BlueField DPUs. This helps to optimize CPU utilization, offering performance gains to AI training and inference, as well as HPC applications.

The DOCA Device Emulation Library (DOCA DevEmu) enables you to emulate a custom device on BlueField DPUs, and PCI-connect to it from the host. This offers several advantages, the most significant being access to other features linked to offloading or acceleration, but without requiring the host application to use the DOCA APIs directly.

DOCA Comm Channel for DPU

DOCA Comm Channel offers enhanced hardware-isolated communication between an untrusted host client app and BlueField software services. It enables innovative security and storage offload services.

DOCA Management Service 

New to DOCA 2.7, DOCA Management Service is a DOCA service that simplifies BlueField post-boot provisioning and configuration using standard configuration interfaces (API/CLI).

Key benefits:

  • Provides the same API for all tools, removing the need to know all the tools and their different syntax.
  • Removes the need for deep knowledge of low-level hardware details to configure NVIDIA network adapters.
  • Uses industry-standard configuration interfaces (CLI and API) and data models (like gRPC/gNMI and OpenConfig) to ensure better interoperability and ease of integration.
  • Simplifies the automation of DPU management tasks with a robust API designed for seamless integration with external automation systems and tools. 

To learn more about additional DOCA Platform upgrades, see the DOCA 2.7 release notes.

Summary

The NVIDIA DOCA framework enables the rapid creation and management of applications and services on top of the BlueField networking platform, leveraging industry-standard APIs. With DOCA, developers can deliver breakthrough networking, security, and storage performance by harnessing the power of NVIDIA BlueField DPUs and SuperNICs. 

The new features in DOCA 2.7 extend its broader value by strengthening the functionality and benefits offered by BlueField DPUs and SuperNICs in AI cloud data centers. The recent enhancements not only help to deliver superior performance for AI workloads, but also add extended security and networking capabilities. In combination, these improvements provide a robust platform for developers. DOCA 2.7 also underpins the NVIDIA Spectrum-X reference architecture with BlueField-3 SuperNIC.

Download NVIDIA DOCA to begin your development journey with all the benefits DOCA has to offer.
