Accelerating Bare Metal Kubernetes Workloads, the Right Way

This post was originally published on the Mellanox blog.

In my previous Kubernetes post, Provision Bare-Metal Kubernetes Like a Cloud Giant!, I discussed the benefits of using BlueField DPU-based programmable SmartNICs to simplify provisioning of Kubernetes clusters in bare-metal infrastructures. A key takeaway from that post was the rapid shift toward bare-metal Kubernetes for delivering high-performance workloads across public, on-premises, and edge environments.

This topic is still trending as we see telco giants AT&T, Verizon, and China Mobile, among others, investing in bare-metal cloud infrastructures to deliver enhanced digital experiences.

With KubeCon San Diego coming up next week, and following the introduction of the new ConnectX-6 Dx DPU-based SmartNICs and BlueField-2 DPU-based programmable SmartNICs, this post provides an update on our path to delivering high-throughput, low-latency Kubernetes network solutions at scale.

Accelerating cloud-native ML/AI applications

Kubernetes plays an important role in the emerging ML/AI application ecosystem, as new applications are built from the ground up as microservices. Mellanox ConnectX and BlueField DPU-based SmartNICs offer in-hardware acceleration for remote direct memory access (RDMA) and RDMA over Converged Ethernet (RoCE) communications, delivering best-in-class AI application performance and usability.

Partnering with the Kubernetes Network Plumbing working group, our team has successfully delivered an open-source, generic CNI and device plugin for attaching SR-IOV network interfaces with RDMA support to a Kubernetes pod.
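
To make the attachment model concrete, here is a minimal Python sketch that builds a pod manifest requesting one SR-IOV interface with RDMA support. It assumes a Multus-style setup, in which the `k8s.v1.cni.cncf.io/networks` annotation selects a secondary network and the device plugin advertises virtual functions as an extended resource. The network name `sriov-rdma-net`, the resource name `example.com/sriov_rdma`, and the container image are hypothetical placeholders, not values from this post.

```python
import json

# Hypothetical names, used for illustration only.
NETWORK_NAME = "sriov-rdma-net"            # name of a NetworkAttachmentDefinition
RESOURCE_NAME = "example.com/sriov_rdma"   # extended resource advertised by the device plugin


def build_rdma_pod(name, image, vf_count=1):
    """Return a pod manifest (as a dict) that requests SR-IOV virtual functions."""
    resources = {RESOURCE_NAME: str(vf_count)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Asks the meta-plugin to attach the secondary network to the pod.
            "annotations": {"k8s.v1.cni.cncf.io/networks": NETWORK_NAME},
        },
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # RDMA applications typically need to pin memory.
                    "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
                    # Extended resources must have equal requests and limits.
                    "resources": {"requests": resources, "limits": resources},
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_rdma_pod("rdma-test", "example.com/rdma-app:latest"), indent=2))
```

The printed JSON could be saved to a file and applied with kubectl once a matching network definition and device plugin configuration exist in the cluster.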

What’s more, Red Hat has recently released its flagship OpenShift Container Platform 4.2 with inbox support for RDMA/RoCE communications over Mellanox ConnectX-4 Lx and ConnectX-5 NICs. Red Hat OpenShift incorporates the community’s work in the space and delivers an enterprise experience for deploying ML/AI applications on bare-metal computing infrastructures, with enhanced performance and efficiency.

It was fascinating to witness the NVIDIA announcement of its revolutionary EGX Supercomputer platform at MWC Los Angeles, which puts Kubernetes right in the middle of the stack, to simplify how organizations deploy, manage and scale AI applications at the network’s edge. NVIDIA also published a list of NVIDIA NGC-Ready for Edge systems, many of which support Mellanox SmartNICs with built-in RDMA/RoCE accelerators, including the HPE ProLiant DL380 Gen10, Dell PowerEdge R640, and more.

Today, with RDMA/RoCE integrated into the mainstream code of popular ML/AI frameworks, including TensorFlow, MXNet, and Microsoft Cognitive Toolkit, I expect to see more Kubernetes and OpenShift deployments that take advantage of the predictable and scalable performance that RDMA delivers.

Accelerating cloud-native networking

DevOps engineers on a Kubernetes journey frequently ask which networking model would work best for their cluster. Kubernetes does not provide a default networking model, but there are ample stacks to choose from, including open-source and commercial options.

As a leading contributor to the Open vSwitch (OVS) community, Mellanox opted to integrate our advanced ASAP² (Accelerated Switch and Packet Processing) technology with the popular Open Virtual Network (OVN) networking model for Kubernetes, which also uses OVS. OVN complements the existing capabilities of OVS with native virtual networking support and delivers a production-quality implementation that can operate at scale. The choice of OVN also aligns with the Red Hat OpenShift roadmap, which recently introduced OVN support as a technical preview feature.

Mellanox ASAP² technology is built into Mellanox SmartNICs and delivers breakthrough cloud networking performance.

At the heart of ASAP² is the eSwitch, an embedded switch built into Mellanox SmartNICs. The beauty of the eSwitch lies in how it allows the SmartNICs to handle a large portion of the packet processing operations in hardware, freeing up host CPU resources. The result is the best of both worlds: the performance and efficiency of bare-metal server networking hardware with the flexibility of software-defined networking (SDN).
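
The offload idea can be illustrated with a toy model. The Python sketch below is not ASAP² or OVS code. It only assumes the general pattern in which the first packet of a flow misses the hardware table and is handled in software, which then installs a rule so that later packets of the same flow are handled in hardware. All class, function, and field names are hypothetical.

```python
from collections import namedtuple

# A flow is identified here by a simplified 5-tuple.
FlowKey = namedtuple("FlowKey", "src_ip dst_ip proto src_port dst_port")


class ToyOffloadSwitch:
    """Toy model: software slow path plus a hardware-style flow table."""

    def __init__(self, policy):
        self.policy = policy      # callable: FlowKey -> action string (software lookup)
        self.hw_table = {}        # rules "installed in hardware"
        self.sw_packets = 0       # packets that consumed host CPU
        self.hw_packets = 0       # packets handled without the host CPU

    def process(self, key):
        action = self.hw_table.get(key)
        if action is not None:
            self.hw_packets += 1  # fast path: rule already offloaded
            return action
        # Slow path: consult the software policy, then offload the resulting rule.
        self.sw_packets += 1
        action = self.policy(key)
        self.hw_table[key] = action
        return action


def example_policy(key):
    # Hypothetical policy: forward TCP traffic, drop everything else.
    return "forward" if key.proto == "tcp" else "drop"


if __name__ == "__main__":
    switch = ToyOffloadSwitch(example_policy)
    flow = FlowKey("10.0.0.1", "10.0.0.2", "tcp", 40000, 80)
    for _ in range(1000):
        switch.process(flow)
    print(switch.sw_packets, switch.hw_packets)  # prints: 1 999
```

In this model only the first packet of the flow touches the software path, which is the intuition behind the CPU savings described above. Real hardware tables also have capacity limits and rule aging, which the sketch leaves out.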

The next step was to enable the OVN CNI to take advantage of the OVS hardware offload capabilities. The team is currently working to integrate the solution with the OVS connection tracking module, so that the NIC hardware can offload the NAT functions that form the basis of pod-to-pod communication in a Kubernetes cluster.

We look forward to introducing the complete solution in the Q1 timeframe, bringing accelerated Kubernetes network connectivity together with advanced SDN features.

Securing cloud-native workloads

Kubernetes security is an immense challenge made up of many highly interrelated parts. The shift from a monolithic model to today’s prominent microservices architecture has completely transformed the way enterprises ship applications at scale. At the same time, cloud-native applications generate intensive data movement between services and physical machines to satisfy a single application request.

Traffic volumes and latency requirements often lead teams to avoid zero-trust security solutions for fear of degrading application performance. This creates an inherent tension within the enterprise between the DevOps team, whose task is to ensure high-quality application delivery, and the security team, whose primary goal is to protect customer data and privacy.

The Mellanox ConnectX-6 Dx, now shipping, and the recently introduced BlueField-2 DPU-based programmable SmartNIC provide hardware-accelerated, software-defined crypto capabilities that are well suited to securing cloud-native workloads.

The Kubernetes platform and its vibrant ecosystem of popular third-party components, including microservices firewalls, service mesh platforms, and ingress gateways, widely use Transport Layer Security (TLS) to encrypt data in motion between system components.

A notable example, natively featured in the platform, is encryption of traffic to the Kubernetes API. A more advanced solution is pod-to-pod encryption, as offered by the leading service mesh platforms Istio and Linkerd.
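
As a rough illustration of what pod-to-pod encryption involves at the socket level, the Python sketch below configures a server-side TLS context that requires client certificates, which is the mutual-TLS pattern. It uses only the standard `ssl` module. The function name and the certificate file parameters are hypothetical, and the sketch does not reflect how any particular service mesh is implemented.

```python
import ssl


def build_mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Return a server-side TLS context that requires client certificates."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse older protocol versions
    ctx.verify_mode = ssl.CERT_REQUIRED            # the peer must present a certificate
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # this workload's identity
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # CA that signs peer certificates
    return ctx


if __name__ == "__main__":
    ctx = build_mtls_server_context()  # no files loaded: configuration only
    print(ctx.verify_mode, ctx.minimum_version)
```

Every connection accepted with such a context pays for a handshake and per-record encryption on the host CPU, which is the cost that crypto acceleration in the NIC aims to reduce.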

The team is currently researching various platforms and integration schemes to leverage the cutting-edge TLS encryption acceleration engines in our SmartNICs. This will allow us to secure cloud-native workloads at 100 Gb/s, full wire speed!

Summary

We are excited to partner with the cloud-native ecosystem on advanced Kubernetes network solutions that scale beyond the public cloud to on-premises and edge data centers.

If you’re attending KubeCon San Diego, I’d be happy to get together to learn about your goals and further the advancement of cloud-native computing. Drop me a note at itayo@mellanox.com to set up a meeting.

For more information about how Mellanox SmartNICs accelerate cloud-native ML/AI applications, see Enabling Scalable and Super-fast Kubernetes Networking for AI.
For more information about Mellanox SmartNICs, see ConnectX Ethernet Adapters.