Data Center / Cloud

Efficient ETL with Polars and Apache Spark on NVIDIA Grace CPU

The NVIDIA Grace CPU Superchip delivers outstanding performance and best-in-class energy efficiency for CPU workloads in the data center and in the cloud. The benefits of NVIDIA Grace include high-performance Arm Neoverse V2 cores, fast NVIDIA-designed Scalable Coherency Fabric, and low-power high-bandwidth LPDDR5X memory. 

These features make the Grace CPU ideal for data processing with extract, transform, load (ETL) workloads, where it delivers world-class performance. ETL workloads are a critical component of online analytical processing (OLAP) and business intelligence (BI) workflows that enable enterprises to gain insights and improve organizational decision-making.

This post explains how the NVIDIA Grace CPU delivers a solution that lowers power consumption when running ETL workloads on single-node Polars and multinode Apache Spark—without compromising performance.

Single-node Polars on CPU

Polars is an open-source library for data processing. It provides high performance for single-node workloads through its Python API. Polars publishes the PDS benchmark in its pola-rs/polars-benchmark GitHub repo, with implementations of several analytics queries derived from TPC-H.

Results obtained using PDS are not comparable to published TPC-H Benchmark results, as the results obtained from using PDS do not comply with the TPC-H Benchmarks. The PDS benchmark includes 22 queries that are implemented using Polars LazyFrame operations, allowing the Polars optimizer to apply predicate pushdown, projection pushdown, and other optimizations. The test used Polars version 1.22.0 with the environment variable POLARS_FORCE_NEW_STREAMING=1 enabled.
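To illustrate the lazy API the benchmark implementations rely on, the following is a minimal sketch of a PDS-style aggregation query in Polars. The file path, column names, and query shape are illustrative (loosely modeled on a TPC-H-derived pricing summary) and are not taken from the benchmark repo; calling explain() prints the optimized plan so you can see predicate and projection pushdown at work.

```python
import os

# Opt in to the new streaming engine, as in the benchmark configuration.
os.environ["POLARS_FORCE_NEW_STREAMING"] = "1"

import polars as pl

# Lazy scan: no data is read until collect() is called.
lineitem = pl.scan_parquet("lineitem.parquet")

query = (
    lineitem
    .filter(pl.col("l_shipdate") <= pl.date(1998, 9, 2))  # candidate for predicate pushdown
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        pl.col("l_quantity").sum().alias("sum_qty"),
        pl.col("l_extendedprice").sum().alias("sum_base_price"),
        pl.len().alias("count_order"),
    )
    .sort("l_returnflag", "l_linestatus")
)

print(query.explain())   # optimized plan, including pushed-down predicates and projections
result = query.collect() # executes the query with the streaming engine
```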

Query runtime data was collected at scale factor 100 (SF100 = 100 GB) with a hot-cache Parquet data source. The Intel Sapphire Rapids system used a Xeon Platinum 8480CL CPU with 112 logical cores and 2 TB of DDR5 system memory. The AMD Turin system used an EPYC 9755 CPU with 256 logical cores and 1.5 TB of DDR5 system memory. For Intel Sapphire Rapids and AMD Turin, the best runtime was observed when limiting execution to the physical cores of a single socket.

Finally, the NVIDIA Grace CPU system used an NVIDIA Grace CPU Superchip, featuring one NVIDIA Grace CPU with 72 physical cores and 120 GB of LPDDR5X system memory. All benchmarks were run on a single socket. The x86 CPU benchmarks were run enabling transparent huge pages (THP) with the environment variable _RJEM_MALLOC_CONF=thp:always.
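For reference, this is a minimal sketch of how the run environment described above could be set from Python before Polars is first imported. The thread count is an illustrative assumption rather than the exact benchmark setting, the jemalloc THP variable applies to the x86 runs, and socket pinning is typically handled outside the script (for example, with numactl).

```python
import os

os.environ["POLARS_FORCE_NEW_STREAMING"] = "1"  # new streaming engine
os.environ["POLARS_MAX_THREADS"] = "56"         # limit Polars to the physical cores of one socket (illustrative count)
os.environ["_RJEM_MALLOC_CONF"] = "thp:always"  # request transparent huge pages from jemalloc (x86 runs)

import polars as pl

print(pl.thread_pool_size())  # confirm the thread pool size Polars will use
```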

Figure 1. Query runtime by CPU model for the 22 queries in PDS SF100

For the PDS SF100 benchmark, our team observed a 25% speedup of the NVIDIA Grace CPU versus AMD Turin 1S, the fastest x86 CPU under test. The speedup is not explained by thread count, clock speed, cache bandwidth, or memory bandwidth.

Instead, we observe an advantage from the 64K default page size on Grace versus both the 4K default page size and the 2 MB THP configuration on x86. For AMD Turin, we observed an 86-second runtime with default settings on a two-socket (2S) machine. Limiting execution to one socket improved the runtime to 60 seconds, and enabling THP plus disabling hyperthreading brought the runtime down to 41 seconds (Figure 1).

For the data processing workload in PDS, the out-of-the-box configuration of NVIDIA Grace CPU delivered the best performance. Refer to the Grace Performance Tuning Guide for more information about page size and other configuration options for NVIDIA Grace.
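As a quick check, the default page size discussed above can be read directly from the operating system. The snippet below is a minimal sketch that prints 65536 on a Grace system running a 64K-page kernel and 4096 on a typical x86 configuration.

```python
import os

# Report the kernel's default memory page size in bytes.
print(os.sysconf("SC_PAGE_SIZE"))
```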

Figure 2. Energy usage in watt-hours (Wh) by CPU model for the 22 queries in PDS SF100

For Polars PDS SF100, servers using NVIDIA Grace CPUs benefit from even larger improvements in energy usage, with estimated energy consumption that is 65% lower than that of equivalent servers with x86 CPUs. The energy consumption analysis is based on 2S servers running two instances of the PDS SF100 workload. The energy consumption estimates use 555 W for the NVIDIA Grace CPU Superchip, 1,120 W for AMD Turin, and 1,050 W for Intel Sapphire Rapids.
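A minimal sketch of that estimation approach is shown below: energy in watt-hours is server power multiplied by runtime in hours. The power figures are the ones quoted above; the 60-second runtime in the example is a hypothetical value used only to show the calculation, not a measured result.

```python
# Estimated server power draw (W) per the figures quoted in the post.
SERVER_POWER_W = {
    "NVIDIA Grace Superchip": 555,
    "AMD Turin (2S)": 1120,
    "Intel Sapphire Rapids (2S)": 1050,
}

def energy_wh(power_w: float, runtime_s: float) -> float:
    """Watt-hours consumed by a workload running at a constant power draw."""
    return power_w * runtime_s / 3600.0

# Example: a hypothetical 60-second run on each server.
for server, power_w in SERVER_POWER_W.items():
    print(f"{server}: {energy_wh(power_w, 60.0):.1f} Wh")
```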

NVIDIA Grace delivers incredible value relative to the competition with 2.7x better performance per watt and 1.6x better performance per dollar.

Figure 3. Performance per dollar and performance per watt by CPU model for the 22 queries in PDS SF100

Multinode Apache Spark on CPU

Apache Spark is a popular and reliable engine for executing data engineering, data science, and machine learning workloads on multinode clusters. NVIDIA open-sourced the NDS (NVIDIA Decision Support) benchmark toolset in the NVIDIA/spark-rapids-benchmarks GitHub repo, with scripts to run decision support queries derived from TPC-DS.

NDS supports both CPU execution using Spark and GPU execution using the RAPIDS Accelerator for Apache Spark plugin. Note that any results obtained using NDS are not comparable to published TPC-DS Benchmark results, as the results obtained from using NDS do not comply with the TPC-DS Benchmarks.

The test used Spark version 3.3.3 and executed 99 queries in sequence, with queries 14, 23, 24, and 39 split into two parts. Query runtime data was collected at scale factor 3,000 (SF3K = 3 TB) with an HDFS (Hadoop Distributed File System) data source. 
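For context, the sketch below shows the general shape of a decision-support-style aggregation run with PySpark over Parquet data on HDFS. The HDFS path and the specific columns are illustrative assumptions; the actual 99 NDS query implementations and run scripts live in the NVIDIA/spark-rapids-benchmarks repo.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("nds-style-query")
    .getOrCreate()
)

# Illustrative HDFS location for one of the SF3K fact tables.
store_sales = spark.read.parquet("hdfs:///nds/sf3000/store_sales")

result = (
    store_sales
    .groupBy("ss_store_sk")
    .agg(
        F.sum("ss_net_paid").alias("total_net_paid"),
        F.count("*").alias("num_sales"),
    )
    .orderBy(F.desc("total_net_paid"))
)

result.show(10)
```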

Two clusters were used to assess Spark performance for the NDS SF3K workload. The first cluster used eight nodes, each with one AMD Genoa EPYC 9354 CPU and 528 GB of system memory. The second cluster also used eight nodes, each with one NVIDIA Grace CPU Superchip and 240 GB of LPDDR5X system memory.

Figure 4. Energy usage by CPU model in watt-hours for the 99 queries in NDS SF3K, run by Apache Spark on an eight-node cluster

For the NDS SF3K benchmark, our team observed similar runtime performance for both eight-node clusters, with the NVIDIA Grace CPU cluster nearly matching the AMD Genoa cluster. 

However, when factoring in estimated energy consumption values of 555 W for each Grace node and 795 W for each Genoa node, the NVIDIA Grace CPU cluster delivers almost 40% more performance at the same power compared to an AMD Genoa cluster.
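Because both clusters have eight nodes and finish the benchmark in roughly the same time, the performance-per-watt gain can be approximated by the ratio of the per-node power figures, as in this minimal sketch. Under an equal-runtime assumption it comes out to about 43%, consistent with the "almost 40%" figure once the Grace cluster's slightly longer measured runtime is factored in.

```python
# Per-node power estimates quoted above (W).
GRACE_NODE_W = 555
GENOA_NODE_W = 795

# Assuming equal cluster runtimes, the power ratio approximates the
# performance-per-watt gain; measured runtimes were only nearly equal.
gain = GENOA_NODE_W / GRACE_NODE_W - 1.0
print(f"~{gain:.0%} more performance at the same power")  # ~43% with equal runtimes
```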

Summary

ETL workloads are critical for today’s organizations to gain insights from their data. Their performance characteristics emphasize large amounts of data movement, frequent communication, and limited opportunities for vectorization. The NVIDIA Grace architecture is optimized for a range of data analytics workloads, including ETL, with high-performance cores, a fast coherency fabric, and massive memory bandwidth, coupled with a larger default page size and lower energy consumption.

The NVIDIA Grace CPU brings lower TCO for ETL workloads in the data center, with up to 2.7x better performance per watt and 1.6x better performance per dollar compared to the latest generation of x86 CPUs.

Deploying NVIDIA Grace for ETL workloads delivers leading performance while reducing power consumption, enabling customers to redirect those power savings toward AI capabilities.

Transitioning to Arm-based NVIDIA Grace also enables tightly coupled CPU and GPU architectures with products such as the NVIDIA GB200 Grace Blackwell Superchip in the NVIDIA GB200 NVL72. With Grace, data centers can standardize on a single CPU architecture that also works across the entire Arm ecosystem.

Learn more about the NVIDIA Grace CPU, including software and system setup.
