Data Center / Cloud

Efficient ETL with Polars and Apache Spark on NVIDIA Grace CPU

The NVIDIA Grace CPU Superchip delivers outstanding performance and best-in-class energy efficiency for CPU workloads in the data center and in the cloud. The benefits of NVIDIA Grace include high-performance Arm Neoverse V2 cores, fast NVIDIA-designed Scalable Coherency Fabric, and low-power high-bandwidth LPDDR5X memory. 

These features make the Grace CPU ideal for data processing with extract, transform, load (ETL) workloads, where it delivers world-class performance. ETL workloads are a critical component of online analytical processing (OLAP) and business intelligence (BI) workflows that enable enterprises to gain insights and improve organizational decision-making.

This post explains how the NVIDIA Grace CPU delivers a solution that lowers power consumption when running ETL workloads on single-node Polars and multinode Apache Spark—without compromising performance.

Single-node Polars on CPU

Polars is an open-source library for data processing. It provides high performance for single-node workloads through its Python API. Polars publishes the PDS benchmark in its pola-rs/polars-benchmark GitHub repo, with implementations of several analytics queries derived from TPC-H.

Results obtained using PDS are not comparable to published TPC-H Benchmark results, as the results obtained from using PDS do not comply with the TPC-H Benchmarks. The PDS benchmark includes 22 queries that are implemented using Polars LazyFrame operations, allowing the Polars optimizer to apply predicate pushdown, projection pushdown, and other optimizations. The test used Polars version 1.22.0 with the environment variable POLARS_FORCE_NEW_STREAMING=1 enabled.
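To illustrate the lazy API the benchmark implementations rely on, the following is a minimal sketch of a PDS-style aggregation query in Polars. The file path, column names, and query shape are illustrative (loosely modeled on a TPC-H-derived pricing summary) and are not taken from the benchmark repo; calling explain() prints the optimized plan so you can see predicate and projection pushdown at work.

```python
import os

# Opt in to the new streaming engine, as in the benchmark configuration.
os.environ["POLARS_FORCE_NEW_STREAMING"] = "1"

import polars as pl

# Lazy scan: no data is read until collect() is called.
lineitem = pl.scan_parquet("lineitem.parquet")

query = (
    lineitem
    .filter(pl.col("l_shipdate") <= pl.date(1998, 9, 2))  # candidate for predicate pushdown
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        pl.col("l_quantity").sum().alias("sum_qty"),
        pl.col("l_extendedprice").sum().alias("sum_base_price"),
        pl.len().alias("count_order"),
    )
    .sort("l_returnflag", "l_linestatus")
)

print(query.explain())   # optimized plan, including pushed-down predicates and projections
result = query.collect() # executes the query with the streaming engine
```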

Query runtime data was collected at scale factor 100 (SF100 = 100 GB) with a hot-cache Parquet data source. The Intel Sapphire Rapids system used a Xeon Platinum 8480CL CPU with 112 logical cores and 2 TB of DDR5 system memory. The AMD Turin system used an EPYC 9755 CPU with 256 logical cores and 1.5 TB of DDR5 system memory. For Intel Sapphire Rapids and AMD Turin, the best runtime was observed when limiting execution to the physical cores of a single socket.

Finally, the NVIDIA Grace CPU system used an NVIDIA Grace CPU Superchip, featuring one NVIDIA Grace CPU with 72 physical cores and 120 GB of LPDDR5X system memory. All benchmarks were run on a single socket. The x86 CPU benchmarks were run enabling transparent huge pages (THP) with the environment variable _RJEM_MALLOC_CONF=thp:always.
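For reference, this is a minimal sketch of how the run environment described above could be set from Python before Polars is first imported. The thread count is an illustrative assumption rather than the exact benchmark setting, the jemalloc THP variable applies to the x86 runs, and socket pinning is typically handled outside the script (for example, with numactl).

```python
import os

os.environ["POLARS_FORCE_NEW_STREAMING"] = "1"  # new streaming engine
os.environ["POLARS_MAX_THREADS"] = "56"         # limit Polars to the physical cores of one socket (illustrative count)
os.environ["_RJEM_MALLOC_CONF"] = "thp:always"  # request transparent huge pages from jemalloc (x86 runs)

import polars as pl

print(pl.thread_pool_size())  # confirm the thread pool size Polars will use
```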

Figure 1. Query runtime by CPU model for the 22 queries in PDS SF100

For the PDS SF100 benchmark, our team observed a 25% speedup of the NVIDIA Grace CPU versus AMD Turin 1S, the fastest x86 CPU under test. The speedup is not explained by thread count, clock speed, cache bandwidth, or memory bandwidth.

Instead, we observe an advantage from the 64K default page size on Grace versus both the 4K default page size and the 2 MB THP configuration on x86. For AMD Turin, we observed an 86-second runtime with default settings on a two-socket (2S) machine. Limiting execution to one socket improved the runtime to 60 seconds, and enabling THP plus disabling hyperthreading brought the runtime down to 41 seconds (Figure 1).

For the data processing workload in PDS, the out-of-the-box configuration of NVIDIA Grace CPU delivered the best performance. Refer to the Grace Performance Tuning Guide for more information about page size and other configuration options for NVIDIA Grace.
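As a quick check, the default page size discussed above can be read directly from the operating system. The snippet below is a minimal sketch that prints 65536 on a Grace system running a 64K-page kernel and 4096 on a typical x86 configuration.

```python
import os

# Report the kernel's default memory page size in bytes.
print(os.sysconf("SC_PAGE_SIZE"))
```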

Figure 2. Energy usage in watt-hours (Wh) by CPU model for the 22 queries in PDS SF100

For Polars PDS SF100, servers using NVIDIA Grace CPUs benefit from even larger improvements in energy usage, with estimated energy consumption that is 65% lower than that of equivalent servers with x86 CPUs. The energy consumption analysis is based on 2S servers running two instances of the PDS SF100 workload. The energy consumption estimates use 555 W for the NVIDIA Grace CPU Superchip, 1,120 W for AMD Turin, and 1,050 W for Intel Sapphire Rapids.
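A minimal sketch of that estimation approach is shown below: energy in watt-hours is server power multiplied by runtime in hours. The power figures are the ones quoted above; the 60-second runtime in the example is a hypothetical value used only to show the calculation, not a measured result.

```python
# Estimated server power draw (W) per the figures quoted in the post.
SERVER_POWER_W = {
    "NVIDIA Grace Superchip": 555,
    "AMD Turin (2S)": 1120,
    "Intel Sapphire Rapids (2S)": 1050,
}

def energy_wh(power_w: float, runtime_s: float) -> float:
    """Watt-hours consumed by a workload running at a constant power draw."""
    return power_w * runtime_s / 3600.0

# Example: a hypothetical 60-second run on each server.
for server, power_w in SERVER_POWER_W.items():
    print(f"{server}: {energy_wh(power_w, 60.0):.1f} Wh")
```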

NVIDIA Grace delivers incredible value relative to the competition with 2.7x better performance per watt and 1.6x better performance per dollar.

Figure 3. Performance per dollar and performance per watt by CPU model for the 22 queries in PDS SF100

Multinode Apache Spark on CPU

Apache Spark is a popular and reliable engine for executing data engineering, data science, and machine learning workloads on multinode clusters. NVIDIA open-sourced the NDS (NVIDIA Decision Support) benchmark toolset in the NVIDIA/spark-rapids-benchmarks GitHub repo, with scripts to run decision support queries derived from TPC-DS.

NDS supports both CPU execution using Spark and GPU execution using the RAPIDS Accelerator for Apache Spark plugin. Note that any results obtained using NDS are not comparable to published TPC-DS Benchmark results, as the results obtained from using NDS do not comply with the TPC-DS Benchmarks.

The test used Spark version 3.3.3 and executed 99 queries in sequence, with queries 14, 23, 24, and 39 split into two parts. Query runtime data was collected at scale factor 3,000 (SF3K = 3 TB) with an HDFS (Hadoop Distributed File System) data source. 
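For context, the sketch below shows the general shape of a decision-support-style aggregation run with PySpark over Parquet data on HDFS. The HDFS path and the specific columns are illustrative assumptions; the actual 99 NDS query implementations and run scripts live in the NVIDIA/spark-rapids-benchmarks repo.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("nds-style-query")
    .getOrCreate()
)

# Illustrative HDFS location for one of the SF3K fact tables.
store_sales = spark.read.parquet("hdfs:///nds/sf3000/store_sales")

result = (
    store_sales
    .groupBy("ss_store_sk")
    .agg(
        F.sum("ss_net_paid").alias("total_net_paid"),
        F.count("*").alias("num_sales"),
    )
    .orderBy(F.desc("total_net_paid"))
)

result.show(10)
```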

Two clusters were used to assess Spark performance for the NDS SF3K workload. The first cluster used eight nodes, each with one AMD Genoa EPYC 9354 CPU and 528 GB of system memory. The second cluster also used eight nodes, each with one NVIDIA Grace CPU Superchip and 240 GB of LPDDR5X system memory.

Figure 4. Energy usage by CPU model in watt-hours for the 99 queries in NDS SF3K, run by Apache Spark on an eight-node cluster

For the NDS SF3K benchmark, our team observed similar runtime performance for both eight-node clusters, with the NVIDIA Grace CPU cluster nearly matching the AMD Genoa cluster. 

However, when factoring in estimated energy consumption values of 555 W for each Grace node and 795 W for each Genoa node, the NVIDIA Grace CPU cluster delivers almost 40% more performance at the same power compared to an AMD Genoa cluster.
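Because both clusters have eight nodes and finish the benchmark in roughly the same time, the performance-per-watt gain can be approximated by the ratio of the per-node power figures, as in this minimal sketch. Under an equal-runtime assumption it comes out to about 43%, consistent with the "almost 40%" figure once the Grace cluster's slightly longer measured runtime is factored in.

```python
# Per-node power estimates quoted above (W).
GRACE_NODE_W = 555
GENOA_NODE_W = 795

# Assuming equal cluster runtimes, the power ratio approximates the
# performance-per-watt gain; measured runtimes were only nearly equal.
gain = GENOA_NODE_W / GRACE_NODE_W - 1.0
print(f"~{gain:.0%} more performance at the same power")  # ~43% with equal runtimes
```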

Summary

ETL workloads are critical for today’s organizations to gain insights from their data. Their performance characteristics emphasize large amounts of data movement, frequent communication, and limited opportunities for vectorization. The NVIDIA Grace architecture is optimized for a range of data analytics workloads, including ETL, with high-performance cores, a fast coherency fabric, and massive memory bandwidth, coupled with a larger default page size and lower energy consumption.

The NVIDIA Grace CPU brings lower TCO for ETL workloads in the data center, with up to 2.7x better performance per watt and 1.6x better performance per dollar compared to the latest generation of x86 CPUs.

Deploying NVIDIA Grace for ETL workloads delivers leading performance while reducing power consumption, enabling customers to redirect those power savings toward AI capabilities.

Transitioning to Arm-based NVIDIA Grace also enables tightly coupled CPU and GPU architectures with products such as the NVIDIA GB200 Grace Blackwell Superchip in the NVIDIA GB200 NVL72. With Grace, data centers can standardize on a single CPU architecture that also works across the entire Arm ecosystem.

Learn more about the NVIDIA Grace CPU, including software and system setup.
