With the rapid growth of generative AI, CIOs and IT leaders are looking for ways to reclaim data center resources to accommodate new AI use cases that promise greater return on investment without impacting current operations. This is leading IT decision makers to reassess past infrastructure decisions and explore strategies to consolidate traditional workloads into fewer, more power-efficient nodes, freeing up data center power and space.
NVIDIA GH200 Grace Hopper Superchip is the first memory-converged CPU-GPU superchip, designed from the ground up to meet the challenges of AI, high-performance computing, and data processing. By migrating Apache Spark workloads from CPU nodes to NVIDIA GH200, data centers and enterprises can accelerate query response times by up to 35x. For large Apache Spark clusters of 1,500+ nodes, this speedup translates to up to 22x fewer nodes and annual energy savings of up to 14 GWh.
This post explores the architectural innovations of NVIDIA GH200 for data processing, shares SQL benchmark results for GH200, and provides insights on seamlessly migrating Apache Spark workloads to this new platform.
Tackling legacy bottlenecks in CPU-based Apache Spark systems
Over the last decade, enterprises have grappled with overwhelming volumes of business, consumer, and IoT data, which are increasingly pivotal for maintaining a competitive edge within industries. To address this challenge, many enterprises have turned to Apache Spark, an open source, multi-language engine for distributed big data processing.
Apache Spark began as a research project at the University of California, Berkeley, with the goal of addressing the limitations of earlier big data frameworks. It achieved this by caching data in CPU memory, which significantly accelerated SQL queries. Today, tens of thousands of organizations rely on Apache Spark for diverse data processing tasks spanning a wide array of industries, including financial services, healthcare, manufacturing, and retail.
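The caching model behind those early speedups is easy to see in practice. The following is a minimal PySpark sketch; the file path, table name, and column are illustrative placeholders, not taken from the benchmarks discussed below.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Read once from slow storage (path is a hypothetical placeholder).
events = spark.read.parquet("s3://example-bucket/events/")

# Pin the DataFrame in executor memory so repeated queries skip the
# disk/object-storage scan that earlier frameworks paid on every pass.
events.cache()
events.createOrReplaceTempView("events")

# The first action materializes the cache; subsequent queries hit memory.
spark.sql("SELECT COUNT(*) FROM events").show()
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```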
Despite its ability to alleviate the bottleneck of data access from slower hard disks and cloud-based object storage through memory caching, many Apache Spark data processing workflows still encounter constraints due to hardware limitations inherent in CPU architectures.
Pioneering a new era of converged CPU-GPU superchips
Recent advancements in storage and networking bandwidth, along with the end of Moore's law, have shifted analytics and query bottlenecks to CPUs. Meanwhile, GPUs have emerged as the preferred platform for deep learning workloads due to their vast number of processing cores and high-bandwidth memory, which excel at highly parallelized processing. Parallelizing Apache Spark workloads and running them on GPUs delivers order-of-magnitude speedups compared to CPUs.
Running Apache Spark workloads on GPUs previously required transferring data back and forth between the host CPU and the GPU, traditionally bound by PCIe interfaces at 128 GB/s. To overcome this challenge, NVIDIA developed NVIDIA Grace Hopper, a new class of superchip that brings together the Arm-based NVIDIA Grace CPU and NVIDIA Hopper GPU architectures using NVLink-C2C interconnect technology. NVLink-C2C delivers up to 900 GB/s of total bandwidth, 7x higher than the standard PCIe Gen5 lanes found in traditional x86-based, GPU-accelerated systems.
With GH200, the CPU and GPU share a single per-process page table, enabling all CPU and GPU threads to access all system-allocated memory, which can reside physically in either CPU or GPU memory. This architecture removes the need to copy memory back and forth between the CPU and the GPU.
NVIDIA GH200 sets new highs in NDS performance benchmarks
To measure the performance and cost savings of running Apache Spark on GH200, we used the NVIDIA Decision Support (NDS) benchmark. NDS is derived from the widely adopted TPC-DS data processing benchmark. NDS consists of the same SQL queries included in TPC-DS, with modifications only to the data generation and benchmark execution scripts. NDS is not TPC-DS, and NDS results are not comparable to official, audited TPC-DS results, only to other NDS results.
Running the 100+ TPC-DS SQL queries with NDS execution scripts on a 10 TB dataset took 6 minutes using 16 GH200 superchips, compared to 42 minutes on an equal number of premium x86 CPU nodes: a 7x end-to-end speedup.
Specifically, queries with a high number of aggregate and join operations exhibited significantly higher acceleration, up to 36x; a simplified sketch of one such query follows the list.
- Query67, accelerated by 36x, finds top stores for different product categories based on store sales in a specific year. It involves a high number of aggregate and shuffle operations.
- Query14, accelerated by 10x, calculates the sum of the extended sales price of store transactions for each item in a specific year and month. It involves a high number of shuffle and join operations.
- Query87, accelerated by 9x, counts how many customers ordered items on the web and from the catalog, and bought items in a store, all on the same day. It involves a high number of scan and aggregate operations.
- Query59, accelerated by 9x, reports the increase of weekly store sales from one year to the next year for each store and day of the week. It involves a high number of aggregate and join operations.
- Query38, accelerated by 8x, displays the count of customers with purchases from all three channels in a given year. It involves a high number of distinct aggregate and join operations.
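To make these query shapes concrete, here is a simplified, unofficial sketch in the spirit of Query38, counting customers who purchased through all three channels in a given year. It is abbreviated from the real TPC-DS query text; the table and column names follow the standard TPC-DS schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query38-sketch").getOrCreate()

# Each branch joins a sales fact table to date_dim and deduplicates
# customers; INTERSECT keeps only customers present in all three channels.
spark.sql("""
    SELECT COUNT(*) AS multi_channel_customers
    FROM (
        SELECT DISTINCT ss_customer_sk
        FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
        WHERE d_year = 2001
        INTERSECT
        SELECT DISTINCT cs_bill_customer_sk
        FROM catalog_sales JOIN date_dim ON cs_sold_date_sk = d_date_sk
        WHERE d_year = 2001
        INTERSECT
        SELECT DISTINCT ws_bill_customer_sk
        FROM web_sales JOIN date_dim ON ws_sold_date_sk = d_date_sk
        WHERE d_year = 2001
    ) hot_customers
""").show()
```

The distinct aggregations and repeated fact-to-dimension joins are exactly the operation mix that benefits most from GPU acceleration.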
Reducing power consumption and cutting energy costs
As the datasets grow in size, GH200 delivers even more query acceleration and node consolidation benefits. Running the same 100+ queries on the 10x larger SF100 dataset (100 TB) required a total of 40 minutes on the 16-node GH200 cluster.
Achieving an equivalent 40-minute response time on the 100 TB dataset using premium CPU nodes would have required a total of 344 nodes. This translates to a 22x reduction in node count and a 12x reduction in energy use. For organizations running a large Apache Spark CPU cluster, which can sometimes exceed 1,500 nodes, the energy savings are significant, reaching up to 14 GWh annually.
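As a quick sanity check on those ratios, the back-of-envelope calculation below reproduces them; the node counts come from the benchmark, while the per-node power draws are placeholder assumptions for illustration, not measured figures.

```python
HOURS_PER_YEAR = 24 * 365

# Node counts from the 100 TB NDS run above.
cpu_nodes, gh200_nodes = 344, 16
print(f"Node reduction: {cpu_nodes / gh200_nodes:.1f}x")  # ~21.5x, i.e. up to 22x

# Assumed average power draw per node in kW (placeholders; a GH200 node
# draws more than a CPU node, but there are far fewer of them).
cpu_kw, gh200_kw = 1.0, 1.8

cpu_gwh = cpu_nodes * cpu_kw * HOURS_PER_YEAR / 1e6      # GWh per year
gh200_gwh = gh200_nodes * gh200_kw * HOURS_PER_YEAR / 1e6
print(f"Energy reduction: {cpu_gwh / gh200_gwh:.0f}x")   # ~12x under these assumptions
```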
Exceptional SQL acceleration and price performance
HEAVY.AI, a leading GPU-accelerated analytics platform and database provider, benchmarked a single GH200 GPU cloud instance against an 8x NVIDIA A100 PCIe-based cloud instance running HeavyDB and the NDS-H benchmark.
HEAVY.AI reported an average 5x speedup using the GH200 instance, translating to a 16x cost savings on the SF100 dataset. On the larger SF200 dataset, which does not fit in a single GH200 GPU's memory and must spill to Grace CPU memory over the low-latency, high-bandwidth NVLink-C2C interconnect, HEAVY.AI reported a 2x speedup and 6x cost savings compared to the 8x NVIDIA A100 x86 PCIe-based instance.
“Our customers make data-driven, time-sensitive decisions that have a high impact on their business,” said Todd Mostak, CTO and co-founder of HEAVY.AI. “We’re excited about the new business insights and cost savings that GH200 will unlock for our customers.”
Get started with your GH200 Apache Spark migration
Enterprises can take advantage of the RAPIDS Accelerator for Apache Spark to seamlessly migrate Apache Spark workloads to NVIDIA GH200. RAPIDS Accelerator for Apache Spark leverages GPUs to accelerate processing by combining the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. Enterprises can run existing Apache Spark applications on GPUs with no code change by launching Spark with the RAPIDS Accelerator for Apache Spark plug-in jar.
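As a sketch of what that launch looks like, the PySpark session below enables the plug-in. The JAR path and version string are placeholders for whatever release you download; spark-submit users would pass the equivalent --jars and --conf flags instead.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("existing-spark-app-on-gpu")
    # Placeholder path/version; point at the RAPIDS Accelerator JAR you downloaded.
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-24.04.0.jar")
    # Load the accelerator plug-in; supported operators then run on the GPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # On a cluster, also request a GPU per executor (some resource managers
    # additionally need a GPU discovery script configured).
    .config("spark.executor.resource.gpu.amount", "1")
    .getOrCreate()
)

# Existing DataFrame/SQL application code runs unchanged from here on.
```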
Today, GH200 powers nine supercomputers around the world, is offered by a wide array of system makers, and can be accessed on demand at cloud providers such as Vultr, Lambda, and CoreWeave. You can also test GH200 through NVIDIA LaunchPad. To learn more about Apache Spark acceleration on GH200, check out the GTC 2024 session Accelerate ETL and Machine Learning in Apache Spark on demand.