RAPIDS Accelerator for Apache Spark v21.06 Release

Introduction

RAPIDS Accelerator for Apache Spark v21.06 is here! You may notice right away that we’ve had a huge leap in version number since we announced our last release. Don’t worry, you haven’t missed anything. RAPIDS Accelerator is built on cuDF, part of the RAPIDS ecosystem. RAPIDS transitioned to calendar versioning (CalVer) in the last release, and, from now on, our releases will follow the same convention.

We like CalVer because it is simple, and this new release is all about making your data science work simple as well. Of course, we’ve made changes to accommodate new versions of Apache Spark, but we’ve also simplified installation. We’ve added a profiling tool so it’s easier to identify the best workloads to run on the GPU.

There is a host of new functions to make life working with data easier. And we’ve expanded our community: if you are a Cloudera or an Azure user, GPU acceleration is more straightforward than ever. Let’s get into the details.

Updates

We added support for Apache Spark version 3.1.2 and Databricks 8.2ML GPU runtime.

To simplify the installation of the plug-in, we now have a single RAPIDS cuDF jar that works with all versions of NVIDIA CUDA 11.x. The jar was tested with CUDA 11.0 and 11.2 and relies on CUDA enhanced compatibility to work with any version of CUDA 11.
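As a rough illustration of what the single-jar setup means in practice, the sketch below assembles a spark-submit command line that loads the plug-in. The jar paths and the application name are placeholders rather than real file names, and the helper function is purely illustrative. The `spark.plugins` and `spark.rapids.sql.enabled` settings are the plug-in's documented configuration keys.

```python
import shlex

# Placeholder jar locations: substitute the paths of the jars you downloaded.
RAPIDS_JAR = "/opt/sparkRapidsPlugin/rapids-4-spark.jar"  # hypothetical path
CUDF_JAR = "/opt/sparkRapidsPlugin/cudf-cuda11.jar"       # hypothetical path


def build_submit_command(app, rapids_jar=RAPIDS_JAR, cudf_jar=CUDF_JAR):
    """Return a spark-submit argument list that loads the RAPIDS plug-in.

    Illustrative helper only; it is not part of the plug-in.
    """
    return [
        "spark-submit",
        # One cuDF jar is passed regardless of the installed CUDA 11.x version.
        "--jars", f"{rapids_jar},{cudf_jar}",
        "--conf", "spark.plugins=com.nvidia.spark.SQLPlugin",
        "--conf", "spark.rapids.sql.enabled=true",
        app,
    ]


command = build_submit_command("my_etl_job.py")  # hypothetical application
print(shlex.join(command))
```

The point of the sketch is that the jar list no longer has to vary with the CUDA 11 minor version installed on the cluster.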

Profiling and qualification tool

The RAPIDS Accelerator for Apache Spark now has an early release of a tool that analyzes Spark event logs to find jobs that are a good fit for GPU acceleration and to profile jobs running with the plug-in. If applied to a single application, the tool looks at CPU event logs. In the case of multiple applications, the tool filters out individual application event logs and then reports the percentage of the runtime spent in SQL/DataFrame operations. It also breaks down the runtime of these operations into I/O, computation, and shuffle.
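To make the reported numbers concrete, here is a small sketch of the arithmetic behind a SQL/DataFrame runtime percentage and an I/O, computation, and shuffle breakdown. It is not the tool's implementation: the `AppSummary` record, its field names, the sample figures, and the simplifying assumption that SQL/DataFrame time is the sum of the three components are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AppSummary:
    """Hypothetical per-application totals, in milliseconds."""
    name: str
    total_runtime_ms: int
    io_ms: int
    compute_ms: int
    shuffle_ms: int

    @property
    def sql_ms(self):
        # Simplifying assumption: SQL/DataFrame time is the sum of its parts.
        return self.io_ms + self.compute_ms + self.shuffle_ms


def sql_share(app):
    """Percentage of the application's runtime spent in SQL/DataFrame work."""
    if app.total_runtime_ms == 0:
        return 0.0
    return 100.0 * app.sql_ms / app.total_runtime_ms


def breakdown(app):
    """Split SQL/DataFrame time into I/O, computation, and shuffle percentages."""
    if app.sql_ms == 0:
        return {"io": 0.0, "compute": 0.0, "shuffle": 0.0}
    return {
        "io": 100.0 * app.io_ms / app.sql_ms,
        "compute": 100.0 * app.compute_ms / app.sql_ms,
        "shuffle": 100.0 * app.shuffle_ms / app.sql_ms,
    }


# Made-up sample data for two applications.
apps = [
    AppSummary("nightly_etl", 1000, 200, 500, 100),
    AppSummary("ml_prep", 1000, 50, 100, 50),
]

# Applications that spend more of their runtime in SQL/DataFrame
# operations are listed first.
ranked = sorted(apps, key=sql_share, reverse=True)
for app in ranked:
    print(app.name, sql_share(app), breakdown(app))
```

In this made-up data, `nightly_etl` spends 80% of its runtime in SQL/DataFrame operations and `ml_prep` only 20%, so the first would be the more promising candidate to try on the GPU.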

The profiling tool gives users information to help them debug their jobs. The key information it provides includes the Spark version, Spark properties (including those provided by the plug-in), Hadoop properties, lists of failed jobs and failed executors, query duration comparisons for all the queries in the input event logs, and the data format and storage type used.

As the tool is still in the early stages, we are excited to hear back from users on how we can improve the tool and their experience.

New functionality

This new release has additional functionality for arrays and structs. We can now sort on struct keys, have structs with map values, and cache structs. We now support concatenation of array columns, creation of 2D arrays, partitioning on arrays, and more. We’ve also expanded windowing lead/lag to support arrays, and range windows now support non-timestamp order by expressions. One of the important scaling capabilities is enabling large joins (for example, joins with large skew) to spill out of GPU memory and complete successfully. In addition, GPUDirect Storage has been integrated into the plug-in to enable Direct Memory Access (DMA) between storage and GPUs, with increased bandwidth and lower latency, to reduce I/O overhead for spilling and improve performance. For a detailed list of new features, please refer to the 21.06.0 changelog on the GitHub site.
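For readers less familiar with range windows, the following pure-Python sketch shows what a range frame over a non-timestamp (here, integer) order-by key computes. It only illustrates the semantics and is not how the plug-in evaluates windows on the GPU. The function name and sample rows are made up. In Spark SQL the equivalent frame would be written as RANGE BETWEEN 2 PRECEDING AND CURRENT ROW over an integer ORDER BY column.

```python
def range_window_sum(rows, preceding, following):
    """For each (key, value) row, sum the values of all rows whose key lies
    in [key - preceding, key + following].

    The frame is defined by key distance, not by row count, which is what
    distinguishes a range window from a rows window.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    result = []
    for key, _ in ordered:
        lo, hi = key - preceding, key + following
        total = sum(v for k, v in ordered if lo <= k <= hi)
        result.append((key, total))
    return result


# Made-up sample: integer order keys with gaps between them.
rows = [(1, 10), (2, 20), (4, 40), (7, 70)]
print(range_window_sum(rows, preceding=2, following=0))
# prints [(1, 10), (2, 30), (4, 60), (7, 70)]
```

Because the frame depends on key values, the row with key 7 sees only itself: no other key falls within 2 of it, even though other rows precede it.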

Growing community

NVIDIA and Cloudera have continued to expand their partnership. Cloudera Data Platform (CDP) integration with RAPIDS Accelerator will be generally available on CDP PVC Base 7.1.6 release from July 15. With this integration, on-prem (private cloud) customers can accelerate their ETL workloads with NVIDIA-Certified systems and Cloudera Data Platform. See the press release from Cloudera here and check out our joint webinar here.

We’re also excited to let you know that NVIDIA and Microsoft have teamed up to bring RAPIDS Accelerator to Azure Synapse. Support for the Accelerator is now built in, and customers can use NVIDIA GPUs for Apache Spark applications with no code changes and with an experience identical to a CPU cluster.

Coming soon

RAPIDS Accelerator for Apache Spark is a year old and growing fast. Our next release will expand read/write support for the Parquet and ORC data formats, add operator support for Struct, Map, and List data types, increase stability, and deliver an initial implementation of out-of-core group by operations. Look forward to our next release in August, and in the meantime, follow all our developments on GitHub and blogs.