Prototyping Faster with the Newest UDF Enhancements in the NVIDIA cuDF API

Over the past few releases, the NVIDIA cuDF team has added several new features to user-defined functions (UDFs) that can streamline the development process while improving overall performance. In this post, I walk through the new UDF enhancements and show how you can take advantage of them within your own applications:

  • The cuDF Series.apply API and how to use it
  • The cuDF DataFrame.apply API and how to write a UDF in terms of “rows”
  • Enhanced support for missing data using both apply APIs
  • A real-world use case example with timing
  • Practical considerations, limitations, and future plans

apply API for cuDF Series

For those not familiar with pandas, Series.apply is the main entry point for mapping an arbitrary Python function onto a single series of data. For example, you might want to convert temperature in Celsius to Fahrenheit using a formula already written as a Python function.

Here is a quick refresher followed by the output of running this code:


Technically, you can write any valid Python code within the function f and pandas runs the function in a loop over the series. This makes apply extremely flexible in the context of pandas, as any UDF can be successfully applied as long as it can successfully handle all of the input data—even UDFs that rely on external libraries or ones that expect or return arbitrary Python objects.

But this flexibility comes at the cost of performance. Running a Python function in a long loop is not an efficient strategy, for a variety of reasons (for example, Python is an interpreted language). This performance constraint can be frustrating when your UDFs are simple, such as those composed of purely mathematical operations on scalar values.

Luckily, these use cases are what cuDF was built for. Recent cuDF improvements within the scope of UDFs have motivated the introduction of the equivalent apply API:

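A sketch of the cuDF equivalent, reusing the same kind of function (the sample values are again assumptions):

import cudf

# the same pure-Python UDF, unchanged
def f(x):
    return x * 9 / 5 + 32

sr = cudf.Series([0.0, 25.0, 100.0])
print(sr.apply(f))
# 0     32.0
# 1     77.0
# 2    212.0
# dtype: float64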

For numeric values, this produces the same results as pandas. The only notable difference is that the result always has a concrete cuDF dtype, rather than the object dtype that pandas often produces.

The function f can contain any Python UDF that is composed of pure math or Python operations. cuDF deduces an appropriate return dtype based on the inspection of the function through Numba and compiles and runs an equivalent function on the GPU.

Functions can also be written to accept an arbitrary number of scalar arguments. In the following code example, you can see that args= is supported:

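A sketch of passing an extra scalar through args= (the function and values here are assumptions):

import cudf

# a UDF that takes an extra scalar argument in addition to each element
def f(x, offset):
    return x + offset

sr = cudf.Series([1, 2, 3])
print(sr.apply(f, args=(42,)))
# 0    43
# 1    44
# 2    45
# dtype: int64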

While there are other ways of accomplishing the same goal in cuDF using custom kernels and other methods, this method of writing UDFs helps to abstract the GPU away from the process, which can cut down on development time for data scientists working on fast-paced, real-world projects.

So far, I’ve covered only the case of Series-based data. That is, I’ve shown you how to write a UDF with a single input and output. Many use cases require multi-column input, however, and this requires slightly different thinking.

DataFrame UDFs and thinking in terms of rows

UDFs that expect multiple columns as input and produce a single column as output are the set of functions supported by the pandas DataFrame apply API.

In these cases, the first function argument represents a row of data rather than just one value from a single input column. By row, I mean some kind of data structure that is keyable to obtain values, where the keys are the column names and the values are the scalars corresponding to the values of those columns in that row. It is conceptually what you get when you use iloc in pandas:

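For instance, a small sketch with assumed column names A and B:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# iloc returns a row-like object keyed by column name
print(df.iloc[0])
# A    1.0
# B    4.0
# Name: 0, dtype: float64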

The following code example shows how you would write and use a UDF in pandas that consumes this kind of row object:

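A sketch of such a row-wise UDF in pandas (same assumed columns as above):

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# a row-wise UDF: keys are column names, values are that row's scalars
def f(row):
    return row["A"] + row["B"]

print(df.apply(f, axis=1))
# 0    5.0
# 1    7.0
# 2    9.0
# dtype: float64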

cuDF now enables you to do the exact same thing without rewriting your UDF:

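A sketch of the cuDF version, with the data moved to the GPU and the UDF left untouched:

import cudf

# the same data and the same row-wise UDF, unchanged from the pandas version
gdf = cudf.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

def f(row):
    return row["A"] + row["B"]

print(gdf.apply(f, axis=1))
# 0    5.0
# 1    7.0
# 2    9.0
# dtype: float64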

When applying these functions, it is important to note that even though the cuDF API expects you to write the functions in terms of rows, no rows are actually involved when it comes to the execution of this function.

cuDF avoids the use of a for-loop and instead executes CUDA kernels that “pretend” rows of data exist. With a little magic, Numba knows how to write a proper kernel to get the same result as pandas. Because there is no loop, you should see higher performance when executing functions through this API.

Support for missing values using the Series and DataFrame apply

Historically, UDFs in cuDF have not provided full support for missing values. This is due to architectural choices inside cuDF that relate to the way cuDF records which elements are null, specifically its use of a null mask to conserve memory.

The looping design of pandas apply APIs just works if the data contains null values. If a null is encountered in the data, the UDF receives the special value pd.NA. As a result, if the special value does not trigger an error, the execution proceeds as normal. However, cuDF does not work this way, and it requires a little extra machinery to support the same functionality. If you use the cuDF apply API, you should find that your UDFs treat null values in a natural manner:

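For example, a sketch in which a null in the input simply propagates to the output (the values are assumptions):

import cudf

def f(x):
    # adding 1 to a null produces a null, with no special handling needed
    return x + 1

sr = cudf.Series([1, None, 3])
print(sr.apply(f))
# 0       2
# 1    <NA>
# 2       4
# dtype: int64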

You can even condition on the cudf.NA singleton and get the expected answer, or return it directly from the function:

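A sketch of branching on and returning cudf.NA (the threshold and replacement value are assumptions):

import cudf

def f(x):
    # replace missing values, and mask out anything above a threshold
    if x is cudf.NA:
        return 0
    elif x > 2:
        return cudf.NA
    else:
        return x + 1

sr = cudf.Series([1, None, 3])
print(sr.apply(f))
# 0       2
# 1       0
# 2    <NA>
# dtype: int64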

As with rows, cuDF does not actually run the Python function the way pandas does. Instead, it uses more Numba magic to translate this class of functions into an equivalent CUDA kernel and then returns the result of that kernel.

In the next section, I look at a real-world example and perform some rough timing.

Real-world example using apply

Consider this scenario: An online streaming service is investigating which segments of its subscribers tend to hold their subscriptions the longest. Additionally, leadership has requested a specific segmentation scheme that breaks subscribers up by age:

  • 18–19
  • 20–29
  • 30–39
  • 40–49
  • 50–59
  • 60–69
  • 70+

The provided data only has two fields: age and days_subscribed.

Here’s how a UDF can solve the problem. First, write the row-wise custom function that assigns each subscriber to a segment. Next, take the results, group by the segment ID, and average days_subscribed within each group.

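A sketch of that workflow over randomly generated subscribers with the two stated fields (the row count and value ranges are assumptions):

import cudf
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
df = cudf.DataFrame({
    "age": rng.integers(18, 90, n),
    "days_subscribed": rng.integers(1, 1000, n),
})

# row-wise UDF: map an age onto one of the requested segment IDs
def f(row):
    age = row["age"]
    if age < 20:
        return 0
    elif age < 30:
        return 1
    elif age < 40:
        return 2
    elif age < 50:
        return 3
    elif age < 60:
        return 4
    elif age < 70:
        return 5
    else:
        return 6

df["group"] = df.apply(f, axis=1)
print(df.groupby("group")["days_subscribed"].mean())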

In this code example, the data is randomly generated, so your mileage may vary on the actual answer. However, it demonstrates the process. Timing the UDF section of the code involves creating a pandas copy through pdf = df.to_pandas(), then running a rough comparison using IPython:

%timeit df.apply(f, axis=1)
# 1.64 ms ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit pdf.apply(f, axis=1)
# 19.2 s ± 63.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Although this is not an official benchmark, the CUDA kernel is over four orders of magnitude faster on average in this particular case, which was run on a 32 GB V100 GPU.

Practical considerations, limitations, and the future

While these cuDF improvements represent significantly broader capabilities than previous iterations, there is always room to grow. Here is a list of key items to consider when writing UDFs for apply in cuDF:

  • JIT compilation. The first time a function is executed against a cuDF object, you incur the overhead of compiling the correct CUDA kernel. Subsequent uses of the function do not require recompilation, unless the dtypes of the target dataset change.
  • dtype support. So far, only numeric dtypes are supported in apply. However, support for additional types is on the roadmap, starting with strings.
  • External libraries. A common pattern is performing data prep in pandas and then using an external library for processing inside the UDF for each row. Because you cannot map external code onto the GPU arbitrarily, this is not currently supported.

Summary

UDFs are an easy way of solving particular problems quickly. They help you think in terms of a single datum when designing the logic of your pipeline. With these new cuDF UDF enhancements, the aim is to expedite the development of workflows involving cuDF and allow you to quickly prototype solutions, as well as reuse existing business logic. In addition, null support lets you be explicit about how to handle missing values without needing extra processing steps.

As a reminder, UDFs are an area of active development in cuDF and updates are ongoing. If you choose to try these new UDF enhancements out, as always, I’d love to hear about your experience in the comments section.

Update 11/20/2023: RAPIDS cuDF now comes with a pandas accelerator mode that enables you to run existing pandas workflows on GPUs with up to 150x speed-ups, with zero code changes, while maintaining compatibility with third-party libraries. The code in this blog still functions as expected, but we recommend using the pandas accelerator mode for a seamless experience. Learn more about the new release in this TechBlog post.
