Accelerating Sequential Python User-Defined Functions with RAPIDS on GPUs for 100X Speedups

Motivation

Custom “row-by-row” processing logic (sometimes called sequential User-Defined Functions) is prevalent in ETL workflows. The sequential nature of UDFs makes parallelization on GPUs tricky. This blog post covers how to implement the same UDF logic using RAPIDS to parallelize computation on GPUs and unlock 100x speedups.

Introduction

Typically, sequential UDFs revolve around records with the same grouping key. One example is analyzing a user’s clicks per session on a website, where a session ends after 60 minutes of inactivity (the timeout used in the code throughout this post). In Apache Spark, you’d first cluster on the user key and then segregate user sessions with a UDF like the one in the example below. Spark parallelizes this by running the sequential function on multiple CPU threads at once.

	# Requires input lines in the format <user_sk>\t<timestamp>,
	# clustered by user_sk and sorted by <timestamp> ascending.
	# click_rows is assumed to be an iterable of such lines (for example, sys.stdin).
	current_uid = ''
	last_click_time = -1
	perUser_sessionID_counter = 1
	timeout = 60*60  # session timeout in seconds

	for line in click_rows:
	    user_sk, tstamp_str = line.strip().split("\t")
	    tstamp = int(tstamp_str)  # Python 3 has no separate long type

	    # reset if the next user's partition begins
	    if current_uid != user_sk:
	        current_uid = user_sk
	        perUser_sessionID_counter = 1
	        last_click_time = tstamp

	    # time between clicks exceeds session timeout?
	    if tstamp - last_click_time > timeout:
	        perUser_sessionID_counter += 1

	    last_click_time = tstamp
	    print(f"{user_sk}, {perUser_sessionID_counter}")

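To try the loop above without a Spark job, it can be wrapped in a function and fed a few hand-written rows. This is a minimal sketch: the function name `sessionize` and the sample rows are made up for illustration, and the logic is otherwise the same as in the snippet above.

```python
def sessionize(click_rows, timeout=60 * 60):
    """Return (user_sk, session_number) pairs.

    Rows must be '<user_sk>\\t<timestamp>' strings, clustered by user
    and sorted by timestamp ascending.
    """
    current_uid = None
    last_click_time = -1
    session_counter = 1
    out = []
    for line in click_rows:
        user_sk, tstamp_str = line.strip().split("\t")
        tstamp = int(tstamp_str)

        # A new user starts again at session 1.
        if current_uid != user_sk:
            current_uid = user_sk
            session_counter = 1
            last_click_time = tstamp

        # A gap longer than the timeout starts a new session.
        if tstamp - last_click_time > timeout:
            session_counter += 1

        last_click_time = tstamp
        out.append((user_sk, session_counter))
    return out


# Hypothetical sample data: two users, timestamps in seconds.
rows = ["a\t0", "a\t100", "a\t4000", "b\t50", "b\t3651"]
result = sessionize(rows)
print(result)
```

Each row depends on the row before it, which is what makes this loop sequential within a user while remaining independent across users.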

Running on the GPU with Numba

Dask-CUDA clusters use a single process and a single thread per GPU, so the CPU approach of running one sequential function per thread does not carry over to this model. To run serial functions on the thousands of CUDA cores available in a GPU, you can take advantage of the following properties:

The function can be applied independently across keys.
The grouping key has high cardinality (there is a large number of unique keys).

Let’s look at how you can use Numba to transform our sessionization UDF for use on GPUs. Let’s break the algorithm into two parts.

Part 1. Session change flag

With this function, we create session boundaries. The function is embarrassingly parallel, so we can launch one thread for each row of the dataframe.

	# UDF-1
	# make an array for the session change flag
	# for all rows x
	# set the flag if user_sk[x] != user_sk[x-1] or tstamp_inSec[x] - tstamp_inSec[x-1] > 60 minutes
	import numpy as np
	from numba import cuda

	@cuda.jit
	def make_session_change_flag_kernel(
	    user_sk, tstamp_inSec, session_change_flag, time_out
	):
	    gid = cuda.grid(1)
	    if 0 < gid < len(user_sk):
	        if (
	            user_sk[gid] != user_sk[gid - 1]
	            or tstamp_inSec[gid] - tstamp_inSec[gid - 1] > time_out
	        ):
	            session_change_flag[gid] = np.int32(1)
	        else:
	            session_change_flag[gid] = np.int32(0)
	    elif gid == 0:
	        # the first row always starts a session;
	        # threads past the end of the array do nothing
	        session_change_flag[gid] = np.int32(1)

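Because every row only looks at its immediate predecessor, the same flag can be computed on the CPU with shifted array comparisons, which is handy for checking the kernel's output on a small sample. The sketch below uses NumPy; the function name `session_change_flag_reference` and the sample arrays are assumptions for illustration and are not part of the original code.

```python
import numpy as np


def session_change_flag_reference(user_sk, tstamp_in_sec, time_out):
    """CPU reference: 1 where a row starts a new session, else 0."""
    user_sk = np.asarray(user_sk)
    tstamp_in_sec = np.asarray(tstamp_in_sec)

    # Row 0 always starts a session, so begin with all ones.
    flag = np.ones(len(user_sk), dtype=np.int32)
    if len(user_sk) > 1:
        # Compare each row with the row before it.
        new_user = user_sk[1:] != user_sk[:-1]
        timed_out = (tstamp_in_sec[1:] - tstamp_in_sec[:-1]) > time_out
        flag[1:] = (new_user | timed_out).astype(np.int32)
    return flag


# Hypothetical sample: two users, timestamps in seconds.
users = [7, 7, 7, 9, 9]
times = [0, 100, 4000, 50, 3651]
flags = session_change_flag_reference(users, times, 60 * 60)
print(flags)
```

Each element of `flags` is computed independently of the others, which is the property that lets the GPU version assign one thread per row.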

Part 2. Serial logic on a session-level

For session analysis, we have to set a unique session ID across all the rows that belong to a given session for each user. This function is serial at the session level, but because each dataframe contains many sessions (approximately 18 million), we can still take advantage of GPU parallelism by letting a separate thread handle each session.

	# UDF-2:
	# if session change flag
	#   set session id to x
	#   set all other clicks to x until next session_change_flag
	# else
	#   do nothing

	@cuda.jit
	def populate_session_ids_kernel(session_boundary):
	    gid = cuda.grid(1)
	    # don't loop if we're off the edge
	    if gid < len(session_boundary):
	        # if this is a session boundary...
	        if session_boundary[gid] == 1:
	            # this thread marks the start of a session
	            session_boundary[gid] = gid
	            look_ahead = 1

	            # check elements 'forward' of this one until a new session
	            # boundary is found; the bounds check comes first so the last
	            # session never reads past the end of the array
	            while (
	                gid + look_ahead < len(session_boundary)
	                and session_boundary[gid + look_ahead] == 0
	            ):
	                session_boundary[gid + look_ahead] = gid
	                look_ahead += 1

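The result of this kernel is that every row carries the index of the row that started its session. On the CPU, the same result can be expressed as a running maximum over the boundary positions. The sketch below is a NumPy reference for small test inputs, not the GPU implementation; the function name `populate_session_ids_reference` is an assumption for illustration.

```python
import numpy as np


def populate_session_ids_reference(session_change_flag):
    """CPU reference: each row gets the index of its session's first row."""
    flag = np.asarray(session_change_flag)
    positions = np.arange(len(flag))

    # Keep the row index at session boundaries, zero elsewhere.
    starts = np.where(flag == 1, positions, 0)

    # A running maximum carries the latest boundary index forward.
    return np.maximum.accumulate(starts)


# Flags as produced by Part 1: sessions start at rows 0, 2, and 3.
ids = populate_session_ids_reference([1, 0, 1, 1, 0, 0])
print(ids)
```

Comparing this output with the kernel's output on a small partition is a quick way to confirm that the last session in the array is filled in completely.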

Using these functions with a dask-cuDF DataFrame

We invoke the above functions on each partition of a dask-cuDF dataframe using a `map_partitions` call. The main logic of the code above is still similar to pure Python while providing a large speedup. A similar serial function in pure Python takes about `17.7 s` vs. `14.3 ms` on GPUs. That ratio is roughly 1,200x, which comfortably clears the 100x mark.

	import cupy as cp
	from numba import cuda


	def get_session_id(df, time_out):
	    """
	    This function creates a session id column for each click.
	    The session id grows in increments for each user's subsequent session.
	    The session boundary is defined by the time_out.
	    """
	    # Preallocate destination column for Numba
	    df["session_change_flag"] = cp.zeros(len(df), dtype="int32")

	    # _column.data_array_view is a private cuDF attribute that exposes a
	    # column as a device array; it may differ between cuDF versions
	    wcs_user_sk = df["wcs_user_sk"]._column.data_array_view
	    tstamp_inSec = df["tstamp_inSec"]._column.data_array_view
	    session_change_flag = df["session_change_flag"]._column.data_array_view

	    ## configure kernels based on the number of tasks (one thread per row)
	    conf_session_change_flag_kernel = make_session_change_flag_kernel.forall(
	        len(wcs_user_sk)
	    )
	    conf_populate_session_ids_kernel = populate_session_ids_kernel.forall(
	        len(wcs_user_sk)
	    )

	    ## Determine session boundaries
	    conf_session_change_flag_kernel(
	        wcs_user_sk, tstamp_inSec, session_change_flag, time_out
	    )

	    ## Populate session ids
	    conf_populate_session_ids_kernel(session_change_flag)

	    df = df.rename(columns={"session_change_flag": "session_id"})
	    cuda.synchronize()
	    return df


	# dask cudf dataframe call to sessionize
	# after repartitioning along a user key
	df = df.map_partitions(get_session_id, 60 * 60)

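The `forall` calls above let Numba choose the launch configuration. For an explicit launch, the usual pattern is to round the number of blocks up so that every row gets a thread. The sketch below shows that arithmetic; the helper name `launch_config`, the choice of 256 threads per block, and the row count are assumptions for illustration, not values from this post.

```python
import math


def launch_config(n_rows, threads_per_block=256):
    """Return (blocks_per_grid, threads_per_block) covering n_rows threads."""
    # Round up so the final, partially filled block is still launched.
    blocks_per_grid = math.ceil(n_rows / threads_per_block)
    return blocks_per_grid, threads_per_block


n_rows = 18_000_000  # illustrative partition size only
blocks, threads = launch_config(n_rows)

# Threads in the last block that have no row to work on.
idle_threads = blocks * threads - n_rows
print(blocks, threads, idle_threads)

# With Numba, an explicit launch would then look like:
#   make_session_change_flag_kernel[blocks, threads](
#       wcs_user_sk, tstamp_inSec, session_change_flag, time_out
#   )
```

Rounding up usually launches a few more threads than there are rows, which is why both kernels compare `gid` against the array length before touching memory.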

Conclusion

When working with serial user-defined functions that operate on many small groups, transforming the UDF with Numba, as shown above, lets each group be processed in parallel on the GPU. With Numba, Dask, and RAPIDS together, you can supercharge your map-reduce style UDF ETL jobs to run at the speed of light.

Want to start using Dask and RAPIDS to analyze big data at the speed of light? Check out the RAPIDS Getting Started webpage, with links to help you download pre-built Docker containers or install directly via Conda.