DEVELOPER BLOG

Python Pandas Tutorial – Beginner’s Guide to GPU Accelerated DataFrames for Pandas Users

This tutorial is the second part of a series of introductions to the RAPIDS ecosystem. The series explores and discusses various aspects of RAPIDS that allow its users to solve ETL (Extract, Transform, Load) problems, build ML (Machine Learning) and DL (Deep Learning) models, explore expansive graphs, process signals and system logs, or use SQL via BlazingSQL to process data.

As the data science field grew immensely over the last decade, the appetite among data scientists for compute power to handle the data volume has been steadily increasing as well. Around 15 to 20 years ago the most common choices to process data were limited to SAS and SQL-based solutions (like SQL Server, Teradata, PostgreSQL or MySQL), R if you were heavy on statistical methods, or Python if you wanted the versatility of a general-purpose language at the expense of some statistical features. Over the years many solutions were introduced to handle the ever-growing stream of data: Hadoop was released in 2006 and gained popularity over the following years, only to be dethroned by Apache Spark as the go-to tool for processing data in a distributed way.

However, using Spark brought new challenges: for people familiar with pandas and other tools from the PyData ecosystem it meant learning a new API, and for businesses it meant rewriting their code base to run in Spark's distributed environment. Dask bridged this gap by adding distributed support to already existing PyData objects like pandas DataFrames or NumPy arrays, and made it that much easier to harness the full power of the CPU or of a distributed cluster without extensive code rewrites. However, processing large amounts of data in an interactive manner was (and still is) beyond reach for reasonably sized and priced CPU clusters.

RAPIDS changed that landscape in late 2018. Paired with Dask, RAPIDS taps into the enormous parallelism of NVIDIA GPUs to process the data. However, unlike Apache Spark, it does not introduce a new API but provides the very familiar programming interface of tools found in the PyData ecosystem like pandas, scikit-learn, NetworkX and more.

The first post was a Python pandas tutorial where we introduced RAPIDS cuDF, the RAPIDS CUDA DataFrame library for processing large amounts of data on an NVIDIA GPU.

In this tutorial, we will discuss how cuDF is almost a drop-in replacement for pandas. To aid with this we also published a cheat sheet that can be downloaded here: cuDF4pandas-cheatsheet, and an interactive notebook with cuDF and pandas API calls showcased here.

Popular interface

If you work with data using Python you have quite likely been using pandas or NumPy (since pandas builds on top of NumPy). It is hard to overstate the great value these two libraries provide to the PyData ecosystem. pandas has been the bread and butter for almost all data manipulation and massaging and has been seen in countless notebooks published on GitHub, Kaggle, Towards Data Science and more. Chances are, if you are looking for a solution to your data problem using Python, you are likely to be shown examples that employ pandas.

RAPIDS cuDF replicates almost entirely the same API and functionality. While it is not 100% feature-equivalent with pandas, the team at NVIDIA as well as the external contributors work continually on bridging the feature-parity gap.

Moving between these two frameworks is extremely easy. Consider the very simple pandas script below, which reads the data and then performs a simple aggregation.

# pandas
import pandas as pd
df = pd.read_csv('df.csv')
(
    df
    .groupby(by='category')
    .agg({'num': 'sum', 'char': 'count'})
    .reset_index()
)

Let’s peek inside the df DataFrame,

df.head(10)

We get the following output:

Table 1 shows the top 10 rows of the df DataFrame. The table has 3 columns: 'category', which holds a categorical variable; 'num', which holds a numeric value; and 'char', which shows how to store character data in a column. pandas can store strings as well.
Table 1. First 10 rows of the df DataFrame.

The aggregation step will, accurately, provide the final result,

Table 2 shows the aggregated results of the df DataFrame. We learn that there are 3 rows with category equal to 'B', 1 row with category 'C', and 6 rows with category 'D'. In addition, the sums of the numeric column 'num' per category are 88 for category 'B', 42 for category 'C' and 167 for category 'D'.
Table 2. Aggregated results of the df DataFrame.
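
To make the aggregation concrete, here is a minimal sketch on a hand-made 10-row DataFrame. The individual values are hypothetical (the contents of df.csv are not reproduced here); they were picked only so that the per-category counts and sums agree with Table 2. The names toy_df and toy_result are likewise made up for this example.

```python
import pandas as pd

# Hypothetical rows: NOT the contents of df.csv, just values chosen so that
# the aggregates match Table 2 (B: 3 rows / 88, C: 1 row / 42, D: 6 rows / 167).
toy_df = pd.DataFrame({
    'category': ['B', 'B', 'B', 'C', 'D', 'D', 'D', 'D', 'D', 'D'],
    'num': [30, 25, 33, 42, 10, 20, 30, 40, 50, 17],
    'char': list('abcdefghij'),
})

# Same aggregation as in the script above: sum 'num' and count 'char'
# within each category, then turn the group key back into a column.
toy_result = (
    toy_df
    .groupby(by='category')
    .agg({'num': 'sum', 'char': 'count'})
    .reset_index()
)
print(toy_result)
```

The 'num' column of the result holds the per-category sums and the 'char' column holds the per-category row counts.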

With RAPIDS, all you have to do to run the above code on a GPU, and to enjoy interactive querying of data, is change the import statement.

# cudf
import cudf
df = cudf.read_csv('df.csv')
(
    df
    .groupby(by='category')
    .agg({'num': 'sum', 'char': 'count'})
    .reset_index()
)

We, literally, changed one import and are now running on a GPU! 
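
If you want a single script that runs both on machines with a GPU and on machines without one, one option is to pick the module at import time. The sketch below is my assumption about how such a script could be organized, not something prescribed by RAPIDS; the alias xdf, the flag BACKEND and the toy data are invented for this example. It passes sort=True explicitly because, to my knowledge, cuDF does not sort group keys by default while pandas does.

```python
# Prefer cuDF when it can be used; otherwise fall back to pandas.
try:
    import cudf as xdf
    BACKEND = 'cudf'
except Exception:  # cuDF is not installed, or no usable GPU was found
    import pandas as xdf
    BACKEND = 'pandas'

# Hypothetical toy data, built the same way for both libraries
df = xdf.DataFrame({
    'category': ['B', 'D', 'B', 'C', 'D'],
    'num': [1, 2, 3, 4, 5],
    'char': ['x', 'y', 'z', 'x', 'y'],
})

result = (
    df
    .groupby(by='category', sort=True)  # explicit sort for identical ordering
    .agg({'num': 'sum', 'char': 'count'})
    .reset_index()
)

# Copy the (small) result back to host memory when it lives on the GPU
host_result = result.to_pandas() if BACKEND == 'cudf' else result
print(BACKEND)
print(host_result)
```

Everything after the import block is identical for both backends, which is the point the example above makes.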

On a small dataset like this it will take milliseconds to run things with pandas and everybody is happy. How about processing a DataFrame with 100 million records? Let’s simulate a larger dataset. In this example we will use CuPy to speed things up a little.

import cupy as cp
choices = cp.arange(6)
probs = cp.random.rand(6)
probs = probs / probs.sum()  # normalize so the probabilities sum to 1
n = int(100e6)  # 100M
selected = cp.random.choice(choices, n, p=probs)
nums = cp.random.randint(10, 1000, n)
chars = cp.random.randint(65, 80, n)

The NumPy equivalent of the above code would be,

import numpy as np
choices = np.arange(6)
probs = np.random.rand(6)
probs = probs / probs.sum()  # normalize so the probabilities sum to 1
n = int(100e6)  # 100M
selected = np.random.choice(choices, n, p=probs)
nums = np.random.randint(10, 1000, n)
chars = np.random.randint(65, 80, n)

Notice any differences? Yes, only the import statement! And the time: the CuPy version runs in about 1.27 seconds on a Titan RTX while the NumPy version on an i5 CPU takes roughly 3.33 seconds.
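
If you time code like this yourself, keep in mind that CuPy launches GPU kernels asynchronously, so a fair measurement waits for the device to finish before stopping the clock. The sketch below shows one way to do that. It is not necessarily how the numbers above were measured. The alias xp, the flag ON_GPU and the generate helper are names invented for this example, and n is reduced to one million rows so that it finishes quickly. It falls back to NumPy when no CUDA device is usable.

```python
import time

try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    ON_GPU = True
except Exception:  # CuPy is missing or no GPU is available
    import numpy as xp
    ON_GPU = False


def generate(n):
    """Create the three random columns with whichever array module is active."""
    choices = xp.arange(6)
    probs = xp.random.rand(6)
    probs = probs / probs.sum()  # probabilities must sum to 1
    selected = xp.random.choice(choices, n, p=probs)
    nums = xp.random.randint(10, 1000, n)
    chars = xp.random.randint(65, 80, n)
    if ON_GPU:
        # Block until all queued GPU work has actually completed
        xp.cuda.Stream.null.synchronize()
    return selected, nums, chars


n = 1_000_000  # much smaller than the 100M rows used in the text
start = time.perf_counter()
selected, nums, chars = generate(n)
elapsed = time.perf_counter() - start
print(f'{n:,} rows generated in {elapsed:.3f} s (GPU: {ON_GPU})')
```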

We can now use the CuPy or NumPy arrays to create cuDF or pandas DataFrames.

import cudf
df = cudf.DataFrame({
    'category': selected
    , 'num': nums
    , 'char': chars
})
df['category'] = df['category'].astype('category')

and,

import pandas
df = pandas.DataFrame({
    'category': selected
    , 'num': nums
    , 'char': chars
})
df['category'] = df['category'].astype('category')

Times to create these are negligible as both cuDF and pandas simply retrieve pointers to the created CuPy and NumPy arrays. Still, we have so far only changed the import statements. 

The aggregation code is the same as we used earlier, with no changes between the cuDF and pandas DataFrames (ain’t that neat!). However, the execution times are quite different: the cuDF code took on average 68.9 ms +/- 3.8 ms (7 runs, 10 loops each) to finish, while the pandas code took on average 1.37 s +/- 1.25 ms (7 runs, 10 loops each). This equates to a roughly 20x speedup! And did I mention no code changes?!
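
For readers who want to reproduce this kind of measurement outside a notebook, here is a rough sketch using the standard library's timeit module, with 7 runs of 10 loops each to mirror the figures quoted above. It runs on pandas only and on a much smaller, randomly generated DataFrame, so its timings are not comparable with the ones in the text; bench_df and aggregate are hypothetical names. The last lines simply redo the speedup arithmetic from the two reported averages.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # far smaller than the 100M rows used in the text
bench_df = pd.DataFrame({
    'category': rng.integers(0, 6, n),
    'num': rng.integers(10, 1000, n),
    'char': rng.integers(65, 80, n),
})


def aggregate():
    # Same aggregation as in the earlier examples
    return (
        bench_df
        .groupby(by='category')
        .agg({'num': 'sum', 'char': 'count'})
        .reset_index()
    )


# 7 runs, 10 loops each; each entry is the total time of one run
totals = timeit.repeat(aggregate, repeat=7, number=10)
per_loop = [t / 10 for t in totals]
print(f'{np.mean(per_loop) * 1e3:.2f} ms +/- {np.std(per_loop) * 1e3:.2f} ms per loop')

# Speedup implied by the averages reported in the text
pandas_seconds = 1.37
cudf_seconds = 68.9e-3
speedup = pandas_seconds / cudf_seconds
print(f'speedup: {speedup:.1f}x')
```

Dividing 1.37 s by 68.9 ms gives a factor just under 20, which is where the "roughly 20x" comes from.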

Custom kernels

In some instances, moving from pandas to cuDF does require minor code adaptations, namely for custom functions used to transform data.

RAPIDS cuDF, being a GPU library built on top of NVIDIA CUDA, cannot take regular Python code and simply run it on a GPU. Under the hood cuDF uses Numba to convert and compile the Python code into a CUDA kernel. Scared already? Don’t be! No direct knowledge of CUDA is necessary to run your custom transform functions using cuDF.

Consider the following simple example in pandas that calculates a regression line given two features,

# pandas
def pandas_regression(a, b, A_coeff, B_coeff, constant):
    return A_coeff * a + B_coeff * b + constant


pandas_df['output'] = pandas_df.apply(
    lambda row: pandas_regression(
        row['num'], row['float'], A_coeff=0.21, B_coeff=-2.82, constant=3.43
    ),
    axis=1,
)

Pretty straightforward. Achieving the same outcome in cuDF looks a little different, but the code is still very readable and easy to convert.

# cudf
import numpy as np

def cudf_regression(a, b, output, A_coeff, B_coeff, constant):
    for i, (aa, bb) in enumerate(zip(a, b)):
        output[i] = A_coeff * aa + B_coeff * bb + constant


# apply_rows returns a DataFrame that includes the new 'output' column
cudf_df = cudf_df.apply_rows(
    cudf_regression,
    incols={'num': 'a', 'float': 'b'},
    outcols={'output': np.float64},
    kwargs={'A_coeff': 0.21, 'B_coeff': -2.82, 'constant': 3.43},
)

Numba will take the cudf_regression function and compile it to a CUDA kernel. The apply_rows call is equivalent to the apply call in pandas with the axis parameter set to 1, i.e. it iterates over rows rather than columns. Note that in cuDF you also need to specify the data type of the output column so Numba can provide the correct return type signature to the CUDA kernel. Despite these differences, though, the code is still a very close equivalent of the pandas version, differing mostly in the API call: the regression line is calculated in almost the same way,

return A_coeff * a + B_coeff * b + constant

vs.

output[i] = A_coeff * aa + B_coeff * bb + constant

And just like that your pure Python code can be executed on a GPU. 
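
To convince yourself that the two functions really compute the same thing, the kernel-style function can also be called as ordinary Python on NumPy arrays. The sketch below does that on the CPU. It only emulates what the kernel body computes and says nothing about how Numba schedules the work on the GPU. The toy DataFrame, including its 'float' column, is made up for this example. As a third variant it uses plain column arithmetic, which for a simple linear expression like this one gives the same result in pandas; since cuDF columns support the same arithmetic operators, I would expect that form to carry over as well.

```python
import numpy as np
import pandas as pd


def pandas_regression(a, b, A_coeff, B_coeff, constant):
    return A_coeff * a + B_coeff * b + constant


def cudf_regression(a, b, output, A_coeff, B_coeff, constant):
    for i, (aa, bb) in enumerate(zip(a, b)):
        output[i] = A_coeff * aa + B_coeff * bb + constant


# Hypothetical toy data; column names follow the examples above
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'num': rng.integers(10, 1000, 1000),
    'float': rng.random(1000),
})
coeffs = {'A_coeff': 0.21, 'B_coeff': -2.82, 'constant': 3.43}

# 1. Row-wise apply, as in the pandas version
via_apply = toy.apply(
    lambda row: pandas_regression(row['num'], row['float'], **coeffs),
    axis=1,
)

# 2. Kernel style: the caller preallocates the output and the function
#    fills it in place (here on the CPU, with NumPy arrays)
via_kernel = np.empty(len(toy), dtype=np.float64)
cudf_regression(toy['num'].to_numpy(), toy['float'].to_numpy(), via_kernel, **coeffs)

# 3. Plain column arithmetic, no custom function at all
via_columns = 0.21 * toy['num'] - 2.82 * toy['float'] + 3.43

print(np.allclose(via_apply.to_numpy(), via_kernel))
print(np.allclose(via_columns.to_numpy(), via_kernel))
```

Note how the kernel-style function returns nothing and writes into output instead, which is why cuDF needs to be told the output column's data type up front.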

If you use operations on Strings, DateTimes or categorical columns, check the notebooks we prepared to showcase the pandas vs cuDF API calls available and download the cuDF4pandas cheat sheet!