DEVELOPER BLOG

AI / Deep Learning | Data Science |

Beginner’s Guide to Querying Data Using SQL on GPUs in Python

This beginner’s guide, is the third installment of the series of articles on the RAPIDS ecosystem. The series explores and discusses various aspects of RAPIDS that allow its users solve ETL (Extract, Transform, Load) problems, build ML (Machine Learning) and DL (Deep Learning) models, explore expansive graphs, process signal and system log, or use SQL language via BlazingSQL to process data.

Historically speaking, processing large amounts of structured data has been the domain of relational databases. Databases, consisting of tables that can be joined together or aggregated, have been the staple of data processing and are fundamental to the existence of many companies. And just like databases are almost synonymous with storing and processing data, Structured Query Language (SQL) has been synonymous with interacting with these tables.

Multiple flavors of SQL language exist but at the core, they are quite similar. The most common and standardized, the ANSI SQL, is a de facto standard that other tools derive from with some changes. SQL Server, Teradata SQL, or Oracle databases have been known in the enterprise setting while Open Source Software (OSS) solutions like MySQL (or it’s fork MariaBD) and Postgres have garnered a large following and widespread adoption over the years among enthusiasts and companies. And while SQL language adopted by each solution might vary with its own extensions and features, fundamentally they derive from ANSI SQL. And so does BlazingSQL.

The previous tutorials in the series showcased other areas:

  • In the first post, python pandas tutorial  we introduced cuDF, the RAPIDS DataFrame framework for processing large amounts of data on an NVIDIA GPU. 

In this post, we introduce Blazing SQL, a SQL engine that runs on NVIDIA GPUs. To help with getting familiar with BlazingSQL, we also published a cheat sheet that can be downloaded here BlazingSQL-cheat sheet, and an interactive notebook with all the current functionality of BlazingSQL showcased here..

The language

The structured querying language originated at IBM and was initially called SEQUEL: Structured English QUEry Language. The name was later shorted to become SQL (pronounced literally as S-Q-L) but many still pronounce it /ˈsiːkwəl/ (or see-quell) after the original name. 

The original naming was not accidental: just like many popular programming languages, SQL makes its code easily readable and understandable by an English speaking programmer. The language is based on relational algebra that defines tables in terms of algebraic structures and relations, and explains how to run queries on them.

SQL is a set-based declarative language. It was first standardized in 1986 and has since been updated nine times, last in 1999. Fundamentally, it supports all the atomic data types like fixed-length NUMERIC or DECIMAL, integer and floating types, booleans, time (date, timestamp, time and interval), strings, and blobs but many vendor derivatives introduced additional data types like images or audio (via blobs), or spatial to name a few. However, the adherence to the ANSI SQL standard is not enforced and varies between different implementations thus making the SQL language hard to port between databases.


BlazingSQL is not a database but rather a SQL engine that understands the SQL language and can apply it to process data using GPUs. BlazingSQL uses Apache Calcite to parse the query that produces the relational algebra which then gets executed as CUDA kernels on a GPU. BlazingSQL builds on RAPIDS and can query cuDF DataFrames (or dask_cudf DataFrames) stored in the GPU memory, or Parquet, CSV/TSV, JSON, and ORC files (and any format that RAPIDS will eventually support) stored both locally and remotely.

Querying data

As the name suggests, SQL was built with the purpose of querying data. Doing so is, thus, trivial:

SELECT * FROM table

The query selects all the columns and all the rows from the table.

Using BlazingSQL is equally simple. However, before we can start querying our data we need to create a BlazingContext that is an entry point for all things BlazingSQL. The easiest way to do so is to start with default parameters (see the cheat sheet for a more detailed example of starting BlazingContext):

from blazingsql import BlazingContext
bc = BlazingContext()

Now we can create a table so we can query it. The table can be either a cuDF DataFrame, Dask cuDF DataFrame, or a local or remote file. To create a table from cuDF simply pass the DataFrame along with the name of the table:

bc.create_table('my_table', my_cudf_df)

If the data we want to use is on our disk (or on the local network), the call is almost the same but instead of the DataFrame pass the path to the file (or files if the dataset is partitioned)

bc.create_table('my_table', 'path_to_my_file.parquet')

However, if we want to access the data stored in an Amazon S3 bucket, we need to first register the S3 endpoint with the BlazingContext:

bc.s3(
    'bucket'
    , bucket_name = 'bucket'
)

Then, create the table as follows:

bc.create_table('my_remote_table', 's3://bucket/path_to_my_file.csv', header=1)

Having created a table (or tables), we can now start using the engine to do something interesting. Of course, we can do the same thing as in the example above:

df = bc.sql('SELECT * FROM my_table')

The return object is a cuDf DataFrame or a dask_cudf DataFrame if we attached the BlazingContext to a running Dask cluster.

But, since BlazingSQL supports ANSI SQL and much of its functionality, we can do much more than that. For example, the following would be a perfectly normal query:

query = '''
    SELECT A.col_1 AS col_A1
        , B.col_1 AS col_B1
        , SUM(A.col_2) AS col_A2_sum
        , AVG(B.col_2) AS col_B2_avg
    FROM (
        SELECT col_1
            , col_2
        FROM my_table
        WHERE col_3 > 0.93
    ) AS A
    INNER JOIN my_table_2 AS B
        ON A.col_1 = B.col_3
    GROUP BY A.col_1
        , B.col_1
    ORDER BY A.col_1
'''
bc.sql(query)

This query is just slightly more complicated but it is still quite readable: we have sub-query A that pulls data from the my_table table and subsets it to only those rows where col_3 is greater than 0.93.

Next, we have an inner join that:

  • Selects the intersection between these two datasets (one being created on the fly and the my_table_2 table).
  • Aggregates that dataset on col_1 from the dataset aliased by A (the AS A call) and col_1 from the dataset aliased as B, calculating the sum of elements of col_2 from A and col_2 from B.

Finally, let’s not forget that we did sort this data (using ORDER BY call) by col_1 from table A. Now, we have almost limitless possibilities:

  • We can use the returned DataFrame to create another table that can be an input to another query,
  • We can use RAPIDS API to further transform the data or pass it cuML for model building, or we can save the result into a file, just to name a few.

Since BlazingSQL runs on a GPU and GPUs tend to have limited memory, we can drop some of the table to release some of that memory so we can create another table: this is achieved with the following call:

bc.drop_table('my_table')

That’s it!

We have now run through a simple example to showcase how easy it is to use BlazingSQL. If you want to try other functionality of BlazingSQL, further explore the SQL language supported, or simply run through the above examples, go to app.blazingsql.com. To further help, download the handy BlazingSQL cheatsheet