GPU-Accelerated JSON Data Processing with RAPIDS

JSON is a widely adopted text-based format for exchanging information between systems, most commonly in web applications. Although the JSON format is human-readable, it is cumbersome to process with data science and data engineering tools.

To bridge that gap, RAPIDS cuDF provides a GPU-accelerated JSON reader (cudf.read_json) that is efficient and robust for many JSON data structures. The JSON format specifies a general-purpose, tree-like data structure, and cuDF implements algorithms that transform the JSON tree into columnar data.

cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data in Python. When JSON data is structured as columnar data, it gains access to the powerful cuDF DataFrame API. We are excited to open up the possibilities of GPU acceleration to more data formats, projects, and modeling workflows with this reader.

This post covers the supported JSON data layouts, records-oriented JSON and JSON Lines, followed by examples of cuDF reader options for processing JSON Lines files with byte ranges or multiple sources. Finally, you learn how to use the tools in cuDF to flatten list and struct types, and how to apply these tools to assemble DataFrames from common JSON patterns.

Reading JSON data in cuDF

By default, the cuDF JSON reader expects input data using the records orientation. Records-oriented JSON data is composed of an array of objects at the root level, and each object in the array corresponds to a row. The field names in the objects determine the column names of the table.

Another common variant for JSON data is JSON Lines, where JSON objects are separated by new line characters (\n), and each object corresponds to a row.

The following code example shows records-oriented JSON as well as JSON Lines data:

>>> import cudf
>>> j = '''[
... {"a": "v1", "b": 12},
... {"a": "v2", "b": 7},
... {"a": "v3", "b": 5}
... ]'''
>>> df_records = cudf.read_json(j)

>>> j = '\n'.join([
...     '{"a": "v1", "b": 12}',
...     '{"a": "v2", "b": 7}',
...     '{"a": "v3", "b": 5}'
... ])
>>> df_lines = cudf.read_json(j, lines=True)

>>> df_lines
    a   b
0  v1  12
1  v2   7
2  v3   5
>>> df_records.equals(df_lines)
True
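
To make the row-to-column mapping concrete, the following is a minimal CPU-only sketch in plain Python using the standard json module. It is not how cuDF implements its reader; the helper name records_to_columns is hypothetical and only illustrates how field names become column names for both input layouts.

```python
import json

def records_to_columns(rows):
    """Pivot a list of JSON objects (rows) into a dict of column lists."""
    columns = {}
    for i, row in enumerate(rows):
        for name, value in row.items():
            # Backfill None for rows seen before this field first appeared.
            columns.setdefault(name, [None] * i).append(value)
        # Pad columns whose field is absent from this row.
        for values in columns.values():
            if len(values) <= i:
                values.append(None)
    return columns

records_text = '[{"a": "v1", "b": 12}, {"a": "v2", "b": 7}, {"a": "v3", "b": 5}]'
lines_text = '\n'.join([
    '{"a": "v1", "b": 12}',
    '{"a": "v2", "b": 7}',
    '{"a": "v3", "b": 5}',
])

# Records orientation: a single array of objects at the root.
from_records = records_to_columns(json.loads(records_text))
# JSON Lines: one object per newline-separated line.
from_lines = records_to_columns(
    [json.loads(line) for line in lines_text.splitlines()]
)

print(from_records)
print(from_records == from_lines)
```

Both layouts produce the same columns here, which mirrors the equality check on the two cuDF DataFrames.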

The cuDF JSON reader is also compatible with nested JSON objects and arrays, which roughly map to struct and list data types in cuDF.

The following examples demonstrate the inputs and outputs for generating list and struct columns, and columns with data types that are arbitrary combinations of lists and structs.

# example with column types:
# list<int> and struct<k:string>
>>> j = '''[
... {"list": [0, 1, 2], "struct": {"k": "v1"}}, 
... {"list": [3, 4, 5], "struct": {"k": "v2"}}
... ]'''
>>> df = cudf.read_json(j)
>>> df
        list       struct
0  [0, 1, 2]  {'k': 'v1'}
1  [3, 4, 5]  {'k': 'v2'}

# example with column types:
# list<struct<k:int>> and struct<k:list<int>, m:int>
>>> j = '\n'.join([
...     '{"a": [{"k": 0}], "b": {"k": [0, 1], "m": 5}}',
...     '{"a": [{"k": 1}, {"k": 2}], "b": {"k": [2, 3], "m": 6}}',
... ])
>>> df = cudf.read_json(j, lines=True)
>>> df
                      a                      b
0            [{'k': 0}]  {'k': [0, 1], 'm': 5}
1  [{'k': 1}, {'k': 2}]  {'k': [2, 3], 'm': 6}

Handling large and small JSON Lines files

For workloads based on JSON Lines data, cuDF includes reader options to assist with data processing: byte range support for large files and multi-source support for small files.

Byte range support

Some workflows, such as fraud detection and user behavior modeling, require processing large JSON Lines files that may exceed GPU memory capacity.

The JSON reader in cuDF supports a byte range argument that specifies a starting byte offset and size in bytes. The reader parses each record that begins within the byte range, and for this reason, byte ranges do not have to align with record boundaries.

In distributed workflows, byte ranges enable each worker to process a subset of the data. In filtering and aggregation, byte ranges enable a single worker to process the data in chunks.

To avoid skipping rows or reading duplicate rows, byte ranges should be adjacent, as shown in the following example.

>>> import cupy
>>> num_rows = 10
>>> j = '\n'.join([
...     '{"id":%s, "distance": %s, "unit": "m/s"}' % x
...     for x in zip(range(num_rows), cupy.random.rand(num_rows))
... ])
>>> chunk_count = 4
>>> chunk_size = len(j) // chunk_count + 1
>>> data = []
>>> for x in range(chunk_count):
...     d = cudf.read_json(
...         j,        
...         lines=True, 
...         byte_range=(chunk_size * x, chunk_size)
...     )
...     data.append(d)    
>>> df = cudf.concat(data)
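
The rule described above (a record belongs to the range in which it begins) can be modeled on the CPU in a few lines. The following is a sketch in plain Python under that stated rule, not cuDF's implementation; read_lines_in_byte_range is a hypothetical helper, and the data is encoded to bytes first because offsets are counted in bytes rather than characters.

```python
import json

def read_lines_in_byte_range(data: bytes, offset: int, size: int):
    """Return parsed records whose first byte lies in [offset, offset + size)."""
    rows = []
    start = 0  # byte offset where the current record begins
    for line in data.split(b"\n"):
        if line.strip() and offset <= start < offset + size:
            rows.append(json.loads(line))
        start += len(line) + 1  # +1 accounts for the newline separator
    return rows

data = "\n".join(
    json.dumps({"id": i, "distance": i * 0.5}) for i in range(10)
).encode("utf-8")

chunk_count = 4
# Round up so that chunk_count adjacent ranges cover the whole buffer.
chunk_size = len(data) // chunk_count + 1

chunks = [
    read_lines_in_byte_range(data, chunk_size * x, chunk_size)
    for x in range(chunk_count)
]
ids = [row["id"] for chunk in chunks for row in chunk]
print(ids)
```

Because the ranges are adjacent and every record has exactly one starting byte, each record lands in exactly one chunk, even when a range boundary falls in the middle of a record.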

Multi-source support

By contrast, some workflows require processing many small JSON Lines files.

Rather than looping through the sources and concatenating the resulting DataFrames, you can pass a list of data sources to the JSON reader in cuDF. The raw inputs are then processed together as a single source.

The JSON reader in cuDF accepts sources as file paths, raw strings, or file-like objects, as well as lists of these sources.

>>> j1 = '{"id":0}\n{"id":1}\n'
>>> j2 = '{"id":2}\n{"id":3}\n'
>>> df = cudf.read_json([j1, j2], lines=True)
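
For a CPU-side cross-check of a multi-source read, the sources can be combined into one JSON Lines buffer in plain Python. This is only a sketch: combine_sources is a hypothetical helper, and the newline handling is an assumption about what a careful manual merge needs, not a description of cuDF internals.

```python
import json

def combine_sources(sources):
    """Join JSON Lines strings into one buffer with one record per line."""
    parts = []
    for text in sources:
        # Guard against a source without a trailing newline, which would
        # otherwise fuse its last record with the next source's first record.
        parts.append(text if text.endswith("\n") else text + "\n")
    return "".join(parts)

j1 = '{"id":0}\n{"id":1}\n'
j2 = '{"id":2}\n{"id":3}'  # note: no trailing newline
combined = combine_sources([j1, j2])

rows = [json.loads(line) for line in combined.splitlines() if line]
print(rows)
```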

Unpacking list and struct data

After reading JSON data into a cuDF DataFrame with list and struct column types, the next step in many workflows is to extract or flatten the data into simple types.

For struct columns, one solution is extracting the data with the struct.explode accessor and joining the result to the parent DataFrame.

The following code example demonstrates how to extract data from a struct column.

>>> j = '\n'.join([
...     '{"x": "Tokyo", "y": {"country": "Japan", "iso2": "JP"}}',
...     '{"x": "Jakarta", "y": {"country": "Indonesia", "iso2": "ID"}}',
...     '{"x": "Shanghai", "y": {"country": "China", "iso2": "CN"}}'
... ])
>>> df = cudf.read_json(j, lines=True)
>>> df = df.drop(columns='y').join(df['y'].struct.explode())
>>> df
          x    country iso2
0     Tokyo      Japan   JP
1   Jakarta  Indonesia   ID
2  Shanghai      China   CN

For list columns where the order of the elements is meaningful, the list.get accessor extracts the elements from specific positions. The resulting cudf.Series object can then be assigned to a new column in the DataFrame.

The following code example demonstrates how to extract the first and second elements from a list column.

>>> j = '\n'.join([
...     '{"name": "Peabody, MA", "coord": [42.53, -70.98]}',
...     '{"name": "Northampton, MA", "coord": [42.32, -72.66]}',
...     '{"name": "New Bedford, MA", "coord": [41.63, -70.93]}'
... ])
>>> df = cudf.read_json(j, lines=True)
>>> df['latitude'] = df['coord'].list.get(0)
>>> df['longitude'] = df['coord'].list.get(1)
>>> df = df.drop(columns='coord')
>>> df
              name  latitude  longitude
0      Peabody, MA     42.53     -70.98
1  Northampton, MA     42.32     -72.66
2  New Bedford, MA     41.63     -70.93

Finally, for list columns with variable length, the explode method returns a new object with each list element as a row, repeating the parent index for every element. Joining the exploded result to the parent DataFrame yields an output with all simple types.

The following example flattens a list column and joins it to the index and additional data from the parent DataFrame.

>>> j = '\n'.join([
...     '{"product": "socks", "ratings": [2, 3, 4]}',
...     '{"product": "shoes", "ratings": [5, 4, 5, 3]}',
...     '{"product": "shirts", "ratings": [3, 4]}'
... ])
>>> df = cudf.read_json(j, lines=True)
>>> df = df.drop(columns='ratings').join(df['ratings'].explode())
>>> df
  product  ratings
0   socks        2
0   socks        4
0   socks        3
1   shoes        5
1   shoes        5
1   shoes        4
1   shoes        3
2  shirts        3
2  shirts        4
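
The same flattening can be expressed in plain Python as a reference for what the explode-and-join pattern computes. This is a CPU-only sketch with a hypothetical helper, explode_rows, and it is not cuDF's implementation. It keeps list elements in their original order, whereas rows that share an index in the joined cuDF output above appear in a different order.

```python
def explode_rows(rows, list_field):
    """Emit one (parent_index, row) pair per element of row[list_field]."""
    exploded = []
    for index, row in enumerate(rows):
        # Columns other than the list column are copied onto every output row.
        parent = {k: v for k, v in row.items() if k != list_field}
        for item in row[list_field]:
            exploded.append((index, {**parent, list_field: item}))
    return exploded

rows = [
    {"product": "socks", "ratings": [2, 3, 4]},
    {"product": "shoes", "ratings": [5, 4, 5, 3]},
    {"product": "shirts", "ratings": [3, 4]},
]

flat = explode_rows(rows, "ratings")
for index, row in flat:
    print(index, row["product"], row["ratings"])
```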

Building JSON data solutions with cuDF

At times, a workflow must process JSON data with an object root. cuDF provides tools to build solutions for this kind of data. To process JSON data with an object root, we recommend reading the data as a single JSON Line and then unpacking the resulting DataFrame.

The following example reads a JSON object as a single line and then extracts the “results” field into a new DataFrame.

>>> j = '''{
...     "metadata" : {"vehicle":"car"},
...     "results": [
...         {"id": 0, "distance": 1.2},
...         {"id": 1, "distance": 2.4},
...         {"id": 2, "distance": 1.7}
...     ]
... }'''

# first read the JSON object with lines=True
>>> df = cudf.read_json(j, lines=True)
>>> df
             metadata                                            results
0  {'vehicle': 'car'}  [{'id': 0, 'distance': 1.2}, {'id': 1, 'distan...

# then explode the 'results' column
>>> df = df['results'].explode().struct.explode()
>>> df
   id  distance
0   0       1.2
1   1       2.4
2   2       1.7
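
When validating this pattern, it can help to compute the expected table on the CPU with the standard json module. The following sketch does that for the object above and additionally carries the metadata field onto every row, which is an illustrative choice rather than something the cuDF example does.

```python
import json

doc = json.loads('''{
    "metadata": {"vehicle": "car"},
    "results": [
        {"id": 0, "distance": 1.2},
        {"id": 1, "distance": 2.4},
        {"id": 2, "distance": 1.7}
    ]
}''')

# Pivot the nested array of objects into columns.
columns = {}
for row in doc["results"]:
    for name, value in row.items():
        columns.setdefault(name, []).append(value)

# Broadcast the root-level metadata value to every row.
columns["vehicle"] = [doc["metadata"]["vehicle"]] * len(doc["results"])

print(columns)
```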

Key takeaways

The cuDF JSON reader is designed to accelerate a wide range of JSON data workloads, including simple and complex types across both large files and small files.

This post demonstrates common uses of the cuDF JSON reader with records-oriented and JSON Lines data, and showcases byte range and multi-source support. Now you can accelerate the way you work with JSON data and efficiently incorporate JSON data into your workflows.

Apply your knowledge

To get started with RAPIDS cuDF, we encourage you to run our notebook 10 minutes to cuDF by installing RAPIDS on Google Colab, where you can see common DataFrame algorithms and data input and output in action.

For more information about cuDF, see the cuDF documentation or the rapidsai/cudf GitHub repo. For easier testing and deployment, Docker containers are also available for releases and nightly builds.