JSON is a popular text-based data format that enables interoperability between systems in web applications as well as data management. The format has existed since the early 2000s and grew out of the need for communication between web servers and browsers. Standard JSON consists of key-value pairs that can include nested objects. JSON usage has grown for storing web transaction information, and records can contain very large values, sometimes over 1 GB per record. At first glance, parsing and validating JSON does not seem like a task suited to GPU acceleration, because the text format has irregular sizes and no default ordering. However, with JSON's adoption in many enterprise data applications, the need for acceleration has grown.
A Fortune 100 retail company leverages the JSON format to store essential inventory data, including unstructured data related to product categorization and inventory. Processing this clickstream JSON data involved large queries that scan tens of terabytes of JSON in a single Spark workload.
GPU acceleration in production at a retailer
The results on the retailer's production GPU workloads were noticeable: runtime dropped from 16.7 hours to 3.8 hours, delivering a more than 4x speedup and 80% cost savings on GPUs relative to a comparable CPU cluster in their production environment.
The nodes in the cluster are GCP n1-standard-16 instances with a single NVIDIA T4 GPU attached to each node.

The Spark get_json_object function
GPU processing for JSON has been available in the RAPIDS Accelerator for Apache Spark since the 22.02 release, but accelerating it well has posed challenges. In working with the retailer, specific processing of JSON records using Spark's get_json_object function required parsing JSON on the fly within a SQL query. The JSON format allows for embedding objects hierarchically, such as within arrays.

The purpose of the get_json_object function is to extract an object from a JSON record string based on the path provided. Here is a simple example SQL query where the get_json_object function is used to extract nested elements:
SELECT get_json_object('[{"a":"b1"}, {"a":"b2"}]', '$[*].a')
["b1","b2"]
In real-world use cases, the function allows for selecting nested objects inside a JSON record that are relevant for additional processing in an ETL pipeline.
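To make the semantics concrete, here is a toy Python re-implementation of path extraction for the simple cases above. This is an illustrative sketch only: Spark's actual get_json_object supports much richer JSONPath syntax, and the function name here merely mirrors the Spark API.

```python
import json

def get_json_object(json_str, path):
    """Toy approximation of Spark's get_json_object for paths such as
    '$[*].a' or '$.a.b'. Spark's real parser handles far more syntax."""
    node = json.loads(json_str)
    # Tokenize the path: '$[*].a' -> ['[*]', 'a']
    tokens = [t for t in path.lstrip("$").replace("[*]", ".[*]").split(".") if t]
    matches = [node]
    for tok in tokens:
        next_matches = []
        for m in matches:
            if tok == "[*]" and isinstance(m, list):
                next_matches.extend(m)       # wildcard: fan out over the array
            elif isinstance(m, dict) and tok in m:
                next_matches.append(m[tok])  # named field lookup
        matches = next_matches
    if not matches:
        return None
    if len(matches) == 1:
        m = matches[0]
        return m if isinstance(m, str) else json.dumps(m, separators=(",", ":"))
    return json.dumps(matches, separators=(",", ":"))
```

Calling this sketch with the arguments from the SQL example above reproduces the same result, `["b1","b2"]`.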
The challenge of large strings in JSON
In the retailer workload, with a moderately wide output table, a single SQL query could invoke the get_json_object function up to 50 times. The strings themselves were long, averaging multiple KiB each. The combination of frequent calls and long JSON strings put significant memory pressure on the GPU, especially on its L1 cache. The initial GPU implementation parallelized processing per record, with a record-per-thread default. As a result, the L1 cache tried to hold multiple large records at once and ended up thrashing. Additionally, when threads within a thread block diverge, processing slows down because the hardware serializes the divergent paths, disabling the threads not on the active path. The original results for a test 30 TB workload were a 16-hour runtime on a CPU cluster and a 16.7-hour runtime on a comparable GPU cluster.

In diving into the retailer's query and dataset, we observed that the data contained sparse fields inside the JSON objects, meaning the probability that a field would appear in a given record was very low; specifically, over 85% of the fields showed up in less than 0.01% of the records. Because of this sparseness, threads were very likely to diverge early in processing. The ideal scenario is for the threads in a warp to operate on similar data.
From the CUDA programming guide:
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The initial slowness of the frequent get_json_object calls in the retailer's queries on GPU was confirmed to be due to thread divergence, and so the work to optimize this type of processing began.
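To illustrate why data-dependent branches cause warps to diverge, here is a small hypothetical Monte Carlo sketch in Python. It models a record-per-thread mapping and estimates how often a 32-thread warp contains a mix of records that do and do not have a queried field (the function and parameter names are illustrative, not from the actual implementation):

```python
import random

WARP_SIZE = 32

def warp_divergence_rate(field_prob, trials=10_000, seed=0):
    """Estimate the fraction of warps whose 32 threads disagree on a
    data-dependent branch: 'does my record contain the queried field?'
    field_prob is the assumed probability the field appears in a record."""
    rng = random.Random(seed)
    diverged = 0
    for _ in range(trials):
        present = [rng.random() < field_prob for _ in range(WARP_SIZE)]
        if any(present) and not all(present):  # mixed warp: both paths execute
            diverged += 1
    return diverged / trials
```

A single very sparse field rarely splits a warp on its own, but each query probes dozens of fields per record, so the chance that at least one probe splits the warp grows quickly.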
Improving JSON processing on GPUs
To optimize JSON processing of large strings, especially sparse data, we had to improve how work was laid out across a warp to increase the probability that its threads process similar data. We then applied a series of successive optimizations to accelerate the retailer's workload.
To validate our optimizations, we ran single-node benchmarks to demonstrate the effect of each optimization effort. The environment for the local benchmarks was an AMD Ryzen Threadripper PRO 5975WX with 32 cores (CPU) and an NVIDIA RTX A6000 48 GB (GPU). The benchmark data is five columns and 200,000 rows of generated JSON based on an approximation of the retailer's JSON data: about 9.2 GB uncompressed and 6.4 GB stored as Parquet with snappy compression. Note that the compression ratio is very low because much of the data is randomly generated. The number of get_json_object calls varies from one to over 50 per column, and both the data and the paths are complex, with nesting levels over ten deep and paths using array indexes and wildcards.
On a single node with an NVIDIA GPU representative of the retailer's hardware, the benchmarks on the generated large-string JSON data improved from initially being only slightly faster than CPU processing to over 5x faster after a series of optimizations.

The first technique applied was combining multiple queries of the same data within the same warp. If a large record was being queried for different fields in different threads, we could intentionally group those queries into the same warp to reduce thread divergence and cache pressure. This improved our local benchmarks by 3.2x, so we already had a big win.
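As an illustration of the grouping idea (a simplified Python sketch, not the actual CUDA scheduling code), compare a record-per-thread layout with a layout that assigns one (record, path) task per thread, ordered so all queries for a record fall into the same warp:

```python
WARP_SIZE = 32

def distinct_records(warp):
    # How many different records a warp touches; fewer means less cache pressure.
    return len({rec for rec, _ in warp})

def baseline_warps(num_records):
    # Record-per-thread mapping: each thread owns a whole record and loops
    # over every path itself, so one warp touches up to 32 different
    # (potentially multi-KiB) records at once.
    tasks = [(rec, None) for rec in range(num_records)]
    return [tasks[i:i + WARP_SIZE] for i in range(0, len(tasks), WARP_SIZE)]

def grouped_warps(num_records, num_paths):
    # Grouped mapping: one thread per (record, path) pair, laid out record-major
    # so the path queries for a record land in the same warp and share its bytes.
    tasks = [(rec, p) for rec in range(num_records) for p in range(num_paths)]
    return [tasks[i:i + WARP_SIZE] for i in range(0, len(tasks), WARP_SIZE)]
```

With 32 paths per record, every grouped warp works on a single record, while each baseline warp works on 32 different ones.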

However, we noticed thread divergence could still occur across warps when subsequent queries of the same JSON path appeared in the SQL query. That led us to sort the queries lexicographically, which reduced thread divergence even further because queries within the same warp had a higher probability of being similar. Once that optimization was implemented, the local benchmarks improved by an additional 10%, for a 3.6x speedup.
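The intuition behind the sort can be sketched in Python: ordering JSONPath strings lexicographically raises the average shared prefix between neighboring queries, and since neighboring tasks land in the same warp, threads are more likely to walk the same parse branches. The helper names here are illustrative, not part of the actual implementation.

```python
def shared_prefix_len(a, b):
    # Length of the common leading substring of two path strings.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def avg_neighbor_prefix(paths):
    # Average shared prefix between adjacent queries: a rough proxy for
    # how long warp threads stay on a common parse branch.
    pairs = list(zip(paths, paths[1:]))
    return sum(shared_prefix_len(a, b) for a, b in pairs) / len(pairs)

paths = ["$.store.book[0].title", "$.user.id",
         "$.store.book[1].title", "$.user.name"]
# Lexicographic sorting clusters similar paths next to each other.
assert avg_neighbor_prefix(sorted(paths)) > avg_neighbor_prefix(paths)
```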
Lastly, we have started leveraging a more data-parallel tokenizer from the RAPIDS cuDF library to move away from single-character parsing when processing JSON objects. Benchmarks show this optimization yielding another 50% improvement, for an overall 5.6x speedup. The implementation will be released this year.

Key takeaways
Processing large amounts of string data can be a challenge for the GPU, so specialized optimizations are required. The RAPIDS Accelerator for Apache Spark, along with cuDF, has enhanced JSON processing for improved speedups on GPUs.
Getting started with Apache Spark on GPUs
Enterprises can take advantage of the RAPIDS Accelerator for Apache Spark to seamlessly transition existing Spark workloads to NVIDIA GPUs with zero code changes. The RAPIDS Accelerator for Apache Spark leverages GPUs to accelerate processing by combining the power of cuDF with the scale of the Spark distributed computing framework.
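As a minimal sketch of the zero-code-change claim, enabling the accelerator is a Spark configuration change rather than an application change. Assuming the RAPIDS Accelerator jar is already on the cluster's classpath, a PySpark session might be configured like this (exact cluster setup varies by environment):

```python
from pyspark.sql import SparkSession

# Configuration-only sketch: the RAPIDS Accelerator plugin is enabled
# purely through Spark conf; existing SQL/DataFrame code is unchanged.
spark = (
    SparkSession.builder
    .appName("rapids-json-example")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # RAPIDS plugin class
    .config("spark.rapids.sql.enabled", "true")             # turn GPU SQL on
    .getOrCreate()
)

# The same query from earlier runs unchanged; the plugin can place the
# get_json_object expression on the GPU.
spark.sql(
    """SELECT get_json_object('[{"a":"b1"}, {"a":"b2"}]', '$[*].a')"""
).show()
```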
Get hands-on with JSON processing and the RAPIDS Accelerator for Apache Spark with this Colab notebook, and check out this upcoming GTC 2025 session to learn more.
Future work
Additional optimizations for string processing on GPUs are planned, applying techniques similar to those used to accelerate JSON to more expressions and functionality.
To follow along or to help contribute, check out the open-source work for RAPIDS Accelerator for Apache Spark and cuDF.