This post was originally published on the RAPIDS AI blog.
TL;DR: Google famously noted that “speed isn’t just a feature, it’s the feature.” This is true not only for search engines but for all of RAPIDS. In this post, we showcase performance improvements for string processing across cuDF and cuML, which enable acceleration across diverse text processing workflows.
Introduction
In our previous post, we showed basic text preprocessing with RAPIDS. Since then, we have come a long way in speed improvements, memory reductions, and API simplification.
Here is what we’ll cover in this post:
Built-in, Simplified String and Categorical Support
GPU TextVectorizers: Leaner and Meaner
Accelerating Diverse String Workflows
Built-in Support for Strings and Categoricals
Goodbye, cuStrings, nvStrings, and nvCategory! We hardly knew ye. Our first couple of posts about string manipulation on GPUs relied on separate, specialized libraries for working with string data on the device. Those libraries also required significant expertise to integrate with other RAPIDS libraries like cuDF and cuML. Since then, we open-sourced, rearchitected, and migrated those string and text-related features into more user-friendly DataFrame APIs as part of cuDF. In addition, we adopted the Apache Arrow format for cuDF’s string representation, resulting in substantial memory savings and speedups.
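For example, here is a minimal sketch of what built-in string and categorical support looks like today (the data and column names are purely illustrative):

```python
import cudf

# Strings are a first-class dtype in cuDF, stored on the GPU in the
# Apache Arrow string format -- no separate nvStrings objects required.
gdf = cudf.DataFrame({"text": ["The Round Table", "Howard Pyle", "King Arthur"]})

# Familiar pandas-style .str accessor, executed on the device.
gdf["lower"] = gdf["text"].str.lower()
gdf["has_table"] = gdf["text"].str.contains("Table")

# Categorical support is built in as well -- no separate nvCategory needed.
gdf["text_cat"] = gdf["text"].astype("category")
print(gdf)
```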
As a concrete, non-toy example of these improvements, consider our recently updated Gutenberg corpus analysis notebook. Previously we had to (slowly) jump through a few hoops, but no longer!
With our improved Pandas string API coverage, we not only have simpler code, but we also get double the performance: preprocessing that previously took 2.31s now takes 1.05s, pushing our overall speedup over Pandas to 151x.
Check out the comparison of the previous and updated notebooks below.
Previous:
CPU times: user 1.6 s, sys: 708 ms, total: 2.3 s
Wall time: 2.31 s
0 story champions round table
1 written illustrated
2 howard pyle
3 1902 distinguished american artist howard pyle...
4 illustrate legend king arthur knights round
Name: text, dtype: object
Updated:
CPU times: user 816 ms, sys: 240 ms, total: 1.06 s
Wall time: 1.05 s
text author title
0 geological observations south america Charles Darwin Geological Observations On South America
1 charles darwin Charles Darwin Geological Observations On South America
2 editorial note Charles Darwin Geological Observations On South America
3 although respects technical subjects style Charles Darwin Geological Observations On South America
4 darwin journal books reprinted never lose value Charles Darwin Geological Observations On South America
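The embedded notebook code is not reproduced here, but the updated preprocessing boils down to chained calls on cuDF's pandas-style string accessor, roughly like this sketch (the stopword list is an illustrative subset, not the notebook's full list):

```python
import cudf

df = cudf.DataFrame({"text": [
    "Geological Observations on South America.",
    "By Charles Darwin, M.A., F.R.S.",
]})

STOPWORDS = ["on", "by", "the", "a", "an"]  # illustrative subset

# Lower-case, strip punctuation, and drop stopwords, all on the GPU
# through the pandas-style .str accessor.
cleaned = (
    df["text"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)   # remove punctuation
    .str.replace_tokens(STOPWORDS, "")         # remove stopwords
    .str.normalize_spaces()                    # collapse leftover whitespace
)
print(cleaned)
```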
GPU TextVectorizers: Leaner and Meaner
We recently launched the feature_extraction.text subpackage in cuML with Count and TF-IDF vectorizers, kick-starting a series of natural language processing (NLP) transformers on GPUs.
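The vectorizers are designed as drop-in analogues of their scikit-learn counterparts; a minimal sketch with a toy corpus looks like this:

```python
import cudf
from cuml.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = cudf.Series([
    "gpu accelerated text processing",
    "tfidf vectorization on the gpu",
    "count vectorization on the gpu",
])

# Mirrors the scikit-learn API, but operates on cuDF string Series
# and returns device-resident sparse matrices.
counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)
print(counts.shape, tfidf.shape)
```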
In our recent NLP post, we analyzed 5 million COVID-related tweets by first vectorizing them with TF-IDF and then clustering and searching in the vector space (a sketch of that pattern appears below). With our recent improvements (GitHub 2554, 2575, 5666), we have cut both the memory footprint and the runtime of that workflow’s TF-IDF vectorization:
Peak memory usage decreased from 19 GB to 8 GB.
Runtime improved from 26s to 8s, pushing our overall speedup over scikit-learn to 21x.
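As a rough illustration of that vectorize-then-search pattern (toy data, not the tweet corpus; we densify the TF-IDF matrix for simplicity, since sparse input support varies by cuML release):

```python
import cudf
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors

tweets = cudf.Series([
    "covid vaccine trial results announced",
    "new covid cases reported today",
    "gpu computing speeds up data science",
])

# Vectorize on the GPU, then search the TF-IDF space for similar tweets.
X = TfidfVectorizer().fit_transform(tweets).toarray()

nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors(X)
print(indices)  # each row: the tweet itself plus its closest neighbor
```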
All the preceding improvements mean that your TF-IDF work can scale much further.
Stay tuned for the next installment, where we will put all these features through their paces in a specialized NLP benchmark. In the meantime, try RAPIDS in your NLP work on Google Colab or BlazingSQL Notebooks, see our documentation page, and if you see something missing, we welcome feature requests on GitHub!
About Vibhu Jawa
Vibhu Jawa is a software engineer and data scientist on the RAPIDS team at NVIDIA, where his efforts are focused on building GPU-accelerated data science products. Prior to NVIDIA, Vibhu completed his M.S. at Johns Hopkins, where his research focused on natural language processing and building interpretable machine learning models for healthcare.