This post was originally published on the RAPIDS AI blog.
TL;DR: Google famously noted that “speed isn’t just a feature, it’s the feature.” This is true not only for search engines but for all of RAPIDS. In this post, we showcase performance improvements for string processing across cuDF and cuML, which enable acceleration across diverse text processing workflows.
In our previous post, we showed basic text preprocessing with RAPIDS. Since then, we have come a long way in speed improvements, memory reductions, and API simplification.
Here is what we’ll cover in this post:
- Built-in, Simplified String and Categorical Support
- GPU TextVectorizers: Leaner and Meaner
- Accelerating Diverse String Workflows
Built-in Support for Strings and Categoricals
Goodbye, cuStrings, nvStrings, and nvCategory! We hardly knew ye. Our first couple of posts about string manipulation on GPUs involved separate, specialized libraries for working with string data on the device. Integrating those libraries with other RAPIDS libraries like cuDF and cuML also required significant expertise. Since then, we have open-sourced, rearchitected, and migrated those string and text-related features into more user-friendly DataFrame APIs as part of cuDF. In addition, we adopted the Apache Arrow format for cuDF’s string representation, resulting in substantial memory savings and speedups.
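To make the Arrow point concrete, here is a minimal plain-Python sketch of an Arrow-style string column: a single contiguous UTF-8 byte buffer plus an offsets array. The function names are hypothetical, the validity (null) bitmap is omitted, and this is an illustration of the layout, not cuDF's implementation.

```python
def to_arrow_layout(strings):
    """Pack strings into one byte buffer plus len(strings) + 1 offsets."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))  # end of this row == start of the next
    return bytes(data), offsets


def get_string(data, offsets, i):
    """Row i occupies bytes offsets[i]:offsets[i + 1] of the shared buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


words = ["goodbye", "", "nvStrings", "cuDF"]
data, offsets = to_arrow_layout(words)
print(offsets)
print(get_string(data, offsets, 2))
```

Because every row lives in the same buffer, there is no per-string object overhead, and a whole column can be moved or scanned as two flat arrays.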
Old Categorization Using
Updated Categorization Using Inbuilt Categorical
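The idea behind a categorical column can be sketched in a few lines of plain Python: store each distinct value once and replace the rows with small integer codes. This is only an illustration with a hypothetical helper name; in a pandas-style API the equivalent is typically a single `astype("category")` call on the column.

```python
def encode_categorical(values):
    """Return (categories, codes): sorted unique values and per-row indices."""
    categories = sorted(set(values))
    index = {cat: code for code, cat in enumerate(categories)}
    codes = [index[v] for v in values]
    return categories, codes


authors = ["austen", "dickens", "austen", "twain", "dickens"]
categories, codes = encode_categorical(authors)

# Decoding is a simple lookup from code back to category.
decoded = [categories[c] for c in codes]
print(categories, codes)
```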
As a concrete, non-toy example of these improvements, consider our recently updated Gutenberg corpus analysis notebook. Previously we had to (slowly) jump through a few hoops, but no longer!
With our improved Pandas string API coverage, we not only have simpler code but also get roughly double the performance: the run time dropped from 2.31s to 1.05s, pushing our overall speedup against Pandas to 151x.
Check out the comparison between the previous and updated notebooks below.
GPU TextVectorizers: Leaner and Meaner
We recently launched the feature_extraction.text subpackage in cuML by adding Count and TF-IDF vectorizers, kick-starting a series of natural language processing (NLP) transformers on GPUs.
Since then, we have added a hashing vectorizer (20x faster than scikit-learn) and improved our existing Count/TF-IDF vectorizers, making them 3.3x faster and cutting their memory use by 2x.
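For readers new to the hashing trick, the following plain-Python sketch shows the core idea: each token is hashed directly to a column index, so no vocabulary has to be built or stored. It uses `zlib.crc32` and a tiny `n_features` purely as stand-ins for illustration; it is not the hash function or default size used by cuML or scikit-learn.

```python
import zlib


def hashing_vectorize(doc, n_features=16):
    """Count tokens into a fixed-width row using hash(token) % n_features."""
    row = [0] * n_features
    for token in doc.lower().split():
        col = zlib.crc32(token.encode("utf-8")) % n_features
        row[col] += 1  # colliding tokens share a column
    return row


a = hashing_vectorize("speed is the feature")
b = hashing_vectorize("the feature is speed")
print(a)
```

The output width is fixed up front and word order does not matter, which is why the two example sentences map to the same row.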
In our recent NLP post, we analyzed 5 million COVID-related tweets by first vectorizing them using TF-IDF and then clustering and searching in the vector space. With our recent improvements (GitHub 2554, 2575, 5666), we have reduced both the memory usage and the run time of the TF-IDF vectorization in that workflow.
- Peak memory usage decreased from 19 GB to 8 GB.
- Run time improved from 26s to 8s, pushing our overall speedup to 21x over scikit-learn.
All the preceding improvements mean that your TF-IDF work can scale much further.
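As a reminder of what TF-IDF vectorization computes, here is a small plain-Python sketch. It assumes whitespace tokenization, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2-normalized rows; these choices and the function name are assumptions for illustration, not a description of the cuML internals.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["covid vaccine news", "covid cases rise", "vaccine trial news"]
rows = tfidf(docs)
print(rows[1])
```

Terms that appear in fewer documents (here "cases" and "rise") receive a higher weight than terms shared across documents (here "covid").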
Scale-out TF-IDF across multiple machines
You can also scale your TF-IDF workflow to multiple GPUs and machines using cuML’s distributed TF-IDF Transformer. The transformer gives you a distributed vectorized matrix, which can be used with distributed machine learning models like cuml.dask.naive_bayes to get end-to-end acceleration across machines.
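One reason TF-IDF distributes well is that document frequencies are additive: each partition can count locally, and the partial counts can be summed before computing a single global IDF. The sketch below illustrates that idea in plain Python with hypothetical helper names and an assumed smoothed IDF formula; it is not taken from the cuML distributed implementation.

```python
import math
from collections import Counter


def partial_df(partition):
    """Per-partition document frequencies and document count."""
    df = Counter()
    for doc in partition:
        df.update(set(doc.lower().split()))  # count each term once per doc
    return df, len(partition)


def global_idf(partials):
    """Merge partial counts, then compute one smoothed IDF per term."""
    total_df = Counter()
    total_docs = 0
    for df, n_docs in partials:
        total_df.update(df)
        total_docs += n_docs
    return {t: math.log((1 + total_docs) / (1 + c)) + 1
            for t, c in total_df.items()}


partitions = [["covid vaccine news", "covid cases rise"],
              ["vaccine trial news"]]
idf_distributed = global_idf([partial_df(p) for p in partitions])

# Same corpus processed as a single partition, for comparison.
all_docs = [doc for p in partitions for doc in p]
idf_single = global_idf([partial_df(all_docs)])
print(idf_distributed == idf_single)
```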
Accelerating Diverse String Workflows
We are adding more string functionality, such as character_tokenize, character_ngrams, ngram_tokenize, filter_tokens, and filter_alphanum, as well as higher-level text-processing APIs like a GPU-accelerated BERT tokenizer and text vectorizers. Together, these help enable the more complex string and text manipulation logic found in real-world NLP applications.
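To give a feel for what some of these operations produce, here are plain-Python stand-ins for four of them. The behavior is inferred from the function names, and the parameters (`n`, `separator`, `min_token_length`) are assumptions for illustration; the GPU versions operate on whole string columns at once, and their exact signatures should be checked in the cuDF documentation.

```python
def character_tokenize(text):
    """Split a string into its individual characters."""
    return list(text)


def character_ngrams(text, n=2):
    """Sliding windows of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_tokenize(text, n=2, separator="_"):
    """Whitespace-tokenize, then join every n consecutive tokens."""
    tokens = text.split()
    return [separator.join(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)]


def filter_tokens(text, min_token_length=3):
    """Drop whitespace-delimited tokens shorter than min_token_length."""
    return " ".join(t for t in text.split() if len(t) >= min_token_length)


sentence = "gpu text processing is fast"
print(character_ngrams("rapids", 3))
print(ngram_tokenize(sentence))
print(filter_tokens(sentence))
```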
In the next installment, we will put all these features through their paces in a specialized NLP benchmark. In the meantime, try RAPIDS in your NLP work on Google Colab or BlazingSQL notebooks and see our documentation. If you see something missing, we welcome feature requests on GitHub!