Developer Blog


NLP and Text Processing with RAPIDS: Now Simpler and Faster

This post was originally published on the RAPIDS AI Blog.

TL;DR: Google famously noted that “speed isn’t just a feature, it’s the feature.” This is not only true for search engines but for all of RAPIDS. In this post, we showcase performance improvements for string processing across cuDF and cuML, which enable acceleration across diverse text processing workflows.

Introduction

In our previous post, we showed basic text preprocessing with RAPIDS. Since then, we have come a long way in speed improvements, memory reductions, and API simplification.

Here is what we’ll cover in this post:

  • Built-in, Simplified String and Categorical Support
  • GPU TextVectorizers: Leaner and Meaner
  • Accelerating Diverse String Workflows

Built-in Support for Strings and Categoricals

Goodbye, cuStrings, nvStrings, and nvCategory! We hardly knew ye. Our first couple of posts about string manipulation on GPUs relied on separate, specialized libraries for working with string data on the device, which required significant expertise to integrate with other RAPIDS libraries like cuDF and cuML. Since then, we have open-sourced, rearchitected, and migrated those string and text-related features into more user-friendly DataFrame APIs as part of cuDF. In addition, we adopted the Apache Arrow format for cuDF’s string representation, resulting in substantial memory savings and speedups.

Old Categorization Using nvcategory

import numpy as np
import rmm
import nvcategory
import nvstrings
import cudf

# Copy the strings to the device, build a category, then extract the codes
nvs_s = nvstrings.to_device(['dog', 'fish', 'cat'])
nvcat_s = nvcategory.from_strings(nvs_s)
values_d_ar = rmm.device_array(shape=nvcat_s.size(), dtype=np.int32)
nvcat_s.values(devptr=values_d_ar.device_ctypes_pointer.value)
cudf.Series(values_d_ar)
0    1
1    2
2    0
dtype: int32
Old categorization using nvcategory.

Updated Categorization Using the Built-in Categorical dtype

import cudf
s = cudf.Series(['dog', 'fish', 'cat'])
s.astype('category').cat.codes
0    1
1    2
2    0
dtype: uint8
Updated categorization with built-in categorical support.
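Because cuDF’s categorical API mirrors pandas, the same two lines run unchanged on CPU. A small reference sketch using pandas (only the dtype of the codes differs):

```python
import pandas as pd

# Same API shape as the cuDF snippet above, running on CPU with pandas.
# Categories are sorted lexicographically: cat=0, dog=1, fish=2.
s = pd.Series(['dog', 'fish', 'cat'])
codes = s.astype('category').cat.codes
print(codes.tolist())  # [1, 2, 0]
```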

Example workflow

As a concrete, non-toy example of these improvements, consider our recently updated Gutenberg corpus analysis notebook. Previously we had to (slowly) jump through a few hoops, but no longer!

With our improved Pandas string API coverage, we not only get simpler code but also double the performance: preprocessing previously took 2.31s and now takes only 1.05s, pushing our overall speedup over Pandas to 151x.

Check out the comparison between the previous versus updated notebooks below.

Previous:

STOPWORDS = nltk.corpus.stopwords.words('english')
filters = ['!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/', '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\t', '\n', "'", ",", '~', '—']

def preprocess_text(input_strs, filters=None, stopwords=STOPWORDS):
    """
    * filter punctuation
    * to_lower
    * remove stop words (from nltk corpus)
    * replace multiple spaces with one
    * remove leading spaces
    """
    # filter punctuation and case conversion
    input_strs = input_strs.str.replace_multi(filters, ' ', regex=False)
    input_strs = input_strs.str.lower()
    # remove stopwords
    stopwords_gpu = nvstrings.to_device(stopwords)
    input_strs = nvtext.replace_tokens(input_strs.data, stopwords_gpu, ' ')
    input_strs = cudf.Series(input_strs)
    # replace multiple spaces with a single one and strip leading/trailing spaces
    input_strs = input_strs.str.replace(r"\s+", ' ', regex=True)
    input_strs = input_strs.str.strip(' ')
    return input_strs

def preprocess_text_df(df, text_cols=['text'], **kwargs):
    for col in text_cols:
        df[col] = preprocess_text(df[col], **kwargs)
    return df

%time df = preprocess_text_df(df, filters=filters)
df['text'].head(5).to_pandas()

CPU times: user 1.6 s, sys: 708 ms, total: 2.3 s Wall time: 2.31 s

0                          story champions round table
1                                  written illustrated
2                                          howard pyle
3    1902 distinguished american artist howard pyle...
4          illustrate legend king arthur knights round
Name: text, dtype: object
Pre-Processing using nvtext+nvstrings

Update:

STOPWORDS = nltk.corpus.stopwords.words('english')
filters = ['!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/', '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\t', '\n', "'", ",", '~', '—']

def preprocess_text(input_strs, filters=None, stopwords=STOPWORDS):
    """
    * filter punctuation
    * to_lower
    * remove stop words (from nltk corpus)
    * replace multiple spaces with one
    * remove leading spaces
    """
    # filter punctuation and case conversion
    translation_table = {ord(char): ord(' ') for char in filters}
    input_strs = input_strs.str.translate(translation_table)
    input_strs = input_strs.str.lower()
    # remove stopwords
    stopwords_gpu = cudf.Series(stopwords)
    input_strs = input_strs.str.replace_tokens(stopwords_gpu, ' ')
    # replace multiple spaces with a single one and strip leading/trailing spaces
    input_strs = input_strs.str.normalize_spaces()
    input_strs = input_strs.str.strip(' ')
    return input_strs

def preprocess_text_df(df, text_cols=['text'], **kwargs):
    for col in text_cols:
        df[col] = preprocess_text(df[col], **kwargs)
    return df

%time df = preprocess_text_df(df, filters=filters)
df.head(5)

CPU times: user 816 ms, sys: 240 ms, total: 1.06 s Wall time: 1.05 s

text	author	title
0	geological observations south america	Charles Darwin	Geological Observations On South America
1	charles darwin	Charles Darwin	Geological Observations On South America
2	editorial note	Charles Darwin	Geological Observations On South America
3	although respects technical subjects style	Charles Darwin	Geological Observations On South America
4	darwin journal books reprinted never lose value	Charles Darwin	Geological Observations On South America
Updated Pre-Processing with the latest API
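Because the updated pipeline sticks to the pandas-style string accessor, most of it ports straight to CPU pandas. A minimal sketch with an abbreviated filter list, a short stopword list, and a sample sentence of our own; the stopword step uses a word-boundary regex, since `replace_tokens` and `normalize_spaces` are cuDF-specific additions:

```python
import pandas as pd

filters = ['!', '"', ',', '.', ';', ':']   # abbreviated filter list
stopwords = ['the', 'of']                  # abbreviated stopword list

s = pd.Series(["The Story, of the Champions; of the Round Table!"])

# filter punctuation and case conversion -- the same str.translate call as on GPU
translation_table = {ord(ch): ord(' ') for ch in filters}
s = s.str.translate(translation_table).str.lower()
# pandas has no replace_tokens; a word-boundary regex stands in for it
s = s.str.replace(r'\b(?:' + '|'.join(stopwords) + r')\b', ' ', regex=True)
# normalize_spaces equivalent: collapse whitespace runs, then strip the ends
s = s.str.replace(r'\s+', ' ', regex=True).str.strip()
print(s.iloc[0])  # story champions round table
```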

GPU TextVectorizers: Leaner and Meaner

We recently launched the feature_extraction.text subpackage in cuML by adding Count and TF-IDF vectorizers, kick-starting a series of natural language processing (NLP) transformers on GPUs.

Since then, we have added hashing vectorizer (20x faster than scikit-learn) and improved our existing Count/TF-IDF vectorizer performance by 3.3x and memory by 2x.

[Figure: Hashing Vectorizer speedup vs. scikit-learn]
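cuML’s vectorizers follow the scikit-learn API, so a CPU sketch using scikit-learn ports to GPU by swapping the import for cuml.feature_extraction.text (the sample documents below are ours):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "gpu accelerated text processing",
    "hashing vectorizer needs no vocabulary",
    "text processing at scale",
]
# Hashing maps tokens straight to column indices: no fit step, no stored
# vocabulary, and a fixed memory footprint regardless of corpus size.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vec.transform(docs)
print(X.shape)  # (3, 1024)
```

The absence of a learned vocabulary is exactly what makes the hashing approach so fast on GPUs: the transform is a stateless, embarrassingly parallel map over tokens.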

In our recent NLP post, we analyzed 5 million COVID-related tweets by first vectorizing them using TF-IDF and then clustering and searching in the vector space. With our recent improvements (GitHub 2554, 2575, 5666), we have improved the TF-IDF vectorization in that workflow on both the memory and runtime fronts:

  • Peak memory usage decreased from 19 GB to 8 GB.
  • Run time improved from 26s to 8s, pushing our overall speedup over scikit-learn to 21x.

All the preceding improvements mean that your TF-IDF work can scale much further.
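The vectorize-then-search pattern from the tweets workflow can be sketched on CPU; cuML’s TfidfVectorizer exposes the same scikit-learn interface, so the GPU version differs only in the import (the toy tweets below are ours):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "vaccine trial shows promising results",
    "new covid vaccine trial announced",
    "stock market closes higher today",
]
# Fit the vocabulary and IDF weights, then project the corpus into TF-IDF space
vec = TfidfVectorizer()
X = vec.fit_transform(tweets)
# Search: embed a query with the same vectorizer and rank by cosine similarity
sims = cosine_similarity(vec.transform(["covid vaccine trial"]), X).ravel()
print(sims.argmax())  # 1 -- the covid vaccine trial tweet
```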

Scale-out TF-IDF across multiple machines

You can also scale your TF-IDF workflow to multiple GPUs and machines using cuML’s distributed TF-IDF transformer. The transformer gives you a distributed vectorized matrix, which can be used with distributed machine learning models like cuml.dask.naive_bayes to get end-to-end acceleration across machines.

Accelerating diverse string workflows

We are adding more string functionality, like character_tokenize, character_ngrams, ngram_tokenize, filter_tokens, and filter_alphanum, as well as higher-level text-processing APIs like a GPU-accelerated BERT tokenizer and text vectorizers, helping enable the more complex string and text manipulation logic you find in real-world NLP applications.
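To make the semantics concrete, here is a plain-Python illustration of what character n-grams are; cuDF’s character_ngrams produces the same kind of output, but operates over an entire Series at once on the GPU:

```python
def character_ngrams(text: str, n: int = 2):
    """Return the list of contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Trigrams slide a 3-character window across the string
print(character_ngrams("rapids", 3))  # ['rap', 'api', 'pid', 'ids']
```

Character n-grams like these are a common fallback for noisy text (typos, hashtags, OOV words) where word-level tokenization breaks down.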

Stay tuned for the next installment, where we will put all these features through their paces in a specialized NLP benchmark. In the meantime, try RAPIDS in your NLP work on Google Colab or BlazingSQL notebooks, check out our documentation page, and if you see something missing, we welcome feature requests on GitHub!