Run State of the Art NLP Workloads at Scale with RAPIDS, HuggingFace, and Dask

This post was originally published on the RAPIDS AI Blog.

TLDR: Learn how to use RAPIDS, HuggingFace, and Dask for high-performance NLP. See how to build end-to-end NLP pipelines in a fast and scalable way on GPUs. This covers feature engineering, deep learning inference, and post-inference processing.


Modern natural language processing (NLP) mixes modeling, feature engineering, and general text processing. Deep learning NLP models can provide fantastic performance for tasks like named-entity recognition (NER), sentiment classification, and text summarization. However, end-to-end workflow pipelines with these models often struggle with performance at scale, especially when the pipelines involve extensive pre- and post-inference processing.

In our previous blog post, we covered how RAPIDS accelerates string processing and feature engineering. This post explains how to leverage RAPIDS for feature engineering and string processing, HuggingFace for deep learning inference, and Dask for scaling out for end-to-end acceleration on GPUs.

An NLP pipeline often involves the following steps:

  • Pre-processing
  • Tokenization
  • Inference
  • Post-inference processing

Figure 1: NLP workflow using RAPIDS and HuggingFace.


Pre-processing for NLP pipelines involves data ingestion, filtering, and general reformatting. With the RAPIDS ecosystem, each piece of the workflow is accelerated on GPUs. Check out our recent blog where we showcased these capabilities in more detail.

Once we have pre-processed our data, we need to tokenize it so that the appropriate machine learning model can ingest it.

Subword Tokenization:

Tokenization is the process of breaking down text into standard units that a machine can understand. It is a fundamental step across NLP methods, from traditional ones like CountVectorizer to advanced deep learning methods like Transformers.

One approach to tokenization is breaking a sentence into words. For example, the sentence “I love apples” can be broken down into “I,” “love,” and “apples.” But this delimiter-based tokenization runs into problems such as:

  • Needing a large vocabulary as you will need to store all words in the dictionary.
  • Ambiguity in combined words like “check-in”: what exactly constitutes a word is often unclear.
  • Some languages don’t segment by spaces.

To solve these problems, we use subword tokenization. Subword tokenization is a recent strategy from machine translation that breaks words into subword units, strings of characters like “ing,” “any,” and “place.” For example, the word “anyplace” can be broken down into “any” and “place,” so you don’t need an entry for each word in your vocabulary.

When BERT (Bidirectional Encoder Representations from Transformers) was released in 2018, it used a subword algorithm called WordPiece. This tokenization is used to create input for NLP DL models like BERT, ELECTRA, DistilBERT, and more.
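To make the idea concrete, here is a minimal CPU sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to a single word. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical and only for illustration; a real BERT vocabulary has tens of thousands of entries, and this is not the GPU implementation discussed in this post.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword units.

    Pieces that do not start the word carry the "##" continuation prefix.
    If any part of the word cannot be matched, the whole word maps to unk_token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"any", "##place", "un", "##ary", "play", "##ing"}

print(wordpiece_tokenize("anyplace", toy_vocab))  # ['any', '##place']
print(wordpiece_tokenize("unary", toy_vocab))     # ['un', '##ary']
print(wordpiece_tokenize("zebra", toy_vocab))     # ['[UNK]']
```

With only six entries, the toy vocabulary covers “anyplace,” “unary,” and “playing” without storing any of them as whole words, which is the vocabulary saving described above.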

GPU Subword Tokenization

We first introduced the GPU BERT subword tokenizer in a previous blog as part of CLX for cybersecurity applications. Since then, we migrated the implementation into RAPIDS cuDF and exposed it as a string function, subword_tokenize, making it easier to use in typical DataFrame workflows.

This tokenizer takes a series of strings and returns tokenized CuPy arrays:

import cudf
import cupy as cp


def tokenize_text_series(text_ser, seq_len, stride, vocab_hash_file):
    """
    This function tokenizes a text series using the bert subword_tokenizer and vocab-hash

    Parameters
    ----------
    text_ser: Text Series to tokenize
    seq_len: Sequence Length to use (We add two special tokens for ner classification job)
    stride : Stride for the tokenizer
    vocab_hash_file: vocab_hash_file to use (Created using `` with compact flag)

    Returns
    -------
    A dictionary with these keys {'token_ar':,'attention_ar':,'metadata':}
    """
    if len(text_ser) == 0:
        return {"token_ar": None, "attention_ar": None, "metadata": None}

    max_rows_tensor = len(text_ser) * 2
    max_length = seq_len - 2

    # NOTE: the argument list is missing from this listing. The call below is
    # reconstructed from the variables defined above and assumes the cuDF
    # signature available when this post was written; the do_lower and
    # do_truncate values are assumptions.
    tokens, attention_masks, metadata = text_ser.str.subword_tokenize(
        vocab_hash_file,
        do_lower=False,
        max_rows_tensor=max_rows_tensor,
        stride=stride,
        max_length=max_length,
        do_truncate=False,
    )
    ### reshape metadata into a matrix
    metadata = metadata.reshape(-1, 3)

    tokens = tokens.reshape(-1, max_length)
    output_rows = tokens.shape[0]
    padded_tokens = cp.zeros(shape=(output_rows, seq_len), dtype=cp.uint32)
    # Copy the tokens into the middle columns, leaving room for [CLS] and [SEP]
    padded_tokens[:, 1:-1] = tokens
    # Mark sequence start with the [CLS] token
    padded_tokens[:, 0] = 101
    # Mark end of sequence with [SEP], placed right after the last non-zero token
    seq_end_col = padded_tokens.shape[1] - (padded_tokens[:, ::-1] != 0).argmax(1)
    padded_tokens[cp.arange(padded_tokens.shape[0]), seq_end_col] = 102
    del tokens

    ## Attention mask
    attention_masks = attention_masks.reshape(-1, max_length)
    padded_attention_mask = cp.zeros(shape=(output_rows, seq_len), dtype=cp.uint32)
    padded_attention_mask[:, 1:-1] = attention_masks
    # Mark sequence start with 1
    padded_attention_mask[:, 0] = 1
    # Mark sequence end with 1
    padded_attention_mask[cp.arange(padded_attention_mask.shape[0]), seq_end_col] = 1

    return {
        "token_ar": padded_tokens,
        "attention_ar": padded_attention_mask,
        "metadata": metadata,
    }


# The third entry is missing from this listing; 'unary' is inferred from the
# third row of the output below ('un', '##ary').
example_data = cudf.Series(['First sequence',
                            'Second sequence',
                            'unary'])
### wget
### Created using python3 --vocab 'vocab.txt' --output 'vocab-hash.txt' --compact
d = tokenize_text_series(example_data, 5, 2, './vocab-hash.txt')
# create_vocab_table is not defined in this listing
vocab2int, int2vocab = create_vocab_table('./vocab.txt')

# Output: token ids, followed by the same rows mapped back to vocabulary entries
# [[ 101 1752 4954  102    0]
#  [ 101 2307 4954  102    0]
#  [ 101 8362 3113  102    0]]
# [['[CLS]' 'First' 'sequence' '[SEP]' '[PAD]']
#  ['[CLS]' 'Second' 'sequence' '[SEP]' '[PAD]']
#  ['[CLS]' 'un' '##ary' '[SEP]' '[PAD]']]

Example of using: cudf.str.subword_tokenize
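The padding step in the function above can be easier to follow on a single row. The following is a minimal pure-Python sketch of the same idea: reserve one slot at each end, write [CLS] (id 101) at the start, write [SEP] (id 102) right after the last non-padding token, and set the attention mask to 1 up to and including [SEP]. The helper name `pad_with_special_tokens` is hypothetical, and the sketch assumes that id 0 is the padding token, as in the output shown above.

```python
CLS_ID = 101  # [CLS] id, as used in the function above
SEP_ID = 102  # [SEP] id, as used in the function above
PAD_ID = 0    # assumed padding id, matching [PAD] in the output above


def pad_with_special_tokens(row, seq_len):
    """Wrap one zero-padded row of token ids with [CLS] and [SEP].

    `row` must hold exactly seq_len - 2 token ids.
    Returns (token_ids, attention_mask), both lists of length seq_len.
    """
    if len(row) != seq_len - 2:
        raise ValueError("row must hold exactly seq_len - 2 token ids")

    # One extra slot at the front for [CLS] and one at the back for [SEP].
    padded = [CLS_ID] + list(row) + [PAD_ID]

    # [SEP] goes right after the last non-padding token.
    last_real = max(i for i, tok in enumerate(padded) if tok != PAD_ID)
    sep_pos = last_real + 1
    padded[sep_pos] = SEP_ID

    # Attend to everything up to and including [SEP]; ignore padding.
    mask = [1 if i <= sep_pos else 0 for i in range(seq_len)]
    return padded, mask


tokens, mask = pad_with_special_tokens([1752, 4954, 0], seq_len=5)
print(tokens)  # [101, 1752, 4954, 102, 0]
print(mask)    # [1, 1, 1, 1, 0]
```

The first printed row matches the first row of the GPU output above. The cuDF version does the same work for all rows at once with array indexing instead of a Python loop.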

Advantages of cuDF’s GPU subword Tokenizer:

The advantages of using cudf.str.subword_tokenize include:

  • The tokenizer itself is up to 483x faster than HuggingFace’s fast Rust tokenizer, BertTokenizerFast.batch_encode_plus.
  • Tokens are extracted and kept in GPU memory and then used in subsequent tensors, all without leaving GPUs and avoiding expensive CPU copies.

Once our inputs are tokenized using the subword tokenizer, they can be fed into NLP DL models like BERT for inference.

HuggingFace Overview:

HuggingFace provides access to several pre-trained transformer model architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with more than 32 pre-trained models in 100+ languages. In our workflow, we used BERT and DistilBERT from HuggingFace to do named entity recognition.

Example of NER in action from

Combining RAPIDS, HuggingFace, and Dask:

This section covers how we put RAPIDS, HuggingFace, and Dask together to achieve 5x better performance than the leading Apache Spark and OpenNLP implementation of a TPCx-BB query 27 equivalent pipeline at the 10 TB scale factor. We ran on 136 V100 GPUs while using a near state-of-the-art NER model. We expect to see even better results with A100, as A100’s BERT inference speed is up to 6x faster than V100’s.

In this workflow, we are given 26 million synthetic reviews, and the task is to find the competitor company names in the product reviews for a given product. We then return the review ID, product ID, competitor company name, and the related sentence from the online review. To get a competitor’s name, we need to do NER on the reviews and find all the tokens in the review labeled as an organization.
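As a rough illustration of the post-inference step, the sketch below merges subword tokens labeled as an organization back into company names. It runs on plain Python lists rather than GPU arrays, the function name `extract_organizations` is hypothetical, and the `B-ORG` / `I-ORG` label names are an assumption based on the usual CoNLL 2003 tagging scheme. It is not the workflow's actual implementation.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def extract_organizations(tokens, labels):
    """Merge WordPiece tokens tagged B-ORG / I-ORG into organization names.

    A "##" continuation piece is glued onto the previous piece of an open
    entity regardless of its own label, so a word inherits the label of its
    first piece (a design choice of this sketch).
    """
    orgs = []
    current = []  # words of the entity currently being built

    def flush():
        if current:
            orgs.append(" ".join(current))
            current.clear()

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        is_piece = token.startswith("##")
        text = token[2:] if is_piece else token
        if is_piece and current:
            current[-1] += text          # continue the previous word
        elif label == "B-ORG":
            flush()                      # a new entity starts here
            current.append(text)
        elif label == "I-ORG":
            current.append(text)         # extends (or opens) an entity
        else:
            flush()                      # any other label closes the entity
    flush()
    return orgs


tokens = ["[CLS]", "I", "prefer", "Ac", "##me", "Corp", "over", "Glob", "##ex", "[SEP]"]
labels = ["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "O"]
print(extract_organizations(tokens, labels))  # ['Acme Corp', 'Globex']
```

In the full pipeline, the extracted names would then be joined back to the review ID, product ID, and source sentence.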

Our previous implementation relied on spaCy for NER, but spaCy currently needs its inputs on the CPU and was therefore slow, as it required a copy to CPU memory and back to GPU memory. With the new cudf.str.subword_tokenize, we can go from a cuDF string Series to subword tensors without leaving the GPU, unlocking many new SOTA language models.

In this task, we experimented with two of HuggingFace’s models for NER fine-tuned on CoNLL 2003 (English):

Research by Zhu, Mengdi et al. (2019) showed that BERT-based model architectures achieve near state-of-the-art performance, significantly improving on the performance of existing public NER toolkits like spaCy, NLTK, and StanfordNER.

For example, the bert-base model achieves, on average across datasets, a 13.63% better F1 than spaCy, so we not only got faster but also reached near state-of-the-art performance.
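For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it over sets of predicted and gold entities; the function name `entity_f1` and the company names are made up for illustration and are unrelated to the benchmark numbers above.

```python
def entity_f1(predicted, gold):
    """F1 score (harmonic mean of precision and recall) over two entity sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # correct share of predictions
    recall = true_positives / len(gold)          # share of gold entities found
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 of 3 predictions are correct, 2 of 4 gold entities found.
predicted_orgs = {"Acme Corp", "Globex", "Initech"}
gold_orgs = {"Acme Corp", "Globex", "Umbrella", "Hooli"}
print(round(entity_f1(predicted_orgs, gold_orgs), 4))  # 0.5714
```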

Check out the workflow code here.


This workflow is just one example of leveraging GPUs to accelerate natural language processing end to end. With cudf.str.subword_tokenize, most NLP tasks, such as question answering, text classification, summarization, translation, and token classification, are now within reach of end-to-end acceleration with RAPIDS and HuggingFace.

Stay tuned for more examples and, in the meantime, try out RAPIDS in your NLP work on Google Colab or blazingsql notebooks, see our documentation page, and if you see something missing, we welcome feature requests on GitHub!