This article will discuss how to prepare text through vectorization, hashing, tokenization, and other techniques, to be compatible with machine learning (ML) and other numerical algorithms. I’ll explain and demonstrate the process.
Natural language processing (NLP) applies machine learning (ML) and other techniques to language. However, machine learning and other techniques typically work on the numerical arrays called vectors representing each instance (sometimes called an observation, entity, instance, or row) in the data set. We call the collection of all these arrays a matrix; each row in the matrix represents an instance. Looking at the matrix by its columns, each column represents a feature (or attribute).
So far, this language may seem rather abstract if one isn’t used to mathematical language. However, when dealing with tabular data, data professionals have already been exposed to this type of data structure with spreadsheet programs and relational databases.
After all, spreadsheets are matrices when one considers rows as instances and columns as features. For example, consider a dataset containing past and present employees, where each row (or instance) has columns (or features) representing that employee’s age, tenure, salary, seniority level, and so on.
The first problem one has to solve for NLP is to convert our collection of text instances into a matrix form where each row is a numerical representation of a text instance — a vector. But, in order to get started with NLP, there are several terms that are useful to know. Let’s introduce them.
In NLP, a single instance is called a document, while a corpus refers to a collection of instances. Depending on the problem at hand, a document may be as simple as a short phrase or name or as complex as an entire book.
One has to make a choice about how to decompose our documents into smaller parts, a process referred to as tokenizing our document. It follows that this process produces tokens. Tokens are the units of meaning the algorithm can consider. The set of all tokens seen in the entire corpus is called the vocabulary.
A common choice of tokens is to simply take words; in this case, a document is represented as a bag of words (BoW). More precisely, the BoW model scans the entire corpus for the vocabulary at a word level, meaning that the vocabulary is the set of all the words seen in the corpus. Then, for each document, the algorithm counts the number of occurrences of each word in the corpus.
Most words in the corpus will not appear for most documents, so there will be many zero counts for many tokens in a particular document. Conceptually, that’s essentially it, but an important practical consideration to ensure that the columns align in the same way for each row when we form the vectors from these counts. In other words, for any two rows, it’s essential that given any index k, the kth elements of each row represent the same word.
Before getting into the details of how to assure that rows align, let’s have a quick look at an example done by hand. We’ll see that for a short example it’s fairly easy to ensure this alignment as a human. Still, eventually, we’ll have to consider the hashing part of the algorithm to be thorough enough to implement — I’ll cover this after going over the more intuitive part.
Suppose our corpus is the following four sentences:
“This is the first document.”
“This document is the second document.”
“And this is the third one.”
“Is this the first document?”
Let’s apply some preprocessing to remove case and punctuation:
“this is the first document”
“this document is the second document”
“and this is the third one”
“is this the first document”
Let’s tokenize the preprocessed documents by designating each word as a token:
“this”, “is”, “the”, “first”, “document”
“this”, “document”, “is”, “the”, “second”, “document”
“and”, “this”, “is”, “the”, “third”, “one”
“is”, “this”, “the”, “first”, “document”
Getting the vocabulary
Scanning through the corpus and getting each unique word, we can form our vocabulary:
“this”, “is”, “the”, “first”, “document”, “second”, “and”, “third”, “one”
Let’s count the number of occurrences of each word in each document.
“this”: 1, “is”: 1, “the”: 1, “first”: 1, “document”: 1, “second”: 0, “and”: 0, “third”: 0, “one”: 0
“this”:1, “is”: 1, “the”: 1, “first”: 0, “document”: 2, “second”: 1, “and”: 0, “third”: 0, “one”: 0
“this”: 1, “is”: 1, “the”: 1, “first”: 0, “document”: 0, “second”: 0, “and”: 1, “third”: 1, “one”: 1
“this”: 1, “is”: 1, “the”: 1, “first”: 1, “document”: 1, “second”: 0, “and”: 0, “third”: 0, “one”: 0
Let’s collect this into a table.
If we ignore the header, this is the matrix we were looking for.
It is worth noting that permuting the row of this matrix and any other design matrix (a matrix representing instances as rows and features as columns) does not change its meaning. The same is true for column permutations. Depending on how we map a token to a column index, we’ll get a different ordering of the columns, but no meaningful change in the representation.
This process of mapping tokens to indexes such that no two tokens map to the same index is called hashing. A specific implementation is called a hash, hashing function, or hash function.
Vocabulary based hashing
While doing vectorization by hand, we implicitly created a hash function. Assuming a 0-indexing system, we assigned our first index, 0, to the first word we had not seen. Then we incremented the index and repeated the process. Our hash function mapped “this” to the 0-indexed column, “is” to the 1-indexed column and “the” to the 3-indexed columns. A vocabulary-based hash function has certain advantages and disadvantages.
Advantages of vocabulary based hashing
Using the vocabulary as a hash function allows us to invert the hash. This means that given the index of a feature (or column), we can determine the corresponding token. One useful consequence is that once we have trained a model, we can see how certain tokens (words, phrases, characters, prefixes, suffixes, or other word parts) contribute to the model and its predictions. We can therefore interpret, explain, troubleshoot, or fine-tune our model by looking at how it uses tokens to make predictions. We can also inspect important tokens to discern whether their inclusion introduces inappropriate bias to the model.
Let’s consider the artifacts produced by some machine learning models. For example, if we use a Logistic Regression model, we can interpret the coefficient associated with each feature as its effects on the model’s prediction. Random forest models yield feature importances, which tell us how often decision trees in the random forest use each feature to make decisions. Likewise, a Naive Bayes model produces the probability that a feature is non-zero for a specified class.
The power of vocabulary-based vectorization lies in understanding which token each feature represents. So, instead, with a Logistic Regression model, we can see how strongly each token affects the prediction. With Random forests, we get feature importance associated with each token, which tells us how often the decision trees in the random forest make decisions using each token. With naive Bayes, we can extract the probability of a certain token appearing in documents of each class.
If we see that seemingly irrelevant or inappropriately biased tokens are suspiciously influential in the prediction, we can remove them from our vocabulary. If we observe that certain tokens have a negligible effect on our prediction, we can remove them from our vocabulary to get a smaller, more efficient and more concise model.
Disadvantages of vocabulary based hashing
There are a few disadvantages with vocabulary-based hashing, the relatively large amount of memory used both in training and prediction and the bottlenecks it causes in distributed training.
One downside to vocabulary-based hashing is that the algorithm must store the vocabulary. With large corpuses, more documents usually result in more words, which results in more tokens. Longer documents can cause an increase in the size of the vocabulary as well.
On a single thread, it’s possible to write the algorithm to create the vocabulary and hashes the tokens in a single pass. However, effectively parallelizing the algorithm that makes one pass is impractical as each thread has to wait for every other thread to check if a word has been added to the vocabulary (which is stored in common memory). Without storing the vocabulary in common memory, each thread’s vocabulary would result in a different hashing and there would be no way to collect them into a single correctly aligned matrix.
A better way to parallelize the vectorization algorithm is to form the vocabulary in a first pass, then put the vocabulary in common memory and finally, hash in parallel. This approach, however, doesn’t take full advantage of the benefits of parallelization. Additionally, as mentioned earlier, the vocabulary can become large very quickly, especially for large corpuses containing large documents.
Fortunately, there is an alternative way of hashing tokens: hash each instance with a non-cryptographic mathematical hash function. This type of hash function uses a combination of arithmetic, modular arithmetic, and algebra to map objects (represented by their bits) to a known range of integers or(bits). Since the range is known, the maximum value determines how many columns are in the matrix. Generally, the range is quite large, but for most rows, most columns will be 0. Therefore, with a sparse representation, the memory required to store the matrix will be minimal, and algorithms can efficiently handle sparse matrix-based operations.
Further, since there is no vocabulary, vectorization with a mathematical hash function doesn’t require any storage overhead for the vocabulary. The absence of a vocabulary means there are no constraints to parallelization and the corpus can therefore be divided between any number of processes, permitting each part to be independently vectorized. Once each process finishes vectorizing its share of the corpuses, the resulting matrices can be stacked to form the final matrix. This parallelization, which is enabled by the use of a mathematical hash function, can dramatically speed up the training pipeline by removing bottlenecks.
Although the use of mathematical hash functions can reduce the time taken to produce feature vectors, it does come at a cost, namely the loss of interpretability and explainability. Because it is impossible to map back from a feature’s index to the corresponding tokens efficiently when using a hash function, we can’t determine which token corresponds to which feature. So we lose this information and therefore interpretability and explainability.
In this article, we’ve seen the basic algorithm that computers use to convert text into vectors. We’ve resolved the mystery of how algorithms that require numerical inputs can be made to work with textual inputs.
Textual data sets are often very large, so we need to be conscious of speed. Therefore, we’ve considered some improvements that allow us to perform vectorization in parallel. We also considered some tradeoffs between interpretability, speed and memory usage.
By applying machine learning to these vectors, we open up the field of nlp (Natural Language Processing). In addition, vectorization also allows us to apply similarity metrics to text, enabling full-text search and improved fuzzy matching applications.
- Run State of the Art NLP Workloads at Scale with RAPIDS, HuggingFace, and Dask
- NLP and Text Processing with RAPIDS: Now Simpler and Faster
- This example comes from the SciKit-learn documentation: sklearn.feature_extraction.text.CountVectorizer