
Harnessing GPU Acceleration for Multi-Label Classification with RAPIDS cuML

Modern classification workflows often require classifying individual records and data points into multiple categories instead of just assigning a single label.

Open-source Python libraries like scikit-learn make it easier to build models for these multi-label problems. Several models have built-in support for multi-label datasets, and a simple scikit-learn utility extends the models that don’t to these use cases, too.

However, training these multi-label models is computationally expensive, and CPU-based infrastructure can’t keep up with the volume of data enterprises are generating every year.

RAPIDS is a collection of open-source GPU-accelerated data science and AI libraries. cuML is a GPU-accelerated machine learning library for Python with a scikit-learn compatible API.

In this blog post, we show how RAPIDS cuML makes it easy to unlock large speedups for multi-label machine learning workflows with accelerated computing.

Why multi-label classification?

In some enterprise use cases, the goal is to build models to predict a single label for every record. Payment processors need to flag whether a transaction is valid or fraudulent, and manufacturers may need to identify specific products from sensor or camera data to route appropriately within a distribution center.

In each of these scenarios, the labels are mutually exclusive. A transaction is either valid or fraudulent, but not both.

Multi-label classification is used when we don’t have mutually exclusive categories. 

For example, a healthcare provider may want to predict the presence of multiple conditions from the same set of patient data. Or, a newspaper may want to classify an article as being about both finance and world news so a recommendation system can surface it to readers of both sections.

Multi-label classification enables us to handle these kinds of problems more effectively, rather than being forced to pick one label and sacrifice the nuance.
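
To make the distinction concrete, multi-label targets are commonly encoded as a binary indicator matrix with one column per label. The sketch below uses scikit-learn’s MultiLabelBinarizer on a few made-up article tags. The tag names and the three-article list are illustrative assumptions, not data from a real newsroom.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical article tags: each article can carry several labels at once.
article_tags = [
    {"finance", "world"},
    {"sports"},
    {"finance"},
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(article_tags)

print(mlb.classes_)  # column order: labels sorted alphabetically
print(y)             # one row per article, one 0/1 column per label
```

A row with more than one 1 is an article that belongs to several sections at once, which is exactly what a single-label target cannot express.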

Using RAPIDS cuML for multi-label classification

RAPIDS cuML can be dropped right into existing scikit-learn workflows.

As in scikit-learn, some RAPIDS cuML estimators, such as KNeighborsClassifier, provide built-in support for multi-label classification. In the example below, we create a synthetic multi-label dataset using scikit-learn and directly use it with cuML’s KNeighborsClassifier.

from sklearn.datasets import make_multilabel_classification
from cuml.neighbors import KNeighborsClassifier

X, y = make_multilabel_classification(
    n_samples=10000,
    n_features=20,
    n_classes=5,
    random_state=12
)

clf = KNeighborsClassifier(n_neighbors=10).fit(X, y)
preds = clf.predict(X)
preds[:5]
array([[0, 0, 1, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 1, 0],
       [0, 0, 1, 0, 0]])

Because each record in our dataset can belong to up to five categories (n_classes=5), our estimator outputs five predictions for each row.
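
For readers who want to try the same workflow on a machine that may not have a GPU, one possible pattern is a fallback import. This is a minimal sketch, not an official cuML recipe: the `backend` variable and the smaller `n_samples` value are assumptions made for illustration, and the broad `except` is a deliberate simplification.

```python
from sklearn.datasets import make_multilabel_classification

# Prefer the GPU estimator; fall back to scikit-learn if cuML cannot be loaded.
try:
    from cuml.neighbors import KNeighborsClassifier
    backend = "cuml"
except Exception:  # broad on purpose: missing package or unusable CUDA setup
    from sklearn.neighbors import KNeighborsClassifier
    backend = "sklearn"

X, y = make_multilabel_classification(
    n_samples=1000,  # smaller than the example above so the CPU path stays fast
    n_features=20,
    n_classes=5,
    random_state=12,
)

clf = KNeighborsClassifier(n_neighbors=10).fit(X, y)
preds = clf.predict(X)

print(backend)
print(y.shape, preds.shape)  # both are (n_samples, n_classes)
```

Because both libraries expose the same estimator name and the same fit and predict methods, the modeling code after the import block does not change between backends.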

To use models like support vector machines that don’t include built-in support for multiple labels, scikit-learn provides the MultiOutputClassifier meta-estimator, which wraps a single-label classifier. We can use it with cuML estimators just as we would with scikit-learn estimators.

MultiOutputClassifier trains a separate model for each label (five in this example), so training cost grows roughly in proportion to the number of labels. As a result, accelerated computing becomes even more valuable in these scenarios.

from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from cuml.svm import SVC

X, y = make_multilabel_classification(
    n_samples=10000,
    n_features=20,
    n_classes=5,
    random_state=12
)

# Wrap the cuML SVC so that one classifier is fit per label
base = SVC()
clf = MultiOutputClassifier(base).fit(X, y)
preds = clf.predict(X)
preds[:5]
array([[0, 0, 1, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 1, 0],
       [0, 0, 1, 0, 0]])
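
The one-model-per-label behavior can be checked by inspecting the fitted wrapper. The sketch below uses scikit-learn’s own SVC as a CPU stand-in for cuml.svm.SVC, and a smaller dataset, so that it runs without a GPU. Both substitutions are assumptions made for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC  # CPU stand-in for cuml.svm.SVC

X, y = make_multilabel_classification(
    n_samples=1000,
    n_features=20,
    n_classes=5,
    random_state=12,
)

clf = MultiOutputClassifier(SVC()).fit(X, y)

# One fitted SVC per label column.
print(len(clf.estimators_), y.shape[1])

# Each sub-model predicts a single column of the combined output.
col0 = clf.estimators_[0].predict(X)
print((col0 == clf.predict(X)[:, 0]).all())
```

Since each label gets its own independently trained model, the cost of a slow base estimator is paid once per label, which is where a faster backend has the most room to help.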

Conclusion

Many real-world machine learning challenges involve multi-label classification, but CPU-based processing struggles to keep up with the growing size of datasets when using tools from the Python ecosystem.

RAPIDS cuML fits right into this ecosystem, making it easy to tap into accelerated computing for training multi-label classification models.

To learn more about accelerated machine learning and see whether your favorite estimator supports multi-label classification by default or needs the MultiOutputClassifier wrapper, visit the cuML documentation.
