
Topic Modeling and Image Classification with Dataiku and NVIDIA Data Science


The Dataiku platform for everyday AI simplifies deep learning. Use cases are far-reaching, from image classification to object detection and natural language processing (NLP). Dataiku helps you with labeling, model training, explainability, model deployment, and centralized management of code and code environments.

This post dives into high-level Dataiku and NVIDIA integrations for image classification and object detection. It also covers deep learning model deployment for real-time inference and how to use the open source RAPIDS cuML library for a customer support Tweet topic modeling use case. NVIDIA provides the hardware (NVIDIA A10 Tensor Core GPUs, in this case) and the software stack (CUDA, RAPIDS) to get the job done.

Note that all of the NVIDIA AI software featured in this post is available with NVIDIA AI Enterprise, a secure, end-to-end software suite for production AI, with enterprise support from NVIDIA. 

Deep learning for image classification and object detection

This section walks through the steps to train and evaluate a deep learning model for image classification or object detection using Dataiku and NVIDIA GPUs. 

A no-code approach

Starting with Dataiku 11.3, you can use visual, no-code tools for the core stages of an image classification or object detection workflow. You can label images, draw bounding boxes, and review and govern annotations using a native web app (Figure 1). Image labeling is key to training performant models: good data in → good model out.

Screenshot of Dataiku’s image labeling tool.
Figure 1. With Dataiku’s image labeling tool, you can label all cats as “cat,” or with more granularity to fit unique appearance or personality characteristics

Dataiku enables you to train image classification and object detection models using transfer learning, fine-tuning pretrained models on custom images, labels, and bounding boxes. Data augmentation—recoloring, rotating, and cropping training images—is a common way to increase the size of the training set and expose a model to a variety of situations (Figure 2).

Screenshot of Dataiku’s image augmentation options.
Figure 2. Using image augmentation, you can account for what may be unanticipated in your model, like cat discos and upside-down camera shots
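The visual augmentation options above correspond to standard image transforms. As a rough illustration only (not Dataiku's internals), here is a minimal torchvision sketch of the recolor, rotate, and crop augmentations; the parameter values are assumptions:

from torchvision import transforms

# Illustrative augmentation pipeline; each transform is applied randomly at train time
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, hue=0.1),  # recoloring
    transforms.RandomRotation(degrees=30),                          # rotating
    transforms.RandomResizedCrop(224),                              # random cropping back to 224x224
    transforms.RandomHorizontalFlip(),                              # covers mirrored shots
    transforms.ToTensor(),
])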

EfficientNet (image classification) and Faster R-CNN (object detection) neural networks are available with pretrained weights in the model retraining user interface, out of the box.
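Transfer learning of this kind typically freezes a pretrained backbone and swaps in a new classification head for the custom labels. A generic PyTorch sketch of the idea (this is not Dataiku's implementation, and num_classes is a placeholder):

import torch.nn as nn
from torchvision import models

num_classes = 3  # placeholder: the number of custom labels

# Load EfficientNet-B0 with ImageNet pretrained weights
model = models.efficientnet_b0(weights="IMAGENET1K_V1")

# Freeze the pretrained backbone so only the new head trains
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head to match the custom labels
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)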

After training a model on your custom image labels and bounding boxes, you can use an overlaid heat map showing model focus to explain the model's predictions (Figure 3).

Screenshot of Dataiku’s image classification model interpretation tool.
Figure 3. Heat maps show which parts of an image led to a particular prediction from the model

Once you are comfortable with the model’s performance, deploy the trained model as a containerized inference service to a Kubernetes cluster. This is managed by the Dataiku API Deployer tool.

Where is the compute happening?

Dataiku can push all the compute behind deep learning model training, explanations, and inference to NVIDIA GPUs (Figure 4). You can even leverage multiple GPUs for distributed training through the PyTorch DistributedDataParallel module or TensorFlow MirroredStrategy (see the sketch after Figure 4).

Screenshot of Dataiku interface for activating GPUs for deep learning model training.
Figure 4. Use the Dataiku interface to activate NVIDIA GPUs for deep learning model training
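To make the distributed option concrete, here is a minimal TensorFlow MirroredStrategy sketch; the model and data are placeholders, since Dataiku configures the real equivalent for you:

import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model across all visible GPUs on the node
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model standing in for EfficientNet or similar
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Placeholder data standing in for real labeled images
images = np.random.rand(64, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 3, size=(64,))
model.fit(images, labels, batch_size=16, epochs=1)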

Pushing this compute to NVIDIA GPUs happens through the Dataiku Elastic AI integrations. First, connect your Dataiku instance to a Kubernetes cluster with NVIDIA GPU resources (managed through EKS, GKE, AKS, or OpenShift). Dataiku then creates the Docker images and deploys containers behind the scenes.

Deep learning training and inference jobs can run on a Kubernetes cluster, as can arbitrary Python code and Apache Spark jobs (Figure 5).

Diagram of Dataiku’s ability to run the compute for different processes in containers on a Kubernetes cluster.
Figure 5. Dataiku is the interface, or orchestration layer, that pushes deep learning and other workloads to a Kubernetes cluster, where the compute happens

Coding model training scripts

If you want to custom code your own deep learning models in Python, try wrapping a train function in an MLflow experiment tracker. Figure 6 shows a Python-based flow, and a minimal sketch of the pattern follows the figure; see the machine learning tutorial in the Dataiku Developer Guide for a full example. This approach provides the full flexibility of custom code, along with the out-of-the-box experiment tracking, model analysis visualizations, and point-and-click model deployment that visually trained models get in Dataiku.

Screenshot of a Dataiku flow to train a custom python model with MLflow. The workflow uses Python recipes to train a custom image classification model using PyTorch and MLflow, then saves trained model versions into a folder, then imports the best one into a Dataiku green diamond model.
Figure 6. A best-practice Dataiku workflow to train a custom Python model with MLflow
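A minimal sketch of the pattern, assuming a hypothetical train() function that returns a fitted PyTorch model and a validation accuracy (the function, names, and parameters here are illustrative):

import mlflow
import mlflow.pytorch

params = {"lr": 1e-3, "epochs": 10}

with mlflow.start_run(run_name="image-classifier"):
    mlflow.log_params(params)

    # train() is a hypothetical user-defined training function
    model, val_accuracy = train(**params)

    mlflow.log_metric("val_accuracy", val_accuracy)
    mlflow.pytorch.log_model(model, artifact_path="model")

Each run is then visible in the experiment tracker, and the best logged model can be imported back into the flow as a Dataiku model.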

Custom Python deep learning models can leverage NVIDIA GPUs through containerized execution, like visually trained deep learning models in Dataiku (Figure 7).

Screenshot of a Python script in Dataiku and containerized compute options.
Figure 7. Any Python workload in Dataiku can be pushed to a Kubernetes cluster with NVIDIA GPU resources

Model deployment for real-time inference

Once the model is trained, it is time to deploy it for real-time inference. Whether you used Dataiku's visual image classification or object detection, or a custom-coded model tracked with MLflow and imported as a Dataiku model, it takes only a few clicks to create a containerized inference API service on top of the trained model.

First, connect the Dataiku API Deployer tool to a Kubernetes cluster to host these inference API services, again with NVIDIA GPUs available in the cluster nodes. Then deploy 1-N replicas of the containerized service behind a load balancer. From here, edge devices can send requests to the API service and receive predictions back from the model. Figure 8 shows this whole architecture.

Diagram showing Dataiku trained model > create API service in the API Designer > push the API service to the Deployer > push the API Service to a K8S cluster with NVIDIA GPU resources. From there, edge devices can submit requests to the API service with data, images, and receive predictions back.
Figure 8. Workflow from a model trained in Dataiku to an API services hosted on a Kubernetes cluster with NVIDIA GPUs for inference
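From a client's perspective, querying the deployed service is a plain HTTP call. A hedged sketch using the requests library; the URL, service and endpoint IDs, and feature names below are placeholders for your own deployment:

import base64
import requests

# Placeholder URL: substitute your API node host, service ID, and endpoint ID
API_URL = "https://api-node.example.com/public/api/v1/image-svc/classify/predict"

# Send one record for scoring; the expected feature names depend on your endpoint
with open("cat.jpg", "rb") as f:
    payload = {"features": {"image": base64.b64encode(f.read()).decode("utf-8")}}

response = requests.post(API_URL, json=payload, timeout=10)
response.raise_for_status()
print(response.json())  # the model's prediction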

Tutorial: Accelerate topic modeling using BERT models with RAPIDS in Dataiku

For a deeper dive, this section walks through how to set up a Python environment in Dataiku to use BERTopic with the GPU-accelerated cuML library from RAPIDS. It also highlights the performance gain from using cuML.

This example uses the Kaggle Customer Support on Twitter dataset and surfaces key customer complaint themes with topic modeling.

Step 1. Prepare the dataset

First, normalize the Tweet text by removing punctuation and stop words and stemming the remaining words. Also filter the dataset down to complaints that customers tweeted in English. All of this can be done using Dataiku visual recipes; Figure 9 shows the workflow in Dataiku.

Screenshot of the data preparation Flow in Dataiku, beginning with tweets on the left and ending with tm_train, tm_test, and tweet_response_by_org on the right.
Figure 9. Building a project workflow using Dataiku Flow

Use a split recipe to filter the company’s replies from initial user Tweets. Next, use Dataiku’s Text Preparation plugin recipe to detect language distribution across user Tweets. Figure 10 shows the distribution of Tweets by language.

Distribution of Tweet data by language: English, Spanish, French, Portuguese, Dutch, Chinese, German, Turkish, Latin, Italian, Japanese.
Figure 10. Distribution of Tweet data by language

Use a filter recipe to remove all non-English and blank Tweets. Then use a text preparation recipe to filter out stop words, punctuation, URLs, emojis, and so on, and to convert the text to lowercase (Figure 11).

Screenshot of Text Cleaning configuration with all list of Tokens to be filtered out, including punctuation, URL, email, and more.
Figure 11. Configuration of the Dataiku plugin for text cleaning 

Finally, use a split recipe to split the data for training and testing (a simple 80/20 random split).
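For orientation, the same preparation expressed in code might look like the following sketch; pandas and scikit-learn here are stand-ins for the visual recipes, and the file and column names are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("tweets.csv")  # placeholder for the Dataiku dataset

# Keep English, non-blank Tweets (the lang column comes from the language detection step)
df = df[(df["lang"] == "en") & df["text"].notna()]

# Lowercase and strip URLs and punctuation; stop-word removal and stemming omitted for brevity
df["text_cleaned"] = (
    df["text"]
    .str.lower()
    .str.replace(r"http\S+", "", regex=True)
    .str.replace(r"[^\w\s]", "", regex=True)
)

# Simple 80/20 random split into train and test sets
tm_train, tm_test = train_test_split(df, test_size=0.2, random_state=42)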

Step 2. Set up a Python environment with BERTopic and the RAPIDS library

Running the Python processes requires an elastic compute environment with NVIDIA GPUs, the Python BERTopic package (and its required packages), and a RAPIDS container image. This example uses an Amazon EKS cluster (g5 instances with NVIDIA A10 Tensor Core GPUs), the RAPIDS 22.12 stable release, and BERTopic 0.12.0.

First, launch an EKS cluster from Dataiku. Once the cluster is up, you can check its status and configuration in the Clusters tab under Administration (Figure 12).

Screenshot of the Admin Dashboard to manage EKS Clusters in Dataiku.
Figure 12. Dataiku dashboard of configured clusters

BERTopic

Create a Dataiku code environment with BERTopic and its required packages using Dataiku’s managed virtual code environments.

RAPIDS

Build a container environment using the RAPIDS image from Docker Hub. In Dataiku, you can either use the Dataiku base image for your code environment or download custom container images from Docker Hub or NGC, then attach your Dataiku code environment to it. Note that NVIDIA has since released RAPIDS on PyPI, so you can now just use the default Dataiku base image.
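As a hedged example, the code environment's package list could look like the following; the exact package names and versions depend on your CUDA version (cuml-cu11 targets CUDA 11):

# Requested packages for the Dataiku code environment (illustrative versions)
--extra-index-url https://pypi.nvidia.com
bertopic==0.12.0
cuml-cu11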

Step 3. Run BERTopic with the default UMAP

Next, use BERTopic to source the top five topics from the Twitter complaints. To accelerate the UMAP step on the GPU, use cuML UMAP. The script using the default (CPU-based) UMAP is provided below:

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from bertopic import BERTopic

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Read the train dataset into a dataframe; the project variable sample_size defines the number of records to use
sample_size = int(dataiku.get_custom_variables()["sample_size"])
train_data = dataiku.Dataset("train_cleaned")
train_data_df = train_data.get_dataframe(sampling='head', limit=sample_size)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Create a BERTopic object and run fit_transform
topic_model = BERTopic(calculate_probabilities=True, nr_topics=4)
topics, probs = topic_model.fit_transform(train_data_df["Review Description_cleaned"])
all_topics_df = topic_model.get_topic_info()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write the list of topics output as a DSS dataset

Topic_Model_df = all_topics_df
Topic_Model = dataiku.Dataset("Topic_Model")
Topic_Model.write_with_schema(Topic_Model_df)


The same script using RAPIDS cuML UMAP:

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

from bertopic import BERTopic
from cuml.manifold import UMAP

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Read the train dataset into a dataframe; the project variable sample_size defines the number of records to use
sample_size = int(dataiku.get_custom_variables()["sample_size"])
train_data = dataiku.Dataset("train_cleaned")
train_data_df = train_data.get_dataframe(sampling='head', limit=sample_size)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Create a cuML UMAP object, pass it to the BERTopic object, and run fit_transform
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
cu_topic_model = BERTopic(calculate_probabilities=True, umap_model=umap_model, nr_topics=4)
cu_topics, cu_probs = cu_topic_model.fit_transform(train_data_df["Review Description_cleaned"])
all_topics_rapids_df = cu_topic_model.get_topic_info()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write the list of topics output as a DSS dataset

Topic_Model_w_Rapids_df = all_topics_rapids_df 
Topic_Model_w_Rapids = dataiku.Dataset("Topic_Model_w_Rapids")
Topic_Model_w_Rapids.write_with_schema(Topic_Model_w_Rapids_df)

UMAP is a substantial contributor to overall compute time. Running UMAP on an NVIDIA GPU with RAPIDS cuML resulted in a 4x performance speedup (Table 1). Additional improvement can be achieved by moving more of the pipeline to the GPU, such as clustering with cuML HDBSCAN.

Topic modeling process    Runtime
Without RAPIDS            12 minutes 21 seconds
With RAPIDS               2 minutes 59 seconds

Table 1. Configuring with RAPIDS results in a 4x performance speedup
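A minimal sketch of that next GPU step, following the acceleration pattern from the BERTopic documentation; the clustering parameters here are illustrative:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Run both dimensionality reduction and clustering on the GPU
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=10, gen_min_span_tree=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics=4)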

Step 4. Build a complaint clustering dashboard

Finally, you can build charts on the output datasets (with cleaned Tweet text and topics) in Dataiku and push them to a dashboard for executive team review (Figure 13).

Dataiku dashboard displaying a variety of metrics, including a pie chart, bar graph, and word cloud.
Figure 13. Dataiku dashboard displaying a variety of metrics in one central location

Putting it all together

If you are looking to use deep learning for an image classification, object detection, or NLP use case, Dataiku helps you with labeling, model training, explainability, model deployment, and centralized management of code and code environments. Tight integrations with the latest NVIDIA data science libraries and GPU hardware make for a complete stack.

