NVIDIA TAO is a framework designed to simplify and accelerate the development and deployment of AI models. It enables you to use pretrained models, fine-tune them with your own data, and optimize the models for specific use cases without needing deep AI expertise.
TAO integrates seamlessly with the NVIDIA hardware and software ecosystem, providing tools for efficient AI model training, deployment, and inferencing and speeding up time-to-market for AI-driven applications.
Figure 1 shows that TAO supports frameworks like PyTorch, TensorFlow, and ONNX. Training can be done on multiple platforms, and the resulting models can be deployed on various inference platforms, such as GPU, CPU, MCU, and DLA.
NVIDIA just released TAO 5.5, bringing state-of-the-art foundation models and groundbreaking new features to AI model development. The new features include the following:
- Multi-modal sensor fusion models: Integrate data from multiple sensors into a unified bird’s-eye view (BEV) representation, preserving both geometric and semantic information.
- Auto-labeling with text prompts: Automatically create label datasets for object detection and segmentation using text prompts.
- Open-vocabulary detection: Identify objects from any category using natural language descriptions instead of predefined labels.
- Knowledge distillation: Create smaller, more efficient, and accurate networks from the knowledge of larger networks.
In this post, we discuss the new features of TAO 5.5 in more detail.
| Model | NVIDIA TensorRT Optimized | Training Dataset | Supported Backbones |
| --- | --- | --- | --- |
| GroundingDINO | ✓ | Real data: 1.8M images; labels: 14.5M instances of object detection and grounding annotations created with the auto-labeler | swin_tiny_224_1k, swin_base_224_22k, swin_base_384_22k, swin_large_224_22k, swin_large_384_22k |
| Mask-GroundingDINO | ✓ | Real data: ~830K images with pseudo-labeled instance masks (Open Images) | swin_tiny_224_1k, swin_base_224_22k, swin_base_384_22k, swin_large_224_22k, swin_large_384_22k |
| BEVFusion | Coming soon | Synthetic data: ~5M images | Image: Swin Transformer and ResNet-50; LiDAR: SECOND |
| NVCLIP | ✓ | Real data: 700M images | ViT-H-336, ViT-L-336 |
| SEGIC | ✓ | Real data: 110K images; synthetic data: 50K images | DINOv2-L, CLIP-B (pretrained with an NVIDIA proprietary dataset) |
| FoundationPose | ✓ | Synthetic data: 1M images | – |
NVIDIA TAO integrates open-source, foundation, and proprietary models, all trained on extensive proprietary and commercially viable datasets, making them versatile for tasks such as object detection, pose detection, image classification, segmentation, and so on. TAO simplifies fine-tuning these models for specific use cases, making it easy to customize and deploy them commercially.
Most models are accelerated by NVIDIA TensorRT and optimized for performance on NVIDIA hardware, ensuring powerful and efficient AI solutions.
Swapping model backbones in TAO is straightforward and requires no coding, just a simple configuration change. This flexibility enables you to experiment with different architectures like ResNet, shifted window (Swin) transformers, and Fully Attentional Network (FAN), tailoring models to specific needs.
This ease of customization supports diverse applications, such as product identification in retail, medical imaging classification in healthcare, robotic assembly monitoring in manufacturing, and traffic management in smart cities.
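As a minimal sketch of what such a configuration change looks like, the fragment below mirrors the distill.yaml example later in this post; the exact keys and available backbone names depend on the model and TAO version.

```yaml
# Sketch of a backbone swap in a TAO training spec (keys mirror the distill.yaml
# example later in this post; exact names vary by model and TAO version).
model:
  backbone: resnet_50    # swap to fan_small or a Swin variant with a one-line change
  train_backbone: True
```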
GroundingDINO model
Traditional object detection models are limited to recognizing objects from predefined categories (closed-set detection). This constraint makes them ineffective for applications requiring the identification of arbitrary objects based on human inputs, such as specific category names or detailed referring expressions.
GroundingDINO addresses this limitation by integrating a text encoder into the DINO model, transforming it into an open-set object detector. This enables the model to detect any object described by human inputs.
GroundingDINO effectively fuses language and vision modalities by employing a feature enhancer, language-guided query selection, and a cross-modality decoder. This enables the model to generalize concepts beyond predefined categories, achieving superior performance.
GroundingDINO with TAO
This model was trained with a Swin-Tiny backbone in a supervised manner on an NVIDIA proprietary dataset suitable for commercial usage. In addition, BERT-Base was used as the starting weights for the text tower. Finally, GroundingDINO was trained end-to-end on about 1.8M images collected from publicly available datasets.
All the raw images used during training have commercial licenses to ensure safe commercial usage.
| Model | Precision | mAP | mAP50 | mAP75 | mAPs | mAPm | mAPl |
| --- | --- | --- | --- | --- | --- | --- | --- |
| grounding_dino_swin_tiny | BF16 | 46.1 | 59.9 | 51.0 | 30.5 | 49.3 | 60.8 |
Mask-GroundingDINO model
Mask-GroundingDINO is a single-stage, open-vocabulary instance segmentation model that generates a segmentation mask around a specific instance of an object. This model is built on top of the GroundingDINO architecture, and its segmentation head is inspired by Conditional Convolutions for Instance Segmentation (CondInst).
Mask-GroundingDINO with TAO
CondInst proposes a conditional convolution head for instance segmentation, which updates its convolutional kernel weights based on the input or feature maps. We extended the mask branch and instance-aware mask heads originally designed only for CNNs in CondInst to Transformers or query-based models.
In Mask-GroundingDINO, we chose to use the same encoder and decoder design as in DINO, although similar connections can be made to other Transformer encoders and decoders. The model consists of the following components:
- Input: Text and image.
  - Text can be either a list of categories (box, forklift) or a scene or object description in natural language.
- Backbones: Text and image.
  - The text backbone can be BERT or another NLP model.
  - The image backbone can be a Swin Transformer model or another Transformer- or CNN-based backbone.
- Feature Enhancer: A stack of self-attention, text-to-image cross-attention, and image-to-text cross-attention layers that promote multimodal alignment.
- Query Selection module: Selects the top K queries based on the dot product between text features and image features.
- Mask FCN head: A few convolutional layers with dynamic kernels.
- Controller: A few convolutional layers that predict the weights of the dynamic kernels in the Mask FCN head (see the sketch after this list).
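To make the dynamic-kernel idea concrete, here is a minimal PyTorch sketch of an instance-conditioned mask head. The shapes, layer counts, and helper name are illustrative only and do not match the actual TAO implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_head(mask_feats, dyn_weights, dyn_biases):
    """Apply per-instance 1x1 dynamic convolutions predicted by a controller.

    mask_feats:  (C, H, W) shared mask features for one image
    dyn_weights: list of per-instance weight tensors, one per conv layer
    dyn_biases:  list of per-instance bias tensors, one per conv layer
    """
    x = mask_feats.unsqueeze(0)                   # (1, C, H, W)
    num_layers = len(dyn_weights)
    for i, (w, b) in enumerate(zip(dyn_weights, dyn_biases)):
        x = F.conv2d(x, w, bias=b)                # kernel weights are instance-specific
        if i < num_layers - 1:
            x = F.relu(x)
    return x.sigmoid()                            # (1, 1, H, W) instance mask

# Toy usage: in the real model, the controller regresses these weights from query features.
C = 8
mask_feats = torch.randn(C, 64, 64)
weights = [torch.randn(8, C, 1, 1), torch.randn(1, 8, 1, 1)]
biases = [torch.randn(8), torch.randn(1)]
mask = dynamic_mask_head(mask_feats, weights, biases)
```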
This model was fine-tuned from the commercial GroundingDINO pretrained model on the pseudo-labeled Open Images dataset, which allows commercial usage. The model uses a Swin-Tiny backbone, and its segmentation head was fine-tuned end-to-end on about 830K images with pseudo-labeled ground-truth masks.
You can use this model as is or fine-tune it with TAO. For fine-tuning, we provide the Mask-GroundingDINO Jupyter notebook for an easy and interactive environment where minimal coding is required.
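If you prefer the CLI over the notebook, fine-tuning follows the same tao model pattern used elsewhere in this post. The mask_grounding_dino task name and spec path below are assumptions, so confirm them against the notebook or the TAO documentation.

```sh
# Assumed task name and spec path, following the TAO CLI pattern shown later in this post
tao model mask_grounding_dino train \
    -e $SPECS_DIR/train.yaml \
    results_dir=$RESULTS_DIR/
```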
| Model | Precision | mAP | mAP50 | mask_mAP | mask_mAP50 |
| --- | --- | --- | --- | --- | --- |
| mask_grounding_dino_swin_tiny | BF16 | 47.5 | 61.7 | 32.9 | 55.7 |
BEVFusion model
In many domains, from autonomous driving to robotics and smart cities, systems rely on various sensors to perceive and interact with their environment. Each sensor type, such as cameras, LiDAR, or radar, provides unique information but also has inherent limitations. For example, cameras capture rich visual details but struggle with depth perception, while LiDAR excels in measuring distances but lacks semantic context.
The challenge is effectively combining these diverse data sources to improve system performance and reliability.
BEVFusion offers a solution to this challenge by integrating data from multiple sensors into a unified bird’s-eye view (BEV) representation. It revolutionizes this process by unifying multi-modal features into a shared BEV representation, preserving both geometric (from LiDAR) and semantic (from camera) information.
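The core fusion step can be pictured as concatenating camera and LiDAR features that already live on the same BEV grid and mixing them with a small convolutional block. The PyTorch sketch below is illustrative only; the real BEVFusion view transforms, channel sizes, and fusion module differ.

```python
import torch
import torch.nn as nn

class BEVFuser(nn.Module):
    """Toy fusion of camera and LiDAR features on a shared BEV grid."""
    def __init__(self, cam_channels=80, lidar_channels=256, out_channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # cam_bev:   (B, cam_channels, H, W)   semantic features from images
        # lidar_bev: (B, lidar_channels, H, W) geometric features from point clouds
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

fused = BEVFuser()(torch.randn(1, 80, 180, 180), torch.randn(1, 256, 180, 180))
```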
BEVFusion with TAO
In TAO, we started from the BEVFusion codebase in mmdet3d to train our model. We enhanced the original code to handle all three rotation angles in 3D space (roll, pitch, and yaw), whereas it initially supported only yaw. The TAO-trained BEVFusion model detects people from a single camera view and a LiDAR point cloud and creates a 3D bounding box around each person.
| Content | AP11 | AP40 |
| --- | --- | --- |
| Evaluation set | 77.71% | 79.35% |
CLIP model
Efficiently learning meaningful and contextually rich representations of both images and text is a significant challenge, especially for applications requiring multimodal understanding. Use cases such as zero-shot learning, image captioning, visual search, content moderation, and multimodal interaction demand robust models that can accurately interpret and integrate visual and textual data.
The Contrastive Language-Image Pre-training (CLIP) model addresses this challenge by employing a dual encoder architecture to simultaneously process images and text. Using contrastive learning, CLIP maximizes the similarity between correct image-text pairs while minimizing it for incorrect pairs, enabling the model to learn general representations that capture a wide range of concepts.
This approach enables CLIP to excel in various applications, such as generating descriptive image captions, performing visual searches based on textual queries, and enabling zero-shot learning, where the model can classify images into unseen categories by leveraging its learned representations.
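The contrastive objective itself is compact. The PyTorch sketch below shows the symmetric image-text loss in its common form; it is illustrative only and not NVCLIP's exact training recipe.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Matched image/text pairs (the diagonal) are pulled together; all others pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) pairwise similarities
    targets = torch.arange(len(image_emb))            # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```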
NVCLIP with TAO
In TAO, we provide a pretrained, TensorRT-accelerated CLIP model trained on an NVIDIA dataset of ~700M images. For evaluation, we used the 50K validation images from the ImageNet dataset.
You can deploy the model on platforms ranging from NVIDIA Jetson to NVIDIA Ampere architecture GPUs.
| Model | Top-1 accuracy |
| --- | --- |
| ViT-H-336 | 0.7786 |
| ViT-L-336 | 0.7629 |
SEGIC model
SEGIC is an innovative end-to-end segment-in-context framework that revolutionizes in-context segmentation with a single vision foundation model (VFM).
Unlike traditional methods, SEGIC captures dense relationships between target images and in-context examples, extracting geometric, visual, and meta instructions to guide the final mask prediction. This approach significantly reduces labeling and training costs while achieving state-of-the-art performance on one-shot segmentation benchmarks.
SEGIC’s versatility extends to diverse tasks, including video object segmentation and open-vocabulary segmentation, making it a powerful tool in the field of image segmentation.
SEGIC with TAO
For the SEGIC model in TAO, we provide a pretrained ONNX model and a deployment recipe for NVIDIA Triton Inference Server. For training, we used the following resources:
- Image encoder: DINOv2-L, trained with proprietary images from NVIDIA.
- Meta description encoder: CLIP-B, trained with proprietary images from NVIDIA.
| Model | mIoU |
| --- | --- |
| SEGIC: DINOv2-L, CLIP-B, mask_decoder | 69.57 |
FoundationPose model
Accurate 6D object pose estimation and tracking are crucial for various applications, such as robotics, augmented reality, and autonomous driving. However, existing methods often require extensive fine-tuning and are limited by their dependency on either model-based or model-free setups. This creates a challenge in achieving robust performance across different scenarios and objects, particularly when CAD models are unavailable or when dealing with novel objects.
FoundationPose addresses this problem by providing a unified foundation model for 6D object pose estimation and tracking that supports both model-based and model-free setups. It can be applied instantly to novel objects at test time, leveraging either CAD models or a few reference images.
By using a neural implicit representation for novel view synthesis and employing large-scale synthetic training with advanced techniques such as LLMs, a novel transformer-based architecture, and contrastive learning, FoundationPose achieves strong generalizability and outperforms specialized methods across various public datasets.
For example, in augmented reality applications, FoundationPose enables seamless integration of virtual objects into real-world environments by accurately estimating and tracking the pose of objects without the need for extensive manual adjustments.
FoundationPose with TAO
We provide a pretrained ONNX model and a deployment recipe for NVIDIA Triton Inference Server. Our training data consisted of photorealistically rendered scenes, using 3D assets from Scanned Objects by Google Research and Objaverse. Each data point includes RGB images, depth information, object poses, camera poses, instance segmentation, and 2D bounding boxes, with extensive domain randomization. In total, we created ~1M synthetic images with NVIDIA Isaac Sim.
As of March 2024, FoundationPose held the #1 position on the worldwide BOP leaderboard for model-based novel object pose estimation (Figure 3).
Prompt-based auto-labeling for object detection and segmentation
Having a well-labeled dataset is essential for training and fine-tuning models, especially in tasks like object detection and segmentation. However, obtaining a labeled dataset for new categories and instances takes a lot of time and effort. This is particularly true for instance segmentation, where annotating each object with precise masks can take up to 10x longer than drawing simple bounding boxes.
This is where a simple-to-use, prompt-based, auto-labeling method can help:
- GroundingDINO: An open-vocabulary object detection model that uses prompts such as object categories to generate bounding boxes.
- Mask Auto-Labeler: A Transformer-based framework that uses the bounding boxes to produce high-quality instance segmentation masks.
This combination significantly reduces the effort required to create detailed labels, making it easier and faster to build robust datasets for training advanced models.
We provide a complete end-to-end Jupyter notebook (text2box.ipynb) for auto-labeling with no coding needed. We also provide two spec files, one for bounding-box labels and another for segmentation masks, in which you define the objects to label (for example, [person, helmet]), the path to the unlabeled dataset, and the results directory in which to store the labels.
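For illustration, a spec might look like the following sketch. The field names here are hypothetical, so use the spec files shipped with the notebook as the source of truth.

```yaml
# Hypothetical auto-labeling spec sketch; the real TAO key names may differ.
results_dir: /workspace/tao-experiments/auto_label     # where generated labels are stored
data:
  image_dir: /workspace/tao-experiments/data/images    # unlabeled dataset
class_names: ["person", "helmet"]                      # text prompts for objects to label
```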
Generate the auto labels with the generate command:

tao dataset auto_label generate [-h] -e <spec file>
                                [results_dir=<results_dir>]
                                [num_gpus=<num_gpus>]
Efficient AI with knowledge distillation
Knowledge distillation is a technique in machine learning where a smaller, more efficient model (the student) learns to mimic the behavior of a larger, more complex model (the teacher). This process helps reduce training and fine-tuning time, as the student model uses the knowledge already captured by the teacher, including more nuanced information like soft labels, and doesn’t need to be trained from scratch for any new task.
By distilling this knowledge, the student model can achieve similar performance with significantly fewer computational resources, making it ideal for deployment in resource-constrained environments and speeding up the training process.
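Conceptually, the student is trained against a mix of distillation signals. The PyTorch sketch below shows the two most common ones, a softened-label KL term and a feature-map L2 term (the latter is what the DINO example below uses); it is illustrative only, since TAO builds the actual loss from the spec file.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, T=2.0, alpha=0.5):
    # Soft-label term: the student matches the teacher's softened class distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Feature term: L2 between intermediate feature maps.
    feat = F.mse_loss(student_feat, teacher_feat)
    return alpha * soft + (1 - alpha) * feat

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randn(4, 256, 32, 32), torch.randn(4, 256, 32, 32))
```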
TAO supports knowledge distillation. You specify bindings between teacher and student layers to formulate the distillation loss. For more information, see Optimizing the Training Pipeline.
We provide the dino_distillation.ipynb reference notebook to showcase knowledge distillation. It distills the intermediate feature maps of a DINO model with a FAN-Small backbone (the teacher) into a DINO model with a ResNet-50 backbone (the student) using an L2 loss.
distill.yaml:

model:
  backbone: resnet_50  # Student model
  train_backbone: True
distill:
  teacher:
    backbone: fan_small
    train_backbone: False
    num_feature_levels: 4  # Number of feature maps to map from the teacher
  pretrained_teacher_model_path: /workspace/tao-experiments/dino/pretrained_dino_coco_vdino_fan_small_trainable_v1.0/dino_fan_small_ep12.pth
  bindings:
    - teacher_module_name: 'model.backbone.0.body'
      student_module_name: 'model.backbone.0.body'
      criterion: L2
In the distill.yaml file, you can define the teacher model, student model backbone, and loss function. Train the model:

!tao model dino distill \
    -e $SPECS_DIR/distill.yaml \
    results_dir=$RESULTS_DIR/
After the student model is trained, you can use the TAO evaluate tool to evaluate the distilled model, the inference tool to visualize predictions, and the export tool to export the trained model to ONNX. Finally, use gen_trt_engine to generate a TensorRT engine from the exported DINO and ResNet-50 model.
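As a sketch, these follow-on steps use the same CLI pattern as the distill command above; the spec file names and arguments are placeholders, so refer to the notebook for the exact invocations.

```sh
# Placeholder spec files and arguments; see the dino_distillation.ipynb notebook for exact usage
tao model dino evaluate -e $SPECS_DIR/distill.yaml results_dir=$RESULTS_DIR/
tao model dino inference -e $SPECS_DIR/distill.yaml results_dir=$RESULTS_DIR/
tao model dino export -e $SPECS_DIR/export.yaml results_dir=$RESULTS_DIR/
tao model dino gen_trt_engine -e $SPECS_DIR/gen_trt_engine.yaml results_dir=$RESULTS_DIR/
```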
Getting started with TAO 5.5
Developers around the world are using NVIDIA TAO to accelerate AI training for their vision AI applications. Use the new capabilities of TAO 5.5 to enhance your AI application.
For more information, see the following resources: