Developer Blog

AI / Deep Learning | Autonomous Machines | Data Science | IVA/IoT |

Preparing State-of-the-Art Models for Classification and Object Detection with NVIDIA TAO Toolkit

Accuracy is one of the most important metrics for deep learning models. Greater accuracy is a prerequisite for deploying the trained models to production to solve real-world problems reliably and effectively. Creating highly accurate models from scratch is time-consuming and capital-intensive. As a result, companies with limited data and resources struggle to get their AI solutions to market.

Applying transfer learning techniques helps you create new AI models faster by fine-tuning previously trained neural networks. NVIDIA TAO Toolkit lets you take your own custom dataset and fine-tune it with one of the many popular network architectures to produce a task-specific model. Or you can get on the fast track with readily available, production-quality models for use cases in smart city, retail, robotics, and more. TAO Toolkit simplifies the process of training accurate AI models and optimizing it for inference performance, using techniques like model pruning and quantization.

With TAO Toolkit, you can achieve state-of-the-art accuracy using public datasets while maintaining high inference throughput for deployment. This post shows you how to train object detection and image classification models using TAO Toolkit to achieve the same accuracy as in the literature and open-sourced implementations. We trained on public datasets such as ImageNet, PASCAL VOC, and MS COCO as a comparison with published results in the literature or open-source community. This post discusses the complete workflow to reach state-of-the-art accuracy on several popular model architectures.

This post has three main sections. First, we show you how to prepare the required datasets for classification and object detection. Then, you train classification models on the ImageNet dataset using the VGG16, ResNet50, ResNet101, and EfficientNet B0 networks. TAO Toolkit supports over 15 classification models or backbones and you can use the technique from this post to train other models as well. After you have the ImageNet-based classification model, you then use the pretrained weights from the ImageNet-trained models to train Faster R-CNN, SSD, RetinaNet, and YoloV3 object detection models. The scripts and TAO Toolkit training spec files are available in the tao3.0/misc/dev_blog/SOTA GitHub repo.

Pretrained weights trained on the ImageNet dataset tend to provide good accuracy for object detection. The following graph shows the state-of-the-art accuracy of several top models. In this post, we show the steps to achieve this accuracy with TAO Toolkit. After you achieve the desired accuracy, you can use the model pruning and INT8 quantization features in TAO Toolkit to improve inference performance.

Figure 1. Model accuracy on public datasets.

Sources:

  1. https://keras.io/api/applications
  2. https://keras.io/api/applications. The ResNet backbones used in TAO Toolkit RetinaNet are slightly different from the Keras model. To train a ResNet model compatible with RetinaNet, all_projections should be set to True and use_pooling to False (example).
  3. https://github.com/tensorflow/tpu/tree/r2.4/models/official/efficientnet
  4. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (table 6).
  5. Deep Residual Learning for Image Recognition (table 9).
  6. SSD: Single Shot MultiBox Detector.
  7. GLUON Model Zoo Detection: SSD.
  8. Focal Loss for Dense Object Detection (Table 1(b) (c)).
  9. YOLO: Real-Time Object Detection; Darknet (codebase).
  10. yolov3.weights (evaluation mode is AP50 using 11-points sample, evaluation dataset is the COCO14 validation split previously mentioned)

Preparing the datasets

Here’s how to prepare the required datasets for classification and detection.

Prepare the ImageNet 2012 dataset

Start by downloading the ImageNet classification dataset (choose Download Original Images), which contains more than 140 GB of images. There are two tarballs to download and save to the same directory:

  • ILSVRC2012_img_train.tar (138 GB)—Used for training.
  • ILSVRC2012_img_val.tar (6.3 GB) —Used for validation.

After the dataset has been downloaded, use the imagenet.py Python script and the imagenet_val_maps.pklz validation map file to unzip the two tarballs and restructure the validation dataset. Due to copyright issues, we can’t provide the ImageNet dataset or any ImageNet-pretrained models in TAO Toolkit. Use the following command to unzip and restructure the dataset:

python3.6 imagenet.py --download-dir <tarballs_download_directory>  --target_dir <unzip_target_dir>

Assume that the paths from inside the TAO Toolkit container to the dataset are as follows:

/home/<username>/tao-experiments/data/imagenet2012/train

/home/<username>/tao-experiments/data/imagenet2012/val

The first path is a directory that contains all the training images, where each of the 1K classes has its own subdirectory. The same is assumed for the validation split as well. The structure of the classification dataset follows the TAO Toolkit classification model training requirements. For more information, see the TAO Toolkit User Guide.

Prepare the PASCAL VOC dataset

PASCAL VOC is a public dataset. Download the VOC2007 and VOC2012 datasets from the official website. Use the download_pascal_voc.sh bash script to download and split the PASCAL VOC dataset.

Finally, convert the XML labels to KITTI because XML is not supported by TAO Toolkit. The xml_to_kitti.py Python script handles this conversion.

# convert labels to KITTI format
mkdir -p /home/<username>/tao-experiments/data/voc0712trainval/labels_kitti
mkdir -p /home/<username>/tao-experiments/data/voc07test/labels_kitti
python ./xml_to_kitti.py -i /home/<username>/tao-experiments/data/voc0712trainval/labels -o /home/<username>/tao-experiments/data/voc0712trainval/labels_kitti
python ./xml_to_kitti.py -i /home/<username>/tao-experiments/data/voc07test/labels -o /home/<username>/tao-experiments/data/voc07test/labels_kitti -d

Now that the dataset is ready, run TAO Toolkit. You are using TAO Toolkit through the tao-launcher interface. When running the tao-launcher, map the ~/tao-experiments directory on the local machine to the Docker container using the ~/.tao_mounts.json file. For more information, see TAO Toolkit Launcher.

Install tao-launcher:

!pip3 install nvidia-pyindex
!pip3 install nvidia-tao

Create ~/.tao_mounts.json and add the following content:

{
    "Mounts": [
        {
            "source": "/home/<username>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
    ]
}

Mount the path /home/<username>/tao-experiments on the host machine to be the path /workspace/tao-experiments inside the container.

Prepare the COCO 2014 and COCO 2017 datasets

Download the COCO 2014 and 2017 datasets from the COCO website. For COCO 2014, download the following files:

  • train2014.zip
  • val2014.zip
  • annotations_trainval2014.zip

For COCO 2017, download the following files:

  • train2017.zip
  • val2017.zip
  • annotations_trainval2017.zip

The COCO 2014 dataset contains 80 different classes, with roughly 80K images for training and 40K images for validation. The COCO 2017 dataset also contains 80 different classes from COCO 2014, but the dataset split is different. In the COCO 2017, you have around 118K images for training and 5K images for validation. Download the COCO 2014 or 2017 dataset with the following commands.

# create a directory to place the dataset
mkdir -p /home/<username>/tao-experiments/data/coco{$year}
cd /home/<username>/tao-experiments/data/coco{$year}
# training images
wget http://images.cocodataset.org/zips/train{$year}.zip
# validation images
wget http://images.cocodataset.org/zips/val{$year}.zip
# train/val annotations
wget http://images.cocodataset.org/annotations/annotations_trainval{$year}.zip

After downloading, you must do some processing for the COCO 2014 and COCO 2017 datasets to convert the original JSON format to KITTI format.

Unzip the dataset:

cd /home/<username>/tao-experiments/data/coco{$year}
# unzip images
unzip train{$year}.zip
unzip val{$year}.zip
# create a directory for the images
mkdir images
mv train{$year} images
mv val{$year} images
# unzip annotations
unzip annotations_trainval{$year}.zip

To convert the JSON labels to the KITTI format, use the coco2kitti.py Python script. This script depends on the pycocotools package. Check the GitHub repository and install it properly. Run the following commands:

cd /home/<username>/tao-experiments/data/coco{$year}
# the script is named coco2kitti.py
# convert the instances_train{$year}.json to KITTI text
python3 ./coco2kitti.py /home/<username>/tao-experiments/data/coco{$year} train{$year}
# move the output ./labels to a safe place to avoid overwriting
mkdir -p ./KITTI
mv ./labels KITTI/train{$year}
# continue to convert the instances_val{$year}.json to KITTI text
python3 ./coco2kitti.py /home/<username>/tao-experiments/data/coco{$year} val{$year}
# move and rename the output
mkdir -p ./KITTI
mv ./labels KITTI/val{$year}

The KITTI dataset is not perfect, as there are some unpaired images and labels. Fortunately, there are just a few of these problematic images and labels so you can remove them. Use the clean_dataset.sh bash script to clean the dataset.

Copy the script to the /home/<username>/tao-experiments/data/coco{$year} directory:

cd /home/<username>/tao-experiments/data/coco{$year}
./clean_dataset.sh ./images/train{$year} ./KITTI/train{$year}
./clean_dataset.sh ./images/val{$year} ./KITTI/val{$year}

Training classification models on ImageNet 2012

In this section, you train three classification models (VGG16, ResNet101, and EfficientNet B0) on the ImageNet 2012 classification dataset. The ImageNet classification dataset contains more than 1.1 million images over 1000 classes. By pretraining weights on ImageNet, you can get high accuracy on object detection tasks by using this pretrained classification model to initialize the detection models.

Configure the classification training with a spec file

For every TAO Toolkit model training, you have a configuration file (spec file) to configure some necessary parameters used to customize the model and the training process. For more information, see the TAO Toolkit User Guide.

Download the spec files:

Perform image classification and train with TAO Toolkit

After you have the spec files, start the training. Because ImageNet is a big dataset, we recommend using multiple GPUs to reduce training time. For this post, we used a DGX-1v workstation with eight V100 GPUs, where each GPU has 32 GB memory. However, you can also use a single GPU. In the following commands, change the --gpus option value from 8 to the number of GPUs being used.

Run the following command for VGG16 training:

$tao classification train -e /workspace/tao-experiments/classification/vgg16/spec.txt -r /workspace/tao-experiments/classification/vgg16 -k nvidia_tao --gpus 8

Run the following command for ResNet50 training:

$tao classification train -e /workspace/tao-experiments/classification/resnet50/spec.txt -r /workspace/tao-experiments/classification/resnet50 -k nvidia_tao --gpus 8

Run the following command for ResNet101 training:

$tao classification train -e /workspace/tao-experiments/classification/resnet101/spec.txt -r /workspace/tao-experiments/classification/resnet101 -k nvidia_tao --gpus 8

Run the following command for EfficientNet B0 training:

$tao classification train -e /workspace/tao-experiments/classification/efficientnet_b0/spec.txt -r /workspace/tao-experiments/classification/efficientnet_b0 -k nvidia_tao --gpus 8 --use_amp

Run the following command for DarkNet53 training:

$tao classification train -e /workspace/tao-experiments/classification/darknet53/spec.txt -r /workspace/tao-experiments/classification/darknet53 -k nvidia_tao --gpus 8 --use_amp

Perform the post-training accuracy check

After the training is finished, check the log to see the accuracy. You trained VGG16 for 80 epochs, ResNet50 for 120 epochs, ResNet101 for 150 epochs, and EfficientNetB0 for 500 epochs.

Compare the validation accuracy from the model with the open-source networks from Keras Applications, TensorFlow EfficientNet GitHub pages, and the DarkNet ImageNet Classification page. The accuracy results are summarized in the following table. With TAO Toolkit, you can get an even better top-1 accuracy than Keras Applications. You can export the trained model and use it for image classification tasks or use the pretrained weights to train object detection models.

ModelSOTA accuracyTAO Toolkit accuracy
VGG1671.372.8
ResNet5074.974.9
ResNet10176.476.98
EfficientNet B076.776.64
DarkNet5377.277.1
Table 2. Summary of SOTA vs. TAO Toolkit accuracy for classification networks.

Train object detection models on PASCAL VOC and COCO datasets

In this post, you use ImageNet-pretrained weights trained in the prior section as a starting point to train popular object detection models such as Faster R-CNN, SSD, YoloV4, and RetinaNet. We show you how to train the models on the PASCAL VOC and COCO datasets. For each model, we summarize with the TAO Toolkit model accuracy compared to results from the leading literature.

Training Faster R-CNN models

After you have the pretrained models from the first section, use the weights as starting points to train your Faster R-CNN models. In this section, we show how to train a Faster R-CNN model with VGG16 and ResNet101 backbones. You train on the PASCAL VOC dataset for Faster R-CNN – VGG16 model and COCO 2014 for the Faster R-CNN – ResNet101 model.

Train on PASCAL VOC2007 and VOC2012

In this section, you use the PASCAL VOC2007 and VOC2012 datasets and test on the VOC2007 test dataset. Because you are using the VOC07 test set, stick to the convention of using 11-point metric and ignoring difficult objects when you calculate the average precision (AP) metric. For this PASCAL VOC training, you use VGG16 as backbone and use the ImageNet-pretrained VGG16 classification model from the previous section as the pretrained weights.

Convert KITTI text labels to TFRecords

Before the training, you must convert the KITTI text labels to TFRecords using the dataset-convert tool provided by TAO Toolkit. You use a spec file to configure parameters. Download two versions of the dataset_spec file for training/validation and testing. This command takes the KITTI text labels and a dataset spec file as input and produces TFRecords as output. Use the TFRecords as the label files for training the Faster R-CNN model.

Run the following command to generate TFRecords from KITTI text labels:

# trainval0712
tao faster_rcnn dataset_convert --gpu_index 0 -d /workspace/tao-experiments/data/voc0712trainval/dataset_spec.txt -o /workspace/tao-experiments/data/voc0712trainval/tfrecords
# test07
tao faster_rcnn dataset_convert --gpu_index 0 -d /workspace/tao-experiments/data/voc07test/dataset_spec.txt -o /workspace/tao-experiments/data/voc07test/tfrecords
Train

After you produce the TFRecords format, use TAO Toolkit commands to launch a Faster R-CNN training job. Like the classification models, we need a training spec file for Faster R-CNN too. We provided the training spec file.

After you have the training spec file, you are ready to run TAO Toolkit training with a single GPU (we used a V100 32-GB in our experiment).

tao faster_rcnn train --gpu_index 0 -e /workspace/tao-experiments/faster_rcnn/pascal_voc/spec.txt
Validate

After the training is finished, check the screen log to see the validation mAP.

==========================================================================================
Class               AP                  precision           recall              RPN_recall         
------------------------------------------------------------------------------------------
aeroplane           0.7521              0.0229              0.9228              0.8907             
...
tvmonitor           0.7655              0.0143              0.9156              0.8809             
------------------------------------------------------------------------------------------
mAP@0.5 = 0.7560
Prune and retrain

TAO Toolkit also features pruning functionality that can prune a model to reduce the number of parameters. Reducing the number of parameters leads to reduction in memory consumption of the model and increase in inference throughput; both extremely important for model deployment. The goal of pruning is not to only reduce the number of parameters but also to retain comparable accuracy as an unpruned network. In our experiment, we pruned the Faster R-CNN model trained on Pascal VOC dataset. The following table compares the accuracy and performance of pruned versus unpruned network on the PASCAL VOC dataset.

 Unpruned NetworkPruned Network
Pruning threshold0.0(unpruned)0.5
Pruning ratio1.00.2785
mAP75.674.63
FPS on T4(FP32/FP16/INT8)12/46/7019/64/98
Table 3. Pruned vs. unpruned network on Faster R-CNN.
  1. Pruning ratio = (# parameters in pruned model) / (# parameters in unpruned model).
  2. FPS tested on NVIDIA T4 GPU with batch size 1, at 600 x 1000 input resolution.

From the table, you can see that you can prune the Faster R-CNN model to about one fourth the model size (in terms of number of parameters)  which results in about 40% increase in frames per second (FPS), while still maintaining accuracy (mAP) within a single point. You should typically experiment with few pruning threshold parameters to see which settings provide the best accuracy and performance tradeoffs.

Train on COCO 2014

You also do training on the COCO 2014 dataset with Faster R-CNN. For this training, use ResNet101 as its backbone and use the previously trained ImageNet-based ResNet101 model as pretrained weights for Faster R-CNN training.

Convert KITTI text labels to TFRecords

As in PASCAL VOC training, you must convert the KITTI text files to TFRecords. You use a dataset spec file to convert the training KITTI text files to TFRecords. Name this dataset spec file as coco14_dataset_spec_train.txt. Similarly, you also need the dataset spec to convert the validation set. Name the latter as coco14_dataset_spec_val.txt. Use the dataset_convert tool to generate TFRecords. Some of the COCO images cannot be loaded by TensorFlow image loader during training. In that case, delete those image/label pairs and regenerate TFRecords.

# training set
mkdir -p /home/<username>/tao-experiments/data/coco2014/tfrecords/train2014
tao faster_rcnn dataset_convert --gpu_index 0 -d /workspace/tao-experiments/data/coco2014/coco14_dataset_spec_train.txt -o /workspace/tao-experiments/data/coco2014/tfrecords/train2014/train
# validation set
mkdir -p /home/<username>/tao-experiments/data/coco2014/tfrecords/val2014
tao faster_rcnn dataset_convert --gpu_index 0 -d /workspace/tao-experiments/data/coco2014/coco14_dataset_spec_val.txt -o /workspace/tao-experiments/data/coco2014/tfrecords/val2014/val
Train

When the dataset is ready, you need a spec file to configure our COCO training, same as in PASCAL VOC training. Download the spec file for COCO 2014 training: spec.txt. Start the TAO Toolkit training with a single GPU (we used a V100 32-GB GPU).

tao faster_rcnn train --gpu_index 0 -e /workspace/tao-experiments/coco2014/spec.txt
Validate

When training is complete, check the screen log to see its accuracy on the COCO 2014 validation set. There are two metrics commonly used in COCO, the first one is the ordinary mAP with IoU=0.5, while the other one is the average of mAP when IoU changes from 0.5 to 0.95, with a step size of 0.05. The first metric is called mAP@0.5 and the other one is mAP@[0.5:0.95]. TAO Toolkit prints both metrics, along with each mAP value at any IoU in [0.5, 0.95].

The mAP results look like the following:

==========================================================================================
Class               AP                  precision           recall              RPN_recall         
------------------------------------------------------------------------------------------
airplane            0.6976              0.0655              0.8394              0.8055             
...
mAP@0.5 = 0.4857             
...
mAP@[0.5:0.9500000067055225] = 0.2700

The TAO Toolkit Faster R-CNN model on COCO 2014 can reach mAP@0.5 of 0.4857 and mAP@[0.5:0.95] of 0.27.

Summary of Faster R-CNN mAP

Table 4 summarizes the TAO Toolkit-based Faster R-CNN mAP on PASCAL VOC and COCO 2014 dataset and compares them with numbers from the literature. You can see that the TAO Toolkit-based Faster R-CNN model can reach or even exceed SOTA accuracy on both PASCAL VOC and COCO2014 datasets.

ModelDatasetSOTA accuracyTAO Toolkit accuracy
Faster R-CNN VGG16PASCAL VOC 071273.275.6
Faster R-CNN ResNet101COCO 201448.4/27.248.57/27.0
Table 4. Summary of Faster R-CNN SOTA vs. TAO Toolkit accuracy.

Training SSD models

In this section, we walk you through the steps to train the SSD object detection model to achieve highest accuracy on Pascal VOC 07 test dataset and COCO 2017 validation dataset using the ImageNet-pretrained VGG16 backbone.

Train on PASCAL VOC2007 and VOC2012

Follow the steps in Prepare the datasets to get KITTI format labels. After you have the PASCAL VOC dataset in KITTI format, you use images and its corresponding labels directly for SSD training.

Prepare the training spec file

We provided a spec file for training SSD models with input size 300×300 on PASCAL VOC dataset. The model’s backbone is ImageNet-pretrained VGG16. Name this model as SSD_VGG16_300X300. You train the SSD_VGG16_300X300 for 240 epochs with batch_size=32. The optimizer is SGD with 0.9 momentum with a sophisticated learning rate scheduler. You also apply L2 regularization to the training with 0.0005 regularizing weights.

Train

Run the following command to trigger the training:

tao ssd train --gpu_index 0 -e /workspace/tao-experiments/ssd/ssd_vgg16_voc.txt -m /workspace/tao-experiments/classification/vgg16/weights/vgg_080.tao -k nvidia_tao -r /workspace/tao-experiments/ssd/voc/

It takes around 21 hours to finish the whole training on Pascal VOC dataset with a single V100 GPU.

Validate

After training, check the screen log to see the mAP on the validation set. The evaluation metric is VOC2007 11-point sampled mAP with a matching IoU threshold of 0.5. Evaluate the model on the VOC07 test dataset. The following is a sample evaluation log of an SSD_VGG16_300X300 model:

*******************************
aeroplane     AP    0.778
...
tvmonitor     AP    0.774
              mAP   0.776
*******************************

The model’s mAP based on the log is 77.6% with a matching IoU threshold of 0.5. Because the training dataset is shuffled every epoch with SSD-style augmentation during the training, the final mAP number is not the same as what’s shown. However, the gap should be within 1%.

Train on COCO 2017

You can also do training on the COCO 2014 dataset with Faster R-CNN. For this training, use ResNet101 as the backbone and use the previously trained ImageNet-based ResNet101 model as pretrained weights.

Prepare the training spec file

You also use SSD_VGG16_300x300 to train on the COCO 2017 dataset. The anchor scales range in SSD config for COCO 2017 is wider than that in config for Pascal VOC because the range of objects spatial sizes is larger. You train the SSD_VGG16_300X300 for 95 epochs with batch_size=32. The optimizer is SGD with 0.9 momentum with a sophisticated learning rate scheduler. You also apply L2 regularization to the training with 0.0005 regularizing weights. Download the spec file: ssd_vgg16_coco17.txt.

Train

Run the following command to trigger the training:

tao ssd train --gpu_index 0 -e /workspace/tao-experiments/ssd/ssd_vgg16_coco17.txt -m /workspace/tao-experiments/classification/vgg16/weights/vgg_080.tao -k nvidia_tao -r /workspace/tao-experiments/ssd/coco17

It takes around 90 hours to finish the whole training on COCO17 dataset with a single V100 GPU.

Validate

After training, check the screen log to see the mAP on the validation set. The evaluation metric is integration mAP with a matching IoU threshold of 0.5. Evaluate the model on the COCO2017 validation dataset. The following is a sample evaluation log of an SSD_VGG16_300X300 model:

*******************************
airplane      AP    0.759
...
zebra         AP    0.807
              mAP   0.431
*******************************

To get the mAP@[0.5:0.95], evaluate the model with different matching IoU threshold by changing matching_iou_threshold in evaluation_config then average the mAP.

Summary of SSD mAP

Table 5 summarizes the TAO Toolkit SSD mAP on PASCAL VOC and COCO2017 and compares them with state-of-the-art results. You can see that the TAO Toolkit-based SSD model can reach or even exceed SOTA accuracy on both PASCAL VOC and COCO2017 datasets.

ModelDatasetSOTA accuracyTAO Toolkit accuracy
SSD_VGG16_300X300PASCAL VOC 071277.277.6
SSD_VGG16_300X300COCO201742.9/25.143.1/26.1
Table 5. Summary of SSD SOTA vs. TAO Toolkit accuracy.

Like Faster R-CNN model in the previous section, TAO Toolkit can also prune the SSD models to smaller sizes without sacrificing the accuracy. Tables 6 and 7 show the pruned model’s accuracy and inference throughput compared to the unpruned models. We used the pruning threshold 0.625 and 0.6 on Pascal VOC and on the COCO dataset, respectively. Retraining is conducted after pruning with the same config file as training.

 Unpruned NetworkPruned Network
Pruning threshold0.0(unpruned)0.625
Pruning ratio1.00.79
mAP77.677.5
FPS on T4 (FP32)91115
Table 6. Unpruned vs. pruned network SSD on the PASCAL VOC2007/VOC2012 datasets.
 Unpruned NetworkPruned Network
Pruning threshold0.0(unpruned)0.6
Pruning ratio1.00.74
mAP43.143.2
FPS on T4 (FP32)6791
Table 7. Unpruned vs. pruned network SSD on the COCO2017 dataset.

With pruning, you can reduce the overall size of the model by 20% to 25% which provides a 25% to 35% increase in inference throughput on T4 GPU at FP32 precision. The FPS (frames/sec) number was generated using trtexec with TensorRT 7.2 with batch size 1.

Training a RetinaNet model

In this section, we walk you through the steps to reach RetinaNet’s state-of-the-art mAP on the COCO2017 validation dataset with the ImageNet-pretrained ResNet50 backbone.

Train on COCO 2017

Follow the steps in Prepare the datasets to get KITTI format labels. After you have the COCO2017 dataset in KITTI format, use images and its corresponding labels directly for RetinaNet training.

Prepare the training spec file

In this example, follow the same gamma (2), alpha (0.25), scales (3) and aspect ratio (1, 2, 0.5) settings as shown in Table 1(b)(c) of the RetinaNet paper, Focal Loss for Dense Object Detection. The input size is set to 608 * 608, as width and height should be a multiplier of 32. Download the full spec file for RetinaNet model training.

Train

Run the following command to trigger the training:

tao retinanet train --gpu_index 0 -e /workspace/tao-experiments/retinanet/coco/spec.txt
 -k <pretrained_model_key> -r /workspace/tao-experiments/retinanet/coco
Validate

After training, check the screen log to see the mAP on the validation set.

The evaluation metric is integration mAP with a matching IoU threshold of 0.5. Evaluate the model on the COCO2017 validation dataset. The following is a sample evaluation log of a RetinaNet ResNet50 model:

*******************************
airplane      AP    0.83
...
zebra         AP    0.868
              mAP   0.529
*******************************

To get the mAP@[0.5:0.95], evaluate the model with a different matching IoU threshold by changing matching_iou_threshold in evaluation_config then average the mAP.

Summary of RetinaNet mAP

The following table summarizes the TAO Toolkit RetinaNet mAP on COCO2017 and compares it with state-of-the-art results.

ModelDatasetSOTA accuracyTAO Toolkit accuracy
RetinaNet ResNet50COCO201752.552.9
Table 8. Summary of RetinaNet SOTA vs. TAO Toolkit accuracy.

Training YOLOv3 models

In this section, we walk you through the steps to reach YOLOv3’s state-of-the-art mAP on the Pascal VOC2007 test dataset and COCO2014 validation dataset with the ImageNet-pretrained DarkNet53 backbone.

Train on PASCAL VOC2007 and VOC2012

First, convert the dataset to KITTI format labels from the Prepare the dataset section. After you have the PASCAL VOC dataset in KITTI format, instead of generating TFrecord, use the images and its corresponding labels directly for YOLOv3 training. To compare with the SOTA model, exclude all difficult boxes from labels. The SOTA label is generated with the voc_label.py Python script, which excludes difficult boxes. To exclude difficult boxes, after you prepare the dataset, remove all bounding boxes, both in the training and validation set, that are difficult from the KITTI labels. If you use the xml_to_kitti.py script that we provided and set argument -d , that puts the occlusion parameter over difficult boxes. You can write a script to remove all such boxes.

Prepare the training spec file

We provided a spec for training YOLOv3 models with input size 416*416 on PASCAL VOC dataset. The model’s backbone is ImageNet-pretrained DarkNet53. Download the full spec file for YOLOv3 Pascal VOC: yolov3/v3_voc.txt. After you download this spec file, be sure to replace the pretrain_model_path value with the path to DarkNet53 ImageNet classification model that you trained with the highest validation accuracy.

Train

Run the following command to trigger the training:

tao yolo_v3 train --gpu_index 0 -e /workspace/tao-experiments/yolo_v3/coco/spec.txt
 -k nvidia_tao -r /workspace/tao-experiments/yolo_v3/coco
Validate

After training, check the screen log to see the mAP on the validation set. The evaluation metric is VOC07 11-point sampled mAP with a matching IoU threshold of 0.5. Evaluate the model on the VOC07 test dataset. The following is a sample evaluation log of a YOLOv3 model:

*******************************
aeroplane     AP    0.830
...
tvmonitor     AP    0.747
              mAP   0.753
*******************************

Based on the log, the model mAP is 75.3% with a matching IoU threshold of 0.5. Because the training dataset is shuffled every epoch during the training, SSD-style augmentation introduces the random pattern of the images. The final mAP number is not the same with the spec file provided. However, the gap should be within 1%.

Training on the COCO 2014 dataset

In this section, we walk you through the steps to reach YOLOv3’s state-of-the-art mAP on the COCO 2014 validation dataset with the ImageNet-pretrained DarkNet53 backbone.

Preparing COCO 2014 dataset

Download the COCO 2014 dataset from the COCO website. To compare with the SOTA model, do the training/testing split the same way as the original author. Also, the author’s training/validation split is different from the COCO 2014 official training/validation split and can be reproduced by the get_coco_dataset.sh bash file.

Using the bash file, get 5k.txt and no5k.txt. Those are the file names for validation and training images/labels. After preparing the data following the COCO 2014 data preparation section, merge the original training/validation set and re-split it according to those two files. The re-split training and validation set is used as the training/validation set for YOLOv3. This re-split procedure is important to reproduce the SOTA results.

Prepare the training spec file

We provided a spec for training YOLOv3 models with input size 416*416 on the COCO 2014 dataset. The model’s backbone is the ImageNet-pretrained DarkNet53. Download the full spec file for YOLOv3 COCO14: v3_coco.txt. After you download this spec file, be sure to replace the pretrain_model_path value with the path to DarkNet53 ImageNet classification model that you trained with the highest validation accuracy.

Train

Run the following command to trigger the training:

tao yolo_v3 train -e /workspace/tao-experiments/yolo_v4/coco/spec.txt
 -k nvidia_tao -r /workspace/tao-experiments/yolo_v4/coco
Validate

After training, check the screen log to see the mAP on the validation set. The evaluation metric is 11-points sample mAP with a matching IoU threshold of 0.5. Evaluate the model on the COCO2014 (DarkNet split) validation dataset. The following is a sample evaluation log of a YOLOv3 model:

*******************************
airplane      AP    0.812
...
zebra         AP    0.814
              mAP   0.523
*******************************

Summary of YOLOv3 mAP

Here’s the accuracy for TAO Toolkit YOLOv3 mAP on PASCAL VOC and COCO2014 compared to the state-of-the-art results.

ModelDatasetSOTA accuracyTAO Toolkit accuracy
YOLOv3_DarkNet53PASCAL VOC2007/VOC201275.4%75.3%
YOLOv3_DarkNet53COCO2014 (DarkNet split)54.2%52.3%
Table 9. Summary of YOLOv3 SOTA vs. TAO Toolkit accuracy.

Summary

In this post, you learned about using TAO Toolkit to create AI models that achieve comparable or higher accuracies on several popular deep learning architectures as compared to the literature. TAO Toolkit is a zero-coding, AI training toolkit that makes it easy for anyone to get started with training custom AI models.

With TAO Toolkit, you do not have to spend months learning deep learning frameworks to create a model or struggle with optimizing models for deployment.

TAO Toolkit supports all the popular DL architectures such as ResNet, Faster R-CNN, SSD, YoloV3/V4, and others. As noted in our experiments, the accuracy achieved by the toolkit is comparable and sometimes higher than what is presented in literature. In addition to ease of use and flexibility, TAO Toolkit also provides features such as model pruning and INT8 quantization, which can optimize the model for inference without sacrificing accuracy. Pruning and INT8 quantization is supported on all image classification and object detection networks.

For more information, see the following resources: