Training Instance Segmentation Models Using Mask R-CNN on the NVIDIA Transfer Learning Toolkit

To convert pixels to actionable insights, computer vision relies on deep learning to provide an understanding of the environment. Object detection is a commonly used technique to identify individual objects in a frame such as to identify people or cars. While object detection is beneficial for certain applications, it falls short when you want to understand objects at a pixel level.

Instance segmentation is a popular computer vision technique that helps to identify each instance of multiple objects in a frame at the pixel level. In addition to the bounding boxes, instance segmentation also creates a fine-grained segmentation mask. Segmentation helps delineate between objects and background, such as in an AI-powered green screen, where you want to blur or change the background of the frame or segmenting roads or sky in the frame. Or you may want to pin-point manufacturing defects or nuclei segmentation in microscopy. Figure 1 shows an example of segmentation masks over detected objects.

This image shows the bounding box and segmentation mask generated for the five cars detected on the streets.
Figure 1. Instance segmentation: segmentation masks over detected objects.

Transfer learning is a common practice in training specialized deep neural network (DNN) models. Transfer learning is made easier with NVIDIA Transfer Learning Toolkit (TLT), a zero-coding framework to train accurate and optimized DNN models. With the release of TLT 2.0, NVIDIA added training support for instance segmentation, using Mask R-CNN. You can train Mask R-CNN models using one of the several ResNet backbones. Pretrained weights trained for ResNet10/18/50/101 are provided in the NGC and can be used as a starting point for transfer learning.

In this post, I show you how to train a 90-class COCO Mask R-CNN model with TLT and deploy it on the NVIDIA DeepStream SDK using TensorRT. You learn how to access and use pretrained models from NGC, train a Mask R-CNN model with minimal effort, and deploy it for inference on a GPU. These steps can be used to build any custom Mask R-CNN model.

Mask R-CNN is natively integrated with the DeepStream SDK, a streaming analytic toolkit for building intelligent video analytic applications. For more information about how Mask R-CNN integrates with DeepStream, see Building Intelligent Video Analytics Apps Using NVIDIA DeepStream 5.0 (Updated for GA)

Training a Mask R-CNN model using COCO

Mask R-CNN is a two-stage, object detection and segmentation model introduced in 2017. It’s an excellent architecture due to its modular design and is suitable for various applications. In this section, I walk you through reproducible steps to take pretrained models from NGC and an open-source COCO dataset and then train and evaluate the model using TLT.

To get started, set up a NVIDIA NGC account and then pull the TLT container:

docker pull

Next, download the pretrained model. Use the NGC command to list the available models:

ngc registry model list nvidia/tlt_instance_segmentation:*

To download the desired model, use the following command. For this post, I used the ResNet50 backbone, but you are free to use any of the supported backbones.

ngc registry model download-version nvidia/tlt_instance_segmentation:resnet50 --dest $model_path

The entire workflow consists of the following steps:

  1. Preparing the data.
  2. Configuring the spec file.
  3. Training the model.
  4. Verifying the model.
  5. Exporting the model.
  6. Deploying with DeepStream.

Preparing the data

Mask R-CNN expects a directory of images for training and validation and annotation in COCO format. TFRecords is used to manage the data and help iterate faster. To download the COCO dataset and convert it to TFRecords, the Mask R-CNN iPython notebook in the TLT container provides a script called If you are using a custom dataset, you must convert the annotation to COCO before using it with TLT. For more information, see COCO data format.

Download the COCO dataset and convert to TFRecords:

bash $DATA_DIR

This downloads the raw COCO17 dataset and converts it to TFRecords in $DATA_DIR.

Configuring the spec file

The next step is to configure the spec file for training. The experiment spec file is essential, as it compiles all the necessary hyperparameters for achieving a good model. The Mask R-CNN spec file has three major components: top-level experiment configs, data_config, and maskrcnn_config. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message.

The top-level experiment configs include basic parameters for an experiment; for example, learning rate, number of iterations, whether to use mixed-precision training, and so on. An encrypted checkpoint is saved per the num_steps_per_eval value and then you run evaluation on the validation set.

The init_learning_rate value is set here for an 8-GPU training job. If you are using a different number of GPUs, adjust the learning rate following the linear scaling rule.

use_amp: False
warmup_steps: 1000
learning_rate_steps: "[60000, 80000, 90000]"
learning_rate_decay_levels: "[0.1, 0.01, 0.001]"
total_steps: 100000
train_batch_size: 3
eval_batch_size: 8
num_steps_per_eval: 10000
momentum: 0.9
l2_weight_decay: 0.00002
warmup_learning_rate: 0.0001
init_learning_rate: 0.02

The data_config value specifies the input data source and dimension. The augment_input_data is only applied during training and is recommended to achieve higher accuracy. The num_classes value is the number of categories in your ground truth plus 1 for the background class. Input images are resized and padded to image_size while keeping the aspect ratio.

        image_size: "(832, 1344)"
        augment_input_data: True
        eval_samples: 5000
        training_file_pattern: "
        validation_file_pattern: "$DATA_DIR/val*.tfrecord"
        val_json_file: "$DATA_DIR/annotations/instances_val2017.json"
        num_classes: 91
        skip_crowd_during_training: True

The maskrcnn_config value specifies the model structure and loss function–related hyperparameters. Currently, Mask R-CNN supports all ResNet backbones in TLT. In this experiment, you choose ResNet50 as the backbone, its first two convolutional blocks frozen, and all batch normalization (BN) layers frozen, as specified with freeze_bn: True and freeze_blocks: "[0,1]". With a frozen convolutional layer, the weights do not change in a downstream task. This is especially helpful in transfer learning, where generic features are already captured in the shallow layers. You not only reuse the learned features, but also reduce training time. For more information about each field, see the TLT Getting Started Guide.

maskrcnn_config {
        nlayers: 50
        arch: "resnet"
        freeze_bn: True
        freeze_blocks: "[0,1]"
        gt_mask_size: 112
        # Region Proposal Network
        rpn_positive_overlap: 0.7
        rpn_negative_overlap: 0.3
        rpn_batch_size_per_im: 256
        rpn_fg_fraction: 0.5
        rpn_min_size: 0.
        # Proposal layer.
        batch_size_per_im: 512
        fg_fraction: 0.25
        fg_thresh: 0.5
        bg_thresh_hi: 0.5
        bg_thresh_lo: 0.
        # Faster-RCNN heads.
        fast_rcnn_mlp_head_dim: 1024
        bbox_reg_weights: "(10., 10., 5., 5.)"
        # Mask-RCNN heads.
        include_mask: True
        mrcnn_resolution: 28
        # training
        train_rpn_pre_nms_topn: 2000
        train_rpn_post_nms_topn: 1000
        train_rpn_nms_threshold: 0.7
        # evaluation
        test_detections_per_image: 100
        test_nms: 0.5
        test_rpn_pre_nms_topn: 1000
        test_rpn_post_nms_topn: 1000
        test_rpn_nms_thresh: 0.7
        # model architecture
        min_level: 2
        max_level: 6
        num_scales: 1
        aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
        anchor_scale: 8
        # localization loss
        rpn_box_loss_weight: 1.0
        fast_rcnn_box_loss_weight: 1.0
        mrcnn_weight_loss_mask: 1.0

Training the model

When the data and spec file are ready, you can start the training with the following command:

tlt-train mask_rcnn -e $spec_file_path -r $experiment_dir -k $KEY --gpus N

Training with more GPUs allows networks to ingest more data faster, saving you precious time during the development process. TLT supports multi-GPU training so that you can train the model with several GPUs in parallel. If automatic mixed-precision (AMP) training is enabled by setting enable_amp to True, you can expect a 20–50% speed boost compared to F32 training. During training, a detailed log records training loss every five iterations and evaluation metrics on the validation set.

throughput: 34.4 samples/sec
==================== Metrics =====================
FastRCNN box loss: 0.27979
FastRCNN class loss: 0.11633
FastRCNN total loss: 0.39612
L2 loss: 0.83087
Learning rate: 0.00014
Mask loss: 1.3277
RPN box loss: 0.03868
RPN score loss: 0.60576
RPN total loss: 0.64443
Total loss: 3.19912

If, for any reason, the training process is interrupted, you can resume training by executing the same command. It automatically picks up from the last saved checkpoint.

Evaluating the model

To evaluate the model, use the following command:

tlt-evaluate mask_rcnn -e $spec_file_path -m $model_path -k $KEY

Mask R-CNN reports COCO’s detection evaluation metrics. For example, AP50 means the average precision (AP) when IoU is set at 50%.

All the detection frameworks use mAP as a shared metric, adopted from Pascal VOC and comparable to AP50. The classification model enables various metrics, including Top K Accuracy, Precision and Recall, and Confusion Matrix.

After training for 100K iterations using 8 GPUs, you can observe the following metrics:

=========== Metrics ===========
AP: 0.334154785
AP50: 0.539312243
AP75: 0.358969182
APl: 0.453923374
APm: 0.354732722
APs: 0.181649670
ARl: 0.661920488
ARm: 0.533207536
ARmax1: 0.297426522
ARmax10: 0.477609098
ARmax100: 0.503548384
ARs: 0.317135185
mask_AP: 0.307278961
mask_AP50: 0.505144179
mask_AP75: 0.325496018
mask_APl: 0.432014465
mask_APm: 0.327025950
mask_APs: 0.151430994
mask_ARl: 0.626315355
mask_ARm: 0.492682129
mask_ARmax1: 0.281772077
mask_ARmax10: 0.439913362
mask_ARmax100: 0.461205393
mask_ARs: 0.271702766

The KPI is obtained by fine-tuning the pretrained model from NGC, which was originally trained on a subset of the Open Images dataset. If you train using ImageNet pretrained weights or train with larger iterations, the KPI might be different.

Verifying the model

Now that you’ve trained the model, run inference and verify the predictions. To verify the model visually with TLT, use the tlt-infer command. The tlt-infer command supports inference on the .tlt model, as well as the TensorRT engine. tlt-infer generates annotated images with bounding boxes rendered on them. Optionally, you can also visualize the segmentation masks or serialize the output metadata in COCO JSON format. For example, to run an inference with a .tlt file, run the following command:

tlt-infer mask_rcnn  -i $input_images_dir  -o $annotated_images_dir -e $spec_file -m $tlt_model -l $json_label -t $threshold  --include_mask 

Figure 2 shows an example of the original image compared to the annotated output shown in Figure 3. As you can see, the model is robust to an image that is different from the COCO training data.

This image shows a scene on the street with cars before we run the inference.
Figure 2. The original image.
This annotated image shows a scene on the street with cars with predicted bounding boxes and instance masks generated after we run the inference.
Figure 3. Annotated output of the original image: inference using the trained model.

Exporting the model

Inference throughput and how quickly you can create an efficient model are two key metrics for deploying deep learning applications because they directly affect the time to market and the cost of deployment. TLT includes a tlt-export command to export and prepare TLT models for deployment. The tlt-export command can optionally generate the calibration cache for running inference at INT8 precision. For more information, see Deploying Deep Neural Networks with NVIDIA TensorRT.

The model is exported as a .etlt (encrypted TLT) file. The file is consumable by the DeepStream SDK, which decrypts the model and converts it to a TensorRT engine. Exporting the model decouples the training process from inference and allows conversion to TensorRT engines outside the TLT environment. TensorRT engines are specific to each hardware configuration and should be generated for each unique inference environment.

For example, to export the model in INT8, use the following command:

tlt-export mask_rcnn -m $model_path -o $int8_etlt_file -e $spec_file -k $KEY --cal_image_dir $calibration_image_dir --batch_size N  --batches $num_cal_batches --cal_cache_file $calibration_table --cal_data_file $calibration_data_cache --data_type int8

This generates an INT8 calibration table and .etlt file. To quantize the model to INT8, you must provide a dataset on which to calibrate, which is provided by the --cal_image_dir and --cal_data_file arguments. These arguments specify the image directory and the tensorfile needed for calibration. The number of batches in the tensorfile is obtained from batches and batch_size values. Make sure that the directory mentioned in --cal_image_dir has at least (batch_size * batches) images in it.

Deploying with DeepStream

Integrating a Mask R-CNN model in DeepStream is straightforward, as DeepStream 5.0 supports instance segmentation networks by default. The configuration file and label file for the model are provided in the SDK. These files can be used with the generated model as well as your own trained model. A sample Mask R-CNN model trained on a one-class dataset is provided in GitHub. The config and label file should work by default for that model. For the model that you trained in this post, minor modifications are required.

Start by downloading and installing the DeepStream SDK. The instructions to run TLT models with DeepStream are provided in the README:


Here are the key config files to run the Mask R-CNN model:



The /deepstream_app_source1_mrcnn.txt file is the main config used by deepstream-app. This file configures the parameter for the entire video analytic pipeline. For more information, see Reference Application Configuration. For more information about Mask R-CNN inference pipeline in DeepStream, see Building Intelligent Video Analytics Apps Using NVIDIA DeepStream 5.0 (Updated for GA).

The /config_infer_primary_mrcnn.txt file is an inference config file that sets parameters for Mask R-CNN inference. This file is called by the main deepstream_app_source1_mrcnn.txt config. Here are the key parameters that you modify based on your model:

num-detected-classes=<# of classes if different than default>

Here is an example:

network-type=3 ## 3 is for instance segmentation network
num-detected-classes=<# of classes if different than default>
0=FP32, 1=INT8, 2=FP16 mode
no cluster
0=Group Rectangles, 1=DBSCAN, 2=NMS, 3= DBSCAN+NMS Hybrid, 4 = None(No clustering)
MRCNN supports only cluster-mode=4; Clustering is done by the model itself

This runs on the clip that is provided in the SDK. To try on your own source, modify [source0] in /deepstream_app_source1_mrcnn.txt.

Figure 4 shows the end-to-end performance that you can expect with deepstream-app across various platforms. The performance is measured in frames per sec (FPS) that are processed by deepstream-app.

  • Ran with inference resolution of 1344×832
  • On NVIDIA Jetson Nano and on DLAs, this is run with a batch size of 1.
  • On Jetson AGX Xavier and Xavier NX, this is run with a batch size of 2.
  • On T4, this is run with a batch size of 4.
Figure 4. Performance of the Mask R-CNN model with DeepStream.


In this post, you learned about training instance segmentation models using the Mask R-CNN architecture with the TLT. The post showed taking an open-source COCO dataset with one of the pretrained models from NGC and training and optimizing with TLT to deploying the model on the edge using the DeepStream SDK.

You can apply these steps to train and deploy your own custom networks. Training can be done on multiple GPUs to parallelize the runs and train faster. You can also generate INT8 calibration files to run inference at INT8 precision. Running at INT8 precision improves inference performance on your edge devices.

For more information, see the following resources: