Deploying Models from TensorFlow Model Zoo Using NVIDIA DeepStream and NVIDIA Triton Inference Server

If you’re building a unique AI/DL application, you are constantly looking for ways to train and deploy AI models from frameworks such as TensorFlow, PyTorch, and TensorRT quickly and effectively. Whether you deploy in the cloud, in datacenters, or at the edge, NVIDIA Triton Inference Server enables developers to deploy trained models from any major framework, such as TensorFlow, TensorRT, PyTorch, and ONNX Runtime, and even from custom framework backends.

DeepStream is a toolkit to build scalable AI solutions for streaming video. You can take a trained model from a framework of your choice and directly run inference on streaming video with DeepStream. You can use a vast array of IoT features and hardware acceleration from DeepStream in your application. With the native Triton Server integration, you can quickly prototype any deep learning model with minimal effort. For more information about Triton integration and other key capabilities of DeepStream, see Building Intelligent Video Analytics Apps Using NVIDIA DeepStream 5.0.

In this post, we walk you through how to deploy an open source model with minimal configuration on DeepStream using Triton. We use the TensorFlow FasterRCNN-InceptionV2 model from the TensorFlow Model Zoo. We also show several optimizations that you can leverage to improve application performance. The steps outlined in this tutorial can be applied to other open-source models as well with minor changes.


We’ve introduced Triton integration to DeepStream 5.0. To get started, make sure to use the deepstream:5.0-20.07-triton container from NVIDIA NGC when using an NVIDIA GPU on the x86 platform. Triton with DeepStream on x86 only works with -triton containers.

For the NVIDIA Jetson platform, Triton shared libraries come preinstalled as part of DeepStream. They can be used from any deepstream-l4t:5.0-20.07 container.

For this post, we assume that DeepStream is installed at $DEEPSTREAM_DIR. The actual installation directory depends on whether you are using a container or the bare-metal version of DeepStream.

Deploying object detection models on DeepStream

You are going to take the FasterRCNN detection model from TensorFlow Model Zoo and create a DeepStream pipeline to deploy this model on an NVIDIA GPU for object detection.

For this post, you use the faster_rcnn_inception_v2_coco_2018_01_28 model on NVIDIA Jetson and NVIDIA T4. Triton allows you to use the TensorFlow GraphDef file directly. These are the steps for deploying the TensorFlow frozen GraphDef file:

  1. Download the model and labels.
  2. Create the Triton configuration file.
  3. Create the DeepStream configuration.
  4. Build a custom parser.
  5. Run the DeepStream app.

Step 1: Download the model and labels

Obtain the TensorFlow model and extract it. Create a directory for the model in the Triton model repository. Move the extracted frozen GraphDef file into this directory:

tar xvf faster_rcnn_inception_v2_coco_2018_01_28.tar.gz
cd $DEEPSTREAM_DIR/samples/trtis_model_repo
mkdir faster_rcnn_inception_v2 && cd faster_rcnn_inception_v2 && mkdir 1
cp frozen_inference_graph.pb 1/model.graphdef
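
The same placement can be scripted. The following is a minimal Python sketch, assuming the archive contains a file named frozen_inference_graph.pb somewhere inside it. The function name install_frozen_graph and its parameters are hypothetical and are not part of DeepStream or Triton.

```python
import shutil
import tarfile
from pathlib import Path


def install_frozen_graph(archive, repo_root, model_name, version="1"):
    """Copy frozen_inference_graph.pb from an archive into
    <repo_root>/<model_name>/<version>/model.graphdef."""
    version_dir = Path(repo_root) / model_name / version
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / "model.graphdef"
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            is_graph = Path(member.name).name == "frozen_inference_graph.pb"
            if member.isfile() and is_graph:
                # Stream only this file instead of extracting the whole archive.
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    raise FileNotFoundError("frozen_inference_graph.pb not found in %s" % archive)
```

Calling install_frozen_graph with the downloaded archive, the trtis_model_repo directory, and "faster_rcnn_inception_v2" should produce the same 1/model.graphdef file as the shell commands.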

Create a labels.txt file that includes the names of all labels used in the training dataset. For this post, use the COCO dataset labels.


Make sure that your model directory is laid out in the correct format before proceeding:

├── 1
│   └── model.graphdef
├── config.pbtxt
└── labels.txt
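
A quick way to confirm this layout is a small check like the following Python sketch. The helper name missing_model_files is hypothetical. Because config.pbtxt is only created in step 2, expect it to be reported as missing until then.

```python
from pathlib import Path


def missing_model_files(model_dir, version="1"):
    """Return the expected files that are absent from a model directory."""
    model_dir = Path(model_dir)
    expected = [
        model_dir / version / "model.graphdef",
        model_dir / "config.pbtxt",
        model_dir / "labels.txt",
    ]
    # Report paths relative to the model directory to keep the output short.
    return [str(p.relative_to(model_dir)) for p in expected if not p.is_file()]
```

An empty list means that all three files from the tree above are in place.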

Step 2: Create the Triton configuration file

Create a model configuration file that includes information about the input tensor to the network, the names, shapes, and data types of the output tensor nodes, and other information about the network needed by Triton. For more information, see Triton Model Configuration.

The key parameters in this configuration are platform, input, and output. For this post, set platform to tensorflow_graphdef because you are using a TensorFlow GraphDef model. The other options are tensorrt_plan, tensorflow_savedmodel, caffe2_netdef, onnxruntime_onnx, pytorch_libtorch, and custom.

The output section of the configuration file provides Triton with the name, data type, and shape of the network’s output nodes. These vary depending on the model and must change if you use a different model. Tools such as Netron and TensorBoard can help you explore your model and learn about its inputs and outputs. Create the config file, config.pbtxt, and save it as shown in the directory tree earlier.

The following code example shows the model configuration file:

name: "faster_rcnn_inception_v2"
platform: "tensorflow_graphdef"
max_batch_size: 1
input [
  {
    name: "image_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NHWC
    dims: [ 600, 1024, 3 ]
  }
]
output [
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
    reshape { shape: [ 100, 4 ] }
  },
  {
    name: "detection_classes"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "num_detections"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape { shape: [] }
  }
]
# Specify GPU instance.
instance_group {
  kind: KIND_GPU
  count: 1
  gpus: 0
}
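
Because the input and output sections are brace-sensitive lists, it can help to generate them rather than type them by hand when you adapt the file to another model. The following Python sketch renders such a section from a list of tensor descriptions. The helper names render_tensor and render_section are hypothetical and only produce text; they do not validate the result against Triton.

```python
def render_tensor(name, data_type, dims, extra=()):
    """Render one entry of an input or output list."""
    lines = [
        '    name: "%s"' % name,
        "    data_type: %s" % data_type,
        "    dims: [ %s ]" % ", ".join(str(d) for d in dims),
    ]
    # Optional extra lines, such as a reshape block, are passed through as-is.
    lines += ["    " + line for line in extra]
    return "  {\n" + "\n".join(lines) + "\n  }"


def render_section(keyword, tensors):
    """Render an `input [ ... ]` or `output [ ... ]` section."""
    body = ",\n".join(render_tensor(**t) for t in tensors)
    return "%s [\n%s\n]" % (keyword, body)


outputs = [
    {"name": "detection_boxes", "data_type": "TYPE_FP32", "dims": [100, 4]},
    {"name": "detection_classes", "data_type": "TYPE_FP32", "dims": [100]},
    {"name": "detection_scores", "data_type": "TYPE_FP32", "dims": [100]},
    {"name": "num_detections", "data_type": "TYPE_FP32", "dims": [1],
     "extra": ["reshape { shape: [] }"]},
]
print(render_section("output", outputs))
```

The tensor names and shapes here are the ones from the configuration in this post. For a different model, replace them with the values you find in Netron or TensorBoard.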

Step 3: Create the DeepStream configuration

The DeepStream reference app requires two configuration files for this model: the DeepStream application config file, which sets various parameters for the reference app, and the inference config file, which sets inference-specific parameters for the chosen network. Place these files as follows:

├── config_infer_primary_faster_rcnn_inception_v2.txt
└── source1_primary_faster_rcnn_inception_v2.txt

The DeepStream configuration file, source1_primary_faster_rcnn_inception_v2.txt, is a top-level configuration file that changes the properties and behaviors of components in the pipeline or toggles them on and off. For more information about this file, see the DeepStream reference application configuration. The inference configuration file is referenced under the [primary-gie] section using the config-file option. For this example, the DeepStream configuration file can be downloaded from the NVIDIA-AI-IOT/deepstream_triton_model_deploy GitHub repo.
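
To see which inference configuration a given application config points to, you can read the [primary-gie] section programmatically. The following Python sketch assumes that the application config is INI-like enough for the standard configparser module to read. The SAMPLE text is an illustrative fragment, not the file from the GitHub repo, and primary_gie_config is a hypothetical helper.

```python
import configparser
from pathlib import Path


def primary_gie_config(app_config_text, app_config_dir="."):
    """Return the inference config path referenced by [primary-gie]."""
    # strict=False tolerates repeated keys; interpolation is disabled so
    # that '%' characters in values are read literally.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(app_config_text)
    relative = parser.get("primary-gie", "config-file")
    return Path(app_config_dir) / relative


# Illustrative fragment only; the real file has many more sections.
SAMPLE = """
[primary-gie]
enable=1
config-file=config_infer_primary_faster_rcnn_inception_v2.txt
"""

print(primary_gie_config(SAMPLE, "samples/configs/deepstream-app-trtis"))
```

The path is resolved here relative to the directory that holds the application config, which matches the layout shown earlier where both files sit side by side.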

The inference configuration file, config_infer_primary_faster_rcnn_inception_v2.txt, gives the inference plugin the information it needs about the inputs, outputs, preprocessing, post-processing, and communication facilities required by the application. For more information, see the GStreamer gst-nvinferserver plugin development guide. The following code example shows an inference configuration file:

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 1
  backend {
    trt_is {
      model_name: "faster_rcnn_inception_v2"
      version: -1
      model_repo {
        root: "../../trtis_model_repo"
        log_level: 2
        tf_gpu_memory_fraction: 0
        tf_disable_soft_placement: 0
      }
    }
  }
  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_NONE
    maintain_aspect_ratio: 0
    frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
    frame_scaling_filter: 1
    normalize {
      scale_factor: 1.0
      channel_offsets: [0, 0, 0]
    }
  }
  postprocess {
    labelfile_path: "../../trtis_model_repo/faster_rcnn_inception_v2/labels.txt"
    detection {
      num_detected_classes: 91
      custom_parse_bbox_func: "NvDsInferParseCustomTfSSD"
      nms {
        confidence_threshold: 0.3
        iou_threshold: 0.6
        topk: 100
      }
    }
  }
  extra {
    copy_input_to_host_buffers: false
  }
  custom_lib {
    path: "/opt/nvidia/deepstream/deepstream-5.0/lib/"
  }
}
input_control {
  interval: 0
}
output_control {
  detect_control {
    default_filter { bbox_filter { min_width: 32, min_height: 32 } }
  }
}

There are a few parameters to specify for running this or any model with DeepStream:

  • preprocess: Converts the input frame to the format and resolution that the network expects.
  • infer_config: Specifies the batch size for inference and the location of the Triton model repository, which holds the model configuration file created in step 2.
  • postprocess: Specifies how to turn the inference results from TensorFlow into usable detections. In this case, it parses the output tensors to create bounding boxes. For this model, you use a custom parser function called NvDsInferParseCustomTfSSD from the shared library specified in the custom_lib section of infer_config.
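
The nms block in the postprocess section takes three settings: confidence_threshold, iou_threshold, and topk. The following Python sketch shows one common way that such settings interact in greedy non-maximum suppression. It is an illustration of the general technique, not DeepStream's implementation, and it assumes boxes in (left, top, right, bottom) form and a single class.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, confidence_threshold=0.3, iou_threshold=0.6, topk=100):
    """Greedy NMS over (box, score) pairs."""
    # Drop low-confidence detections, then visit the rest best-first.
    candidates = sorted(
        (d for d in detections if d[1] >= confidence_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if len(kept) >= topk:
            break
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

Raising iou_threshold keeps more overlapping boxes, and raising confidence_threshold discards more weak detections before suppression runs.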

Step 4: Build a custom parser

To obtain the bounding boxes from the output nodes of the network, you must pass the output to a post-processing parser. DeepStream provides sample parsers for bounding box detection and image classification. For this example, the output nodes are detection_boxes, detection_classes, detection_scores, and num_detections. Because these outputs have the same layout as those of the TensorFlow SSD model that ships as an example with the DeepStream SDK, the sample post-processing parser for that model can also parse the FasterRCNN-InceptionV2 output. For more information, see the TensorFlow Faster RCNN meta architecture.
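
To make the parsing step concrete, the following Python sketch converts the four output tensors into pixel-space boxes. It assumes the TensorFlow Object Detection API convention, in which each row of detection_boxes is [ymin, xmin, ymax, xmax] normalized to the range 0 to 1. The function parse_detections, its parameters, and the min_score default are hypothetical; the actual DeepStream parser is a C++ function in the custom parser library.

```python
def parse_detections(boxes, classes, scores, num_detections,
                     width, height, min_score=0.3):
    """Convert normalized detection outputs into pixel-space boxes."""
    results = []
    # Only the first num_detections rows hold valid detections.
    for i in range(int(num_detections)):
        if scores[i] < min_score:
            continue
        ymin, xmin, ymax, xmax = boxes[i]
        results.append({
            "class_id": int(classes[i]),
            "score": float(scores[i]),
            "left": xmin * width,
            "top": ymin * height,
            "width": (xmax - xmin) * width,
            "height": (ymax - ymin) * height,
        })
    return results
```

Here width and height are the resolution that the boxes should be scaled to. A model whose outputs use a different order or are not normalized needs a different parser, which is why the parser must change when the model changes.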


To run a different model, you must change the output parser. To add your own post-processing parser, do the following steps.

Add a post-processing function to the parser sources in $DEEPSTREAM_DIR/sources/libs/nvdsinfer_customparser, the directory used by the build step that follows. You can refer to the other parsers there.


Next, build it using the Makefile:

cd $DEEPSTREAM_DIR/sources/libs/nvdsinfer_customparser
make all

This creates a post-processing shared library. You can then either copy this binary to $DEEPSTREAM_DIR/lib or provide the absolute path to it under the custom_lib attribute of the inference config file from earlier. You might also have to change detection-specific parameters in the inference config file, such as the number of detected classes, the name of your custom post-processing parsing function, and the method used for clustering the bounding boxes.

Step 5: Run the DeepStream app

Now that you have everything in place, you can finally use deepstream-app to deploy the model:

cd $DEEPSTREAM_DIR/samples/configs/deepstream-app-trtis
deepstream-app -c source1_primary_faster_rcnn_inception_v2.txt

This opens a window that shows the sample video with the detected bounding boxes around cars, traffic lights, and so on.

Figure 1. A deepstream-app window showing detected cars, traffic lights, and so on.

Optimizing the model

Now that the model is deployed, look at optimizing the network to achieve higher FPS or infer more concurrent streams. Triton allows you to use TensorFlow-TensorRT. TensorRT performs several important transformations and optimizations to the neural network graph, such as removing layers with unused outputs, layer fusion, enabling mixed precision, and more.

Use the following code examples to optimize your TensorFlow network using TF-TRT, depending on your platform.

x86 with NVIDIA GPU:

$docker pull
$docker run --gpus all -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /home/$USER/triton_blog/:/workspace/triton_blog

Jetson:

$docker pull
$docker run --runtime=nvidia -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /home/$USER/triton_blog/:/workspace/triton_blog

When you have the TensorFlow container running, use the export script to apply TF-TRT optimizations to your model:

$python3 --modelPath faster_rcnn_inception_v2_coco_2018_01_28/frozen_inference_graph.pb --gpu_mem_fraction 0.6 --nms True --precision FP16 --max_batch_size 8 --min_segment_size 10

Replace the unoptimized model with the generated TF-TRT model:

$cp nms_frozen_tftrt_fp16_bs8_mss10.pb 1/model.graphdef
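
For orientation, the following Python sketch shows roughly what a TF-TRT conversion of a frozen graph looks like with the TensorFlow 1.x TrtGraphConverter API. It is not the export script used in this post: it does not handle the --nms or --gpu_mem_fraction options, and the helper names tftrt_settings and convert_frozen_graph are hypothetical. It assumes a TensorFlow 1.15 build with TensorRT support, such as the one in the NVIDIA TensorFlow container.

```python
def tftrt_settings(precision="FP16", max_batch_size=8, min_segment_size=10):
    """Map script-style options onto TrtGraphConverter keyword arguments."""
    precision = precision.upper()
    if precision not in ("FP32", "FP16", "INT8"):
        raise ValueError("unsupported precision: " + precision)
    return {
        "precision_mode": precision,
        "max_batch_size": max_batch_size,
        "minimum_segment_size": min_segment_size,
    }


def convert_frozen_graph(graph_path, output_path, output_nodes, **options):
    """Rewrite a frozen GraphDef with TF-TRT and save the result."""
    # Imported lazily so tftrt_settings works without TensorFlow installed.
    import tensorflow as tf
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(graph_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    converter = trt.TrtGraphConverter(
        input_graph_def=graph_def,
        # Output nodes must stay outside the TensorRT segments.
        nodes_blacklist=output_nodes,
        **tftrt_settings(**options)
    )
    trt_graph = converter.convert()
    with tf.io.gfile.GFile(output_path, "wb") as f:
        f.write(trt_graph.SerializeToString())
```

For this model, output_nodes would be the four output names from the Triton configuration: detection_boxes, detection_classes, detection_scores, and num_detections.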


We compared the performance of the base FasterRCNN-InceptionV2 model running in native TensorFlow with that of the optimized TF-TRT model. In addition to using TF-TRT, the optimized model ran at lower precision (FP16). The input to the application is a 1080p video stream.

Configuration                           NVIDIA T4    Jetson NX
FP32 with no optimization               12.80 fps    3.27 fps
FP16 with NMS on CPU, Streams=1, BS=4   22.78 fps    13.88 fps
FP16 with NMS on CPU, Streams=4, BS=4   32.36 fps    14.92 fps
Table 1. Performance comparison for the base FasterRCNN-InceptionV2 model running in native TensorFlow and the optimized TF-TRT model.

With the optimizations on a T4, performance almost doubled over the base FP32 model without any optimization.

On Jetson NX, performance roughly quadrupled with all the optimizations applied. DeepStream allows you to add more streams easily with a simple change to the configuration file. With four streams and a batch size of 4, you can get up to 32 FPS accumulated across four streams on T4 and almost 15 FPS on Jetson NX.
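
The ratios behind these statements can be checked directly from Table 1. The following Python sketch computes the single-stream speedup over the FP32 baseline and the average per-stream throughput in the four-stream case. The dictionary keys are hypothetical labels for the table rows.

```python
# Throughput from Table 1, in frames per second.
fps = {
    "T4": {"fp32": 12.80, "fp16_1_stream": 22.78, "fp16_4_streams": 32.36},
    "Jetson NX": {"fp32": 3.27, "fp16_1_stream": 13.88, "fp16_4_streams": 14.92},
}


def speedup(platform, optimized="fp16_1_stream"):
    """Ratio of optimized throughput to the FP32 baseline."""
    return fps[platform][optimized] / fps[platform]["fp32"]


def per_stream_fps(platform, streams=4):
    """Average throughput per stream when the streams share one device."""
    return fps[platform]["fp16_4_streams"] / streams


for platform in fps:
    print(platform, round(speedup(platform), 2), round(per_stream_fps(platform), 2))
```

The single-stream speedup comes out at about 1.78x on T4 and about 4.24x on Jetson NX. With four streams, the accumulated throughput works out to roughly 8 FPS per stream on T4 and just under 4 FPS per stream on Jetson NX.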


Get started with DeepStream SDK and simplify productionizing any open-source model for popular computer vision tasks such as object detection, classification, and more. The native integration of Triton with DeepStream 5.0 enables you to seamlessly deploy your model from any framework and achieve greater performance.
