Computer Vision / Video Analytics

Real-Time Vision AI From Digital Twins to Cloud-Native Deployment with NVIDIA Metropolis Microservices and NVIDIA Isaac Sim


As vision AI complexity increases, streamlined deployment solutions are crucial to optimizing spaces and processes. NVIDIA Metropolis AI workflows and microservices accelerate development, turning ideas into reality in weeks rather than months.

In this post, we explore the following topics:

  • Cloud-native AI application development and deployment with NVIDIA Metropolis microservices
  • Simulation and synthetic data generation with NVIDIA Isaac Sim
  • AI model training and fine-tuning with NVIDIA TAO Toolkit
  • Automated accuracy tuning with PipeTuner
Diagram starts workflow with simulation and synthetic data generation, includes model training and cloud-native application development, and ends with deployment.
Figure 1. Scalable recipe for modern vision AI development
Video 1. End-to-End Workflow from Digital Twin to Multi-Camera Tracking

Cloud-native AI application development and deployment with Metropolis microservices and workflows

Managing and automating infrastructure with AI is challenging, especially for large and complex spaces like supermarkets, warehouses, airports, ports, and cities. It’s not just about scaling the number of cameras, but building vision AI applications that can intelligently monitor, extract insights, and highlight anomalies across hundreds or thousands of cameras in spaces of tens or hundreds of thousands of square feet.

A microservices architecture enables scalability, flexibility, and resilience for complex multi-camera AI applications by breaking them down into smaller, self-contained units that interact through well-defined APIs. This approach enables the independent development, deployment, and scaling of each microservice, making the overall application more modular and easier to maintain.

Key components of real-time, scalable multi-camera tracking and analytics applications include the following:

  • A multi-camera tracking module to aggregate local information from each camera and maintain global IDs for objects across the entire scene
  • Different modules for behavior analytics and anomaly detection
  • Software infrastructure, such as a real-time, scalable message broker (for example, Kafka) and a database (for example, Elasticsearch)
  • Standard interfaces to connect with downstream services needing on-demand metadata and video streams
  • Cloud-native packaging for each module, so that the application is scalable, distributed, and resilient
Connect IoT sensors, application dashboards, or business services to reference or customized workflows. Select the Metropolis microservices needed for your use case or customize them using Metropolis SDKs and tools.
Figure 2. Scalable vision AI application workflow using Metropolis microservices

Metropolis microservices give you powerful, customizable, cloud-native building blocks for developing vision AI applications and solutions. They make it far easier and faster to prototype, build, test, and scale deployments from edge to cloud with enhanced resilience and security, accelerating your path to unlocking business insights across a wide range of spaces, from warehouses and supermarkets to airports and roadways.

Diagram shows icons for six microservices: Perception, Multi-Camera Fusion, Behavior Analytics, Query By Example, Behavior Learning, and Media Management.
Figure 3. A suite of Metropolis microservices for vision AI applications

For more information and a comprehensive list of microservices, see the NVIDIA Metropolis microservices documentation.

The next sections cover some key microservices in more detail:

  • Media Management
  • Perception
  • Multi-Camera Fusion

Media Management microservice

The Media Management microservice is based on the NVIDIA Video Storage Toolkit (VST) and provides an efficient way to manage cameras and videos. VST features hardware-accelerated video decoding, streaming, and storage. 

Diagram lists features such as ONVIF discovery, storage management, streaming management, metadata overlay, RTSP passthrough, and WebRTC playback.
Figure 4. Manage cameras and video files with the Media Management microservice

It supports ONVIF S-profile devices with ONVIF discovery, control, and dataflow. You can also manage devices manually by IP address or RTSP URL. It supports both H.264 and H.265 video formats. VST is designed for security, built on industry-standard protocols, and runs on multiple platforms.

Perception microservice

The Perception microservice takes input data from the Media Management microservice and generates perception metadata (bounding boxes, single-camera trajectories, Re-ID embedding vectors) within individual streams. It then sends this metadata to downstream analytics microservices for further reasoning and insight.

Diagram shows that the Perception microservice takes an RTSP stream as input, handles stream processing, runs a detection foundation model and a re-identification foundation model, performs single-view tracking, and sends inference and tracker metadata in protobuf to the message broker (Kafka).
Figure 5. Detect and track objects using the Perception microservice

The microservice is built with the NVIDIA DeepStream SDK. It offers a low-code or no-code approach to real-time video AI inference by providing pre-built modules and APIs that abstract away low-level programming tasks. With DeepStream, you can configure complex video analytics pipelines through a simple configuration file, specifying tasks such as object detection, classification, tracking, and more. 
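To give a concrete feel for this per-frame metadata, here is a minimal Python sketch that parses one frame's worth of per-object records. The field names (`sensor_id`, `objects`, `bbox`) are illustrative assumptions for this example; the actual microservice publishes protobuf-encoded messages to Kafka, as Figure 5 shows.

```python
import json

def parse_frame_metadata(message: str) -> list:
    """Extract per-object records from one frame's metadata message.

    Note: the JSON field names here are illustrative assumptions; the
    real Perception microservice emits protobuf-encoded messages to a
    Kafka message broker.
    """
    frame = json.loads(message)
    records = []
    for obj in frame.get("objects", []):
        records.append({
            "camera_id": frame["sensor_id"],
            "frame_id": frame["frame_id"],
            "track_id": obj["id"],              # single-camera track ID
            "bbox": obj["bbox"],                # [left, top, right, bottom]
            "embedding": obj.get("embedding"),  # Re-ID vector, if present
        })
    return records
```

Downstream analytics microservices consume records like these to perform multi-camera reasoning.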

Multi-Camera Fusion microservice

The Multi-Camera Fusion microservice aggregates and processes information across multiple camera views, taking perception metadata from the Perception microservice through Kafka (or any custom source with a similar message schema) and extrinsic calibration information from the Camera Calibration Toolkit as inputs. 

Diagram shows that the Multi-Camera Fusion microservice takes object metadata from the Perception microservice through Kafka and camera calibration data from the Camera Calibration Toolkit as input, tracks objects in a top-down view, creates global IDs for objects, and sends these IDs to the message broker for further analytics or visualization.
Figure 6. Track objects across multiple cameras using the Multi-Camera Fusion microservice
  • Inside the microservice, the data first goes to the Behavior State Management module, which maintains behaviors from previous batches and concatenates them with data from incoming micro-batches to create trajectories.
  • Next, the microservice performs two-step hierarchical clustering, re-assigning co-existing behaviors and suppressing overlapping ones.
  • Finally, the ID Merging module consolidates individual object IDs into global IDs, maintaining a correlation of objects observed across multiple sensors.
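As a rough illustration of the final step, the sketch below greedily merges single-camera tracks whose Re-ID embeddings are similar into shared global IDs. This is a simplified stand-in for the two-step hierarchical clustering described above, not the microservice's actual algorithm; the cosine-similarity threshold and greedy strategy are illustrative assumptions.

```python
import numpy as np

def assign_global_ids(track_embeddings, threshold=0.8):
    """Merge single-camera tracks into global IDs by Re-ID similarity.

    A simplified, greedy stand-in for the Multi-Camera Fusion
    microservice's two-step hierarchical clustering; the threshold
    value is an illustrative assumption.
    """
    keys = list(track_embeddings)
    # L2-normalize so the dot product equals cosine similarity
    vecs = [np.asarray(track_embeddings[k], dtype=float) for k in keys]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    global_ids, next_id = {}, 0
    for i, key in enumerate(keys):
        for j in range(i):
            if float(vecs[i] @ vecs[j]) > threshold:
                global_ids[key] = global_ids[keys[j]]  # same object: reuse ID
                break
        else:
            global_ids[key] = next_id                  # new object: new ID
            next_id += 1
    return global_ids
```

For example, two tracks of the same person from different cameras (near-parallel embeddings) receive one global ID, while a dissimilar track receives a new one.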

Metropolis AI workflows

Reference workflows and applications are provided to help you evaluate and integrate advanced capabilities. 

For example, the Multi-Camera Tracking (MTMC) workflow is a reference workflow for video analytics that performs multi-target, multi-camera tracking and provides a count of the unique objects seen over time.

Diagram shows the multi-camera tracking workflow, starting from taking input data from real or stored video streams, sending it through the analytics microservices, to visualizing or storing data with the reference web interface.
Figure 7. Multi-Camera Tracking workflow using multiple Metropolis microservices
  • The application workflow takes live camera feeds as input from the Media Management microservice.
  • It performs object detection and tracking through the Perception microservice. 
  • The metadata from the Perception microservice goes to the Multi-Camera Fusion microservice to track the objects in multiple cameras. 
  • A parallel thread goes to the extended Behavior Analytics microservice to first preprocess the metadata, transform image coordinates to world coordinates, and then run a state management service. 
  • The data then goes to the Behavior Analytics microservice, which, together with the MTMC microservice, provides various analytics functions as API endpoints.
  • The Web UI microservice visualizes the results.
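One of the analytics functions the workflow provides is the count of unique objects seen over time. Conceptually, given (timestamp, global ID) pairs from the fusion stage, it can be sketched as follows; the input format is an illustrative stand-in for the fused metadata the analytics microservices actually consume.

```python
def unique_object_counts(events):
    """Running count of distinct global IDs seen over time.

    `events` is an iterable of (timestamp, global_id) pairs, an
    illustrative stand-in for the fused metadata stream.
    """
    seen, counts = set(), []
    for ts, gid in sorted(events):
        seen.add(gid)
        counts.append((ts, len(seen)))
    return counts
```

Because global IDs persist across cameras, an object re-observed by another camera does not inflate the count.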

For more information, see the Multi-Camera Tracking quickstart guide.

Intuitive camera calibration

In most Metropolis workflows, analytics are performed in real-world coordinate systems. To convert camera coordinates to real-world coordinates, a user-friendly, web-based Camera Calibration Toolkit is provided, with features such as the following:

  • Easy camera import from VMS
  • Interface for reference point selection between camera image and floorplan
  • On-the-fly reprojection error for self-checking
  • Add-ons for ROIs and tripwires
  • File upload for images or building map
  • Export to web or API
Screenshot shows the camera calibration toolkit mapping points from the image domain to the map domain.
Figure 8. Metropolis Camera Calibration Toolkit

This intuitive toolkit simplifies the process of setting up and calibrating cameras for seamless integration with Metropolis workflows and microservices.
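Under the hood, this kind of calibration amounts to estimating a homography from the selected reference points. As a hedged illustration (not the toolkit's implementation), a direct linear transform over four or more image-to-floor-plan correspondences looks like this:

```python
import numpy as np

def fit_homography(img_pts, map_pts):
    """Estimate the 3x3 homography mapping image pixels to floor-plan
    coordinates from >= 4 point correspondences (direct linear transform)."""
    A = []
    for (x, y), (u, v) in zip(img_pts, map_pts):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the right singular vector with the smallest
    # singular value, reshaped to 3x3
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 3)

def project(H, pt):
    """Apply a homography to one image point."""
    u, v, w = H @ np.array([pt[0], pt[1], 1.0])
    return u / w, v / w
```

The on-the-fly reprojection error the toolkit reports for self-checking corresponds to the distance between `project(H, img_pt)` and the matching map point for each correspondence.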

2024 AI City Challenge

The NVIDIA Multi-Camera Tracking workflow was evaluated using the Multi-Camera People Tracking Dataset from the 8th AI City Challenge Workshop, held in conjunction with CVPR 2024. This dataset is the largest in the field, including 953 cameras, 2,491 people, and over 100M bounding boxes, divided into 90 subsets. The dataset's videos total 212 minutes, captured in high definition (1080p) at 30 frames per second.

The NVIDIA approach achieved an impressive HOTA score of 68.7%, ranking second among 19 international teams (Figure 9). 

Snapshot of the leaderboard, highlighting the NVIDIA workflow’s competitive edge against the state-of-the-art accuracy.
Figure 9. Benchmark leaderboard for MTMC tracking at the 2024 AI City Challenge

This benchmark focuses only on accuracy in batch mode, where the application has access to entire videos. In online or streaming operating conditions, the application can only access historical data, not future frames. This may make some submitted approaches impractical for real deployment, or require significant re-architecting. Factors not considered in the benchmark include the following:

  • Latency from input to prediction
  • Runtime throughput (how many streams can run given a compute platform or budget)
  • Deployability
  • Scalability

As a result, most teams did not have to optimize for these factors.

In contrast, Multi-Camera Tracking, as part of Metropolis microservices, must consider and optimize for all these factors in addition to accuracy for real-time, scalable, multi-camera tracking to be deployed in production use cases.

One-click microservices deployment 

Metropolis microservices support one-click deployment on AWS, Azure, and GCP. The deployment artifacts and instructions are downloadable on NGC, so you can quickly bring up an end-to-end MTMC application on your own cloud account by providing a few prerequisite parameters. Each workflow is packaged with a Compose file, enabling deployment with Docker Compose as well.

For edge-to-cloud camera streaming, cameras at the edge can be connected to a Metropolis application running in any of the CSPs for analytics, using a Media Management client (VST Proxy) running at the edge.

This streamlined deployment process empowers you to rapidly prototype, test, and scale your vision AI applications across various cloud platforms, reducing the time and effort required to bring solutions to production.

Simulation and synthetic data generation with Isaac Sim

Training AI models for specific use cases demands diverse, labeled datasets, often costly and time-intensive to collect. Synthetic data, generated through computer simulations, offers a cost-effective alternative that reduces training time and expenses. 

Simulation and synthetic data play a crucial role in the modern vision AI development cycle:

  • Generating synthetic data and combining it with real data to improve model accuracy and generalizability
  • Helping develop and validate applications with multi-camera tracking and analytics
  • Adjusting deployment environments like proposing optimized camera angles or coverage
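As a sketch of the first point, a training loop can blend real and synthetic samples at a fixed ratio. The 25% synthetic fraction below is purely an illustrative assumption; the right mix depends on the model and the domain gap between simulation and deployment.

```python
import random

def mixed_batches(real, synthetic, synth_fraction=0.25, batch_size=8, seed=0):
    """Yield training batches that blend real and synthetic samples.

    The synthetic fraction is an illustrative assumption, not a
    recommended value; tune it per model and dataset.
    """
    rng = random.Random(seed)
    n_synth = round(batch_size * synth_fraction)
    while True:
        # Draw from each pool, then shuffle so ordering carries no signal
        batch = rng.sample(synthetic, n_synth) + rng.sample(real, batch_size - n_synth)
        rng.shuffle(batch)
        yield batch
```

In practice, the pools would hold labeled images rather than strings, with the synthetic side produced by the SDG pipeline described next.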

NVIDIA Isaac Sim integrates seamlessly into the synthetic data generation (SDG) pipeline, providing a sophisticated companion to enhance AI model training and end-to-end application design and validation. You can generate synthetic data across a wide range of applications, from robotics and industrial automation to smart cities and retail analytics.

Diagram shows a workflow of 3D assets, scene generation, procedural scenario generation, batch generation of annotated synthetic data, AI model training and optimization, and inference. The last four steps use real or ground truth data and an accuracy metric must be met for scenario generation through model training.
Figure 10. Create a synthetic dataset for AI training with NVIDIA Isaac Sim

The Omni.Replicator.Agent (ORA) extension in Isaac Sim streamlines the simulation of agents like people and autonomous mobile robots (AMRs), and the generation of synthetic data from scenes containing them.

ORA offers GPU-accelerated solutions with default environments, assets, and animations, supporting custom integration. It includes an automated camera calibration feature, producing calibration information compatible with the workflows in Metropolis microservices, such as the Multi-Camera Tracking (MTMC) workflow described earlier in this post.

Warehouse image shows different assets such as people with different attributes, warehouse aisles at particular camera angles, and a forklift or AMR.
Figure 11. Scene created with ORA Extension

AI model training and fine-tuning with TAO Toolkit

Metropolis microservices employ both CNN-based and Transformer-based models, which are initially pretrained on real datasets and augmented with synthetic data for more robust generalization and better handling of rare cases.

  • CNN-based models:
    • PeopleNet: Based on NVIDIA DetectNet_v2 architecture. Pretrained on over 7.6M images with more than 71M person objects.
    • ReidentificationNet: Uses a ResNet-50 backbone. Trained on a combination of real and synthetic datasets, including 751 unique IDs from the Market-1501 dataset and 156 unique IDs from the MTMC people tracking dataset.
  • Transformer-based models:
    • PeopleNet-Transformer: Uses the DINO object detector with a FAN-Small feature extractor. Pretrained on the OpenImages dataset and fine-tuned on a proprietary dataset with over 1.5M images and 27M person objects.
    • ReID Transformer model: Employs a Swin backbone and incorporates self-supervised learning techniques such as SOLIDER to generate robust human representations for person re-identification. The pretraining dataset includes a combination of proprietary and open datasets such as Open Image V5, with a total of 14,392 synthetic images with 156 unique IDs and 67,563 real images with 4,470 IDs.

In addition to directly using these models, you can use the NVIDIA TAO Toolkit to easily fine-tune on custom datasets for improved accuracy and optimize newly trained models for inference throughput on practically any platform. The TAO Toolkit is built on TensorFlow and PyTorch.

Diagram starts with selecting pretrained models from either NGC or an ONNX model zoo, using TAO REST APIs to train and optimize these models with your custom dataset, and then deploying these models on hardware such as GPUs, CPUs, MCUs, or DLAs.
Figure 12. NVIDIA TAO Toolkit architecture

Automated accuracy tuning with PipeTuner

PipeTuner is a new developer tool designed to simplify the tuning of AI pipelines. 

AI services typically incorporate a wide array of parameters for inference and tracking, and finding the optimal settings to maximize accuracy for specific use cases can be challenging. Manual tuning requires deep knowledge of each pipeline module and becomes impractical with extensive, high-dimensional parameter spaces.

PipeTuner addresses these challenges by automating the process of identifying the best parameters to achieve the highest possible key performance indicators (KPIs) based on the dataset provided. By efficiently exploring the parameter space, PipeTuner simplifies the optimization process, making it accessible without requiring technical knowledge of the pipeline and its parameters.
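Conceptually, this kind of tuning is a search over a high-dimensional parameter space for the settings that maximize a KPI such as HOTA or IDF1. The sketch below illustrates the idea with plain random search; PipeTuner itself explores the space more efficiently, and the parameter names here are hypothetical.

```python
import random

def random_search(param_space, evaluate, n_trials=200, seed=0):
    """Sample parameter sets uniformly and keep the best-scoring one.

    A conceptual sketch of automated parameter search, not PipeTuner's
    actual algorithm; `evaluate` stands in for running the pipeline on
    a labeled dataset and scoring a KPI such as HOTA or IDF1.
    """
    rng = random.Random(seed)
    best_params, best_kpi = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in param_space.items()}
        kpi = evaluate(params)
        if kpi > best_kpi:
            best_params, best_kpi = params, kpi
    return best_params, best_kpi
```

Even this naive strategy shows why automation helps: the search loop needs no knowledge of what each parameter does, only a dataset and a KPI to score against.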

Diagram shows taking an application pipeline and a dataset with labels from the user, performing a parameter search with Perception and MTMC, executing with DeepStream Perception and MTMC Analytics, and evaluating with IDF1 and HOTA.
Figure 13. NVIDIA PipeTuner Toolkit workflow


Metropolis microservices simplify and accelerate the process of prototyping, building, testing, and scaling deployments from edge to cloud, offering improved resilience and security. The microservices are flexible, easy to configure with zero coding, and packaged with efficient CNN and transformer-based models to fit your requirements. Deploy entire end-to-end workflows with a few clicks to the public cloud or in production.

You can create powerful, real-time multi-camera AI solutions with ease using NVIDIA Isaac Sim, NVIDIA TAO Toolkit, PipeTuner, and NVIDIA Metropolis microservices. This comprehensive platform empowers your business to unlock valuable insights and optimize your spaces and processes across a wide range of industries.
