
State-of-the-Art Real-time Multi-Object Trackers with NVIDIA DeepStream SDK 6.2

When you observe something over a period of time, you can find trends or patterns that enable predictions. With predictions, you can, for example, proactively alert yourself to take appropriate action.

More specifically, when you observe moving objects, the trajectory is one of the most important ways to understand the target object behavior, through which you can gain actionable insights (Video 1).

Video 1. Cheetah chasing its prey (Source: Adobe)

When the prey is distant, its longer-term motion pattern is more important for the cheetah to plan the game. As it gets closer to the prey, the shorter-term motion pattern of the prey matters more when making predictions. The cheetah’s vision system is locked on the prey and keeps track of it until the game is over.

Generating the trajectory of an object requires identifying the same object over time, even when there are abrupt changes in visual appearance or motion dynamics. This becomes increasingly difficult when partial or full occlusions are involved. Even predators or humans, who have a great vision system, often lose track of targets in extremely challenging situations where there are extended periods of occlusion or visually compelling objects nearby (Figure 1).

GIF of a crowd moving through a triangle of pedestrian walkways in an intersection.
Figure 1. Shibuya Crossing in Tokyo (Source: Adobe)

Object trajectories are a key enabler of many vision AI applications as well, such as manufacturing inspection and in-store shopper behavior analytics for checkout queue analysis or store layout optimization.

The NVIDIA DeepStream SDK offers GPU-accelerated multi-object trackers (MOT). In the latest DeepStream SDK 6.2 release, the multi-object trackers add significant improvements to tackle challenging occlusion issues effectively. They do this by leveraging deep neural network–based re-identification (ReID) models for target matching and association.

NvDCF can now be configured to use a ReID model to improve the association of the same object over a longer period, even when the object undergoes prolonged occlusion or missed detections. NvDCF is still based on a discriminative correlation filter (DCF) approach for robust and efficient short-term tracking. It now adds neural network–based, longer-term target re-association, balancing efficiency, accuracy, and robustness in multi-object tracking.

NvDeepSORT performs the target association for tracking across video frames using the deep features extracted from a ReID model executed with NVIDIA TensorRT. It enables you to use custom ReID models so that you can bring your own model for multi-object tracking.
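To make the idea of ReID-based association concrete, here is a minimal sketch that matches tracked targets to new detections by cosine similarity between feature vectors. It is not the NvDeepSORT implementation: the three-element feature vectors, the `min_similarity` threshold, and the greedy matching strategy are all hypothetical simplifications chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_by_reid(track_features, detection_features, min_similarity=0.7):
    """Greedily pair each track ID with its most similar unused detection.

    track_features: dict of track ID -> feature vector (hypothetical embeddings)
    detection_features: list of feature vectors, one per detection
    Returns a dict of track ID -> detection index.
    """
    candidates = []
    for track_id, track_feat in track_features.items():
        for det_idx, det_feat in enumerate(detection_features):
            sim = cosine_similarity(track_feat, det_feat)
            if sim >= min_similarity:  # hypothetical threshold
                candidates.append((sim, track_id, det_idx))

    # Take the most similar pairs first, using each track and detection once.
    candidates.sort(reverse=True)
    matches, used_detections = {}, set()
    for _, track_id, det_idx in candidates:
        if track_id in matches or det_idx in used_detections:
            continue
        matches[track_id] = det_idx
        used_detections.add(det_idx)
    return matches


# Toy features standing in for ReID model outputs.
tracks = {3: [0.9, 0.1, 0.0], 7: [0.0, 0.2, 0.9]}
detections = [[0.1, 0.1, 1.0], [1.0, 0.0, 0.1], [0.5, 0.5, 0.5]]
matches = match_by_reid(tracks, detections)
print(matches)
```

In this toy input, the third detection is not similar enough to either track, so it stays unmatched and would be a candidate for a new target.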

NvSORT is the NVIDIA-enhanced implementation of the simple online and real-time tracker (SORT) that uses a Kalman filter for state estimation and a data association algorithm for target association based on the target bounding boxes from the detector. NvSORT uses a cascaded data association algorithm for more robust target matching, which is an enhancement to the original SORT.
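As a rough illustration of Kalman-filter state estimation, the sketch below tracks a single coordinate (for example, a bounding box center along x) with a constant-velocity model. It is a simplified, one-dimensional stand-in and not the NvSORT implementation; the class name, the noise variances, and the initial covariance are assumptions made for this example.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state [position, velocity], measuring position only."""

    def __init__(self, position, process_var=1e-2, measurement_var=1.0):
        self.x = [position, 0.0]            # state estimate
        self.P = [[1.0, 0.0], [0.0, 10.0]]  # state covariance (velocity unknown)
        self.q = process_var                # assumed process noise variance
        self.r = measurement_var            # assumed measurement noise variance

    def predict(self, dt=1.0):
        """Propagate the state one step ahead and return the predicted position."""
        p, v = self.x
        self.x = [p + v * dt, v]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.x[0]

    def update(self, z):
        """Correct the state with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r             # innovation variance
        k0, k1 = a / s, c / s      # Kalman gain
        y = z - self.x[0]          # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P, with H = [1, 0]
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]


# A target moving 2 units per frame, observed without noise for 20 frames.
kf = ConstantVelocityKalman1D(position=0.0)
for frame in range(1, 21):
    kf.predict()
    kf.update(2.0 * frame)

predicted_next = kf.predict()  # where the target is expected in frame 21
print(kf.x[1], predicted_next)
```

The predicted position is what a data association step would compare against the detector's bounding boxes in the next frame.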

Thanks to the unified tracker architecture in the DeepStream SDK, the enhanced cascaded data association algorithm is used in all other types of MOTs as well: NvDCF and NvDeepSORT. The unified tracker architecture enables you to configure any object tracker through the tracker configuration file for the NvMultiObjectTracker library by enabling and disabling individual modules, depending on the tracker type of choice. You can also use NvDsTracker to build custom trackers.

The multi-object tracker portfolio offered in DeepStream 6.2 is summarized in Table 1.

| Tracker type | Description |
| --- | --- |
| NvSORT | A lightweight, CPU-only implementation but still competitively accurate, thanks to its enhanced data association. |
| NvDeepSORT | Enables you to use a public ReID model or bring your own model (BYOM) for robust target association on top of NvSORT. |
| NvDCF | Produces the best accuracy and robustness by combining conventional machine learning (DCF) and deep learning (ReID) in a deliberate manner. Enables skipping frames for inference (detection interval > 0) while still tracking objects in every frame, resulting in an overall pipeline performance boost with high accuracy. |
Table 1. Multi-object trackers offered in DeepStream SDK 6.2
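The frame-skipping behavior described for NvDCF can be sketched as a loop in which the detector runs only on some frames while the tracker still emits an estimate for every frame. This is a toy model, not DeepStream code: the function name is hypothetical, the "detector" simply returns the true position, and constant-velocity extrapolation stands in for the DCF-based visual tracker. It also assumes that a detection interval of N means N frames are skipped between detector runs.

```python
def track_with_skipped_detection(true_positions, detection_interval):
    """Return a per-frame position estimate and the number of detector calls."""
    estimates = []
    detector_calls = 0
    position, velocity = None, 0.0
    last_det_pos, last_det_frame = None, None

    for frame_idx, true_pos in enumerate(true_positions):
        if frame_idx % (detection_interval + 1) == 0:
            # Detection frame: run the (stand-in) detector and correct the track.
            detector_calls += 1
            detection = true_pos
            if last_det_pos is not None:
                velocity = (detection - last_det_pos) / (frame_idx - last_det_frame)
            position = detection
            last_det_pos, last_det_frame = detection, frame_idx
        else:
            # Skipped frame: no detection, so the tracker extrapolates on its own.
            position = position + velocity
        estimates.append(position)

    return estimates, detector_calls


# A target moving 5 units per frame for 10 frames, detector run every 3rd frame.
true_positions = [5.0 * i for i in range(10)]
estimates, detector_calls = track_with_skipped_detection(true_positions, detection_interval=2)
print(detector_calls, estimates)
```

Here the detector runs on 4 of the 10 frames, yet the tracker produces 10 estimates. The first two skipped frames are inaccurate because no velocity is known until the second detection arrives.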

Pedestrian tracking

Now it’s time to generate the trajectories of objects in some interesting scenes using these object trackers and see how they look.

First, we show a pedestrian tracking use case. We used the PeopleNet v2.6 detector with different types of object trackers. Both the detector config parameters and the tracker parameters are tuned for PeopleNet v2.6. For more information, see DeepStream 6.2 Object Tracker documentation.

In Video 2, the real-time perception results are shown in the clockwise direction for PeopleNet-only, NvSORT, NvDeepSORT, and NvDCF for side-by-side comparison. The trajectory and bounding box (bbox) of each person are drawn in a different color for easier identification, and the color-coded trajectories are shown only while the corresponding object is present in the scene. The video was captured at 0.5x speed for easier comparison, but the actual data was produced in real time.

The label on top of each bbox (for example, [21]:80 (0.24)) shows the person ID (21), the tracking age (80), and the tracking confidence (0.24), respectively. The detector config parameters are configured differently for each tracker type for better tracking accuracy.
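If you record these labels as text for later analysis (an assumption; the post shows them only as on-screen overlays), the three fields can be pulled apart with a small parser. The `TrackLabel` type and `parse_track_label` function below are hypothetical helpers, not part of DeepStream.

```python
import re
from typing import NamedTuple


class TrackLabel(NamedTuple):
    object_id: int
    age: int
    confidence: float


# Matches labels of the form "[ID]:AGE (CONFIDENCE)", such as "[21]:80 (0.24)".
_LABEL_RE = re.compile(r"\[(\d+)\]:(\d+)\s*\(([0-9]*\.?[0-9]+)\)")


def parse_track_label(text):
    """Split an overlay label into object ID, tracking age, and confidence."""
    match = _LABEL_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized label format: {text!r}")
    return TrackLabel(int(match.group(1)), int(match.group(2)), float(match.group(3)))


label = parse_track_label("[21]:80 (0.24)")
print(label)
```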

Video 2. Pedestrian tracking in an indoor lobby

The side-by-side visual comparison gives you qualitative insight into the behavior and quality of the different object tracker types when the same detection model is used. The NvDCF tracker has the highest accuracy, so we show its tracking results more closely in Video 3.

Video 3. Pedestrian tracking by the NvDCF tracker with full occlusions

This scene has a large pillar at the center that fully occludes the people walking behind it for an extended period. This is a particularly challenging scenario for any object detection or tracking system. Detection errors such as partial detections, double detections, and missed detections (Figure 2) may occur quite frequently, depending on the background and the physical environment.

Four snapshots show a bounding box that's too large, a bbox with two people, and missed detection in groups of three (one standing, one sitting).
Figure 2. Sample errors in detection: bounding boxes with two people, missed detection, and partial detection

Thanks to the newly introduced ReID-based target re-association, however, the NvDCF tracker can track most of the people successfully (Video 4), even when they undergo such full occlusions behind the pillar. It not only re-associates the same objects before and after they pass behind the large pillar, even after many frames, but also recovers the missed detections (false negatives) caused by the pillar.

A more challenging scenario is one where, in addition to such environmental occlusions, there are many occlusions by other targets (Video 4). Such occlusions create varying degrees of partial occlusion, where the size and aspect ratio of the detected target bounding boxes change significantly over short durations. This poses challenges for target matching and association in tracking.
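To see why such changes make matching harder, consider intersection over union (IoU), a common bounding box overlap measure. The sketch below uses made-up box coordinates to compare a full-body box with a box covering only the upper half of the same person, as might be detected under partial occlusion. It illustrates the general effect only and does not reflect how the DeepStream trackers score matches.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0.0 else 0.0


def aspect_ratio(box):
    """Width divided by height of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x2 - x1) / (y2 - y1)


# Hypothetical boxes: the same person fully visible, then with the lower half hidden.
full_body = (100.0, 50.0, 140.0, 150.0)    # 40 wide, 100 tall
upper_half = (100.0, 50.0, 140.0, 100.0)   # 40 wide, 50 tall

overlap = iou(full_body, upper_half)
print(overlap, aspect_ratio(full_body), aspect_ratio(upper_half))
```

Even though the person has not moved, the overlap drops to 0.5 and the aspect ratio doubles from 0.4 to 0.8, so a matcher that relies on box geometry alone can easily lose the association.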

Video 4. Pedestrian tracking by NvDCF tracker with full and partial occlusions

Despite those challenges, you can see that the NvDCF tracker performs robust tracking in most cases with only a few ID switches. After a target leaves the scene, the target tracking is configured to terminate right away. Some targets that leave the scene in Video 4 are assigned different IDs after re-entry, which is expected.

Take a closer look at the subject, Object ID [3], shown in Figure 3 and Video 4. It undergoes severe full and partial occlusions multiple times throughout the entire trip, but it is tracked all the way from the beginning until it exits the scene.

Four snapshots show a person with a white shirt and tracking label, the same person in a group of three heading away from the camera; the same person behind another; and the same person heading toward the camera with two others.
Figure 3. Four snapshots from Video 4 of the person in a white shirt labeled Object ID [3]

Video 5 shows the target template used inside the tracker where the features are extracted and the correlation response map for the same target. The purple ‘x’ marks show the nearby target locations, while the yellow ‘+’ mark shows the current target location.

Video 5. (Left) Image template used by the tracker; (right) correlation response around the target

These results were generated using a relatively simple ResNet-10–based ReID model. To get even better results, we encourage you to try a more advanced custom ReID model of your choice.

The NvDCF tracker in DeepStream 6.2 is a state-of-the-art multi-object tracker that offers a great balance of accuracy and performance. Check out the MOT17 Challenge leaderboard where many trackers are actively being submitted from both academia and industry. The NvDCF tracker, shown as NvMOT_DSv62 in the MOT17 leaderboard, is one of the top trackers among online trackers that generate outputs in real time.

Vehicle tracking

For a vehicle tracking use case, we used the TrafficCamNet detector with the DeepStream multi-object trackers. We used a scene from a typical vehicle traffic monitoring system, which overlooks a busy intersection. Small and large light poles and traffic signal poles are present, creating a high number of occlusions. The occlusion issue is further exacerbated by the relatively shallow camera vantage point, which causes many occlusions by other vehicles. Vegetation along the road adds to the complexity of the scene.

Vehicles undergo partial and full occlusions due to the traffic poles and trees, which results in a large number of missed and wrong detections. You can see how different types of object trackers handle these challenging situations in the side-by-side video (Video 6). The video was captured at 0.4x speed, but the actual data was produced in real time.

Video 6. Vehicle tracking in a busy intersection

In the top-left corner of the video, where the object bboxes from the TrafficCamNet detector are shown, you may notice detection noise. This includes jitter in detected bboxes, double detections that capture more than one object in a single bbox, partial detections due to occlusions, and so on.

When vehicles undergo occlusions behind traffic poles, these detection errors and noise become more severe. To see how the DeepStream multi-object trackers perform with these noisy detections, see Video 6, and look more closely at the tracking results of the NvDCF tracker in Video 7.

Video 7. Vehicle tracking by the NvDCF tracker


We encourage you to download DeepStream SDK 6.2 and try it out to enjoy the robust and efficient multi-object trackers for your use cases! For more information about the fundamentals of multi-object trackers, see the NVIDIA DeepStream Technical Deep Dive: Multi-Object Tracker video.
