
Unlocking a Simple, Extensible, and Performant Video Pipeline at Fyma with NVIDIA DeepStream

Aug 10, 2022

By Alvin Clark and Kaarel Kivistik



Providing computer vision in the cloud and at scale is a complex task. Fyma, a computer vision company, is tackling this complexity with the help of NVIDIA DeepStream.

A relatively new company, Fyma turns video into data: more specifically, movement data in physical space. The Fyma platform consumes customers’ live video streams all day, every day, and produces movement events (someone walking through a doorway or down a store aisle, for example).

One of the early lessons the team learned is that a video-processing pipeline has to be simple, extensible, and performant all at the same time. With limited development resources, in the beginning they could have only one of those three. NVIDIA DeepStream has recently unlocked the ability to have all three simultaneously by shortening development times, increasing performance, and building on proven software components such as GStreamer.

Challenges with live video streaming

Fyma is focused on consuming live video streams to ease implementation for their customers. Customers can be hesitant to implement sensors or any additional hardware on their premises, as they have already invested in security cameras. Since these cameras can be anywhere, Fyma can provide different object detection models to maximize accuracy in different environments.

Consuming live video streams is challenging in multiple aspects:

Cameras sometimes produce broken video (presentation/decoding timestamps jump, or the reported frame rate is wrong)
Network issues cause video streams to freeze, stutter, jump, or go offline
CPU/memory load distribution and planning isn’t straightforward
A live video stream is infinite

The infinite nature of live video streams means that Fyma’s platform must perform computer vision at least as quickly as frames arrive. Basically, the whole pipeline must work in real time. Otherwise, frames would accumulate endlessly.
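The real-time constraint comes down to simple arithmetic. The following is a minimal sketch, not Fyma's code; the function name and the frame rates are hypothetical and chosen only to illustrate how a backlog grows when processing is slower than arrival.

```python
def backlog_after(seconds: float, arrival_fps: float, processing_fps: float) -> float:
    """Frames waiting to be processed after `seconds` of a live stream."""
    # Frames arrive at a fixed rate and the pipeline drains them at its own rate.
    # The backlog can never be negative: a faster pipeline simply idles.
    return max(0.0, float(arrival_fps - processing_fps) * seconds)


# Hypothetical numbers: a 25 fps camera observed for one hour.
slow = backlog_after(3600, arrival_fps=25, processing_fps=20)
fast = backlog_after(3600, arrival_fps=25, processing_fps=30)
print(slow)  # 18000.0 frames behind, and still growing
print(fast)  # 0.0, the pipeline keeps up
```

Because the stream never ends, any positive difference between the two rates eventually exhausts memory, which is why the whole pipeline, not just inference, has to keep pace.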

Luckily, object detection has steadily improved in the last few years in terms of speed and accuracy. Detectors can now process more than 1,000 images per second with a mean average precision (mAP) over 90%. Such advancements have enabled Fyma to provide computer vision at scale at a reasonable price to their customers.

Providing physical space analytics using computer vision (especially in real time) involves a lot more than just object detection. According to Kaarel Kivistik, Head of Software Development at Fyma, “To actually make something out of these objects we need to track them between frames and use some kind of component to analyze the behavior as well. Considering that each customer can choose their own model, set up their own analytics, and generate reports from gathered data, a simple video processing pipeline becomes a behemoth of a platform.”

Version 1: Hello world

Fyma began by coupling OpenCV and ffmpeg to a very simple Python application. Nothing was hardware-accelerated except the neural network; they were using YOLOv3 and Darknet at the time. Performance was poor, around 50-60 frames per second, despite their use of an AWS g4dn.xlarge instance with an NVIDIA Tesla T4 GPU (which they continue to use). The application functioned like this:

OpenCV for capturing the video
Darknet with Python bindings to detect objects
Homemade IoU-based multi-object tracker
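The article does not show Fyma's tracker, so the following is only a sketch of what an IoU-based multi-object tracker can look like. The class name, the greedy matching strategy, and the 0.3 threshold are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class IouTracker:
    threshold: float = 0.3  # hypothetical minimum overlap for a match
    tracks: dict[int, Box] = field(default_factory=dict)
    _ids: count = field(default_factory=count)

    def update(self, detections: list[Box]) -> dict[int, Box]:
        """Match this frame's detections to existing tracks by overlap."""
        # Score every (track, detection) pair, best overlap first.
        pairs = sorted(
            ((iou(box, det), tid, i)
             for tid, box in self.tracks.items()
             for i, det in enumerate(detections)),
            reverse=True,
        )
        assigned: dict[int, Box] = {}
        used: set[int] = set()
        for score, tid, i in pairs:
            if score < self.threshold:
                break  # remaining pairs overlap even less
            if tid in assigned or i in used:
                continue
            assigned[tid] = detections[i]
            used.add(i)
        # Unmatched detections start new tracks; unmatched tracks are dropped.
        for i, det in enumerate(detections):
            if i not in used:
                assigned[next(self._ids)] = det
        self.tracks = assigned
        return assigned


tracker = IouTracker()
first = tracker.update([(0, 0, 10, 10)])
second = tracker.update([(1, 0, 11, 10), (50, 50, 60, 60)])
print(first)   # {0: (0, 0, 10, 10)}
print(second)  # {0: (1, 0, 11, 10), 1: (50, 50, 60, 60)}
```

A tracker like this relies purely on box geometry, so it is cheap but tends to lose identities when objects overlap or a detection is missed for a frame.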

While the implementation was fairly simple, it did not scale. The poor performance was caused by three factors:

Software video decoding
Copying decoded video frames between processes and between CPU/GPU memory
Software encoding the output while drawing detections on it

They worked to improve the first version with hardware video decoding and encoding. At the time, that didn’t increase overall speed by much since they still copied decoded frames from GPU to CPU memory and then back to GPU memory.

Version 2: Custom ffmpeg encoder

A real breakthrough in terms of speed came with a custom ffmpeg encoder, which was basically a wrapper around Darknet turning video frames into detected objects. Frame rates increased tenfold since they were now decoding on hardware without copying video frames between host and device memory.

But that increase in frame rate came at a cost: part of their application was now written in C, with the added complexity of ffmpeg and its highly complex build system. Still, their new component didn’t need much changing and proved to be quite reliable.

One downside to this system was that they were now constrained to using Darknet.

Version 2.1: DeepSORT

To improve object tracking accuracy, Fyma replaced the homemade IoU-based tracker with DeepSORT. The results were good, but they needed to change their custom encoder to output the visual features of objects, which DeepSORT requires for tracking, in addition to bounding boxes.

Bringing in DeepSORT improved accuracy, but created another problem: depending on the video content it sometimes used a lot of CPU memory. To mitigate this problem, the team resorted to “asynchronous tracking.” Essentially a worker-based approach, it involved each worker consuming metadata consisting of bounding boxes, and producing events about object movement. While this resolved the problem of uneven CPU usage, once again it made the overall architecture more complex.
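The worker-based idea can be sketched with a queue that decouples detection from tracking. This is an illustration only: the queue layout, the sentinel, and the line-crossing rule (a stand-in for real tracking) are all assumptions, not Fyma's design.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker that tells a worker to stop


def tracking_worker(metadata: queue.Queue, events: queue.Queue, line_x: float) -> None:
    """Consume per-frame boxes and emit an event when the box centre crosses line_x."""
    last_x = None
    while True:
        item = metadata.get()
        if item is SENTINEL:
            break
        frame_no, (x1, _y1, x2, _y2) = item
        centre_x = (x1 + x2) / 2
        if last_x is not None and last_x < line_x <= centre_x:
            events.put({"frame": frame_no, "event": "crossed_line"})
        last_x = centre_x


metadata_q: queue.Queue = queue.Queue()
events_q: queue.Queue = queue.Queue()
worker = threading.Thread(target=tracking_worker, args=(metadata_q, events_q, 100.0))
worker.start()

# The detection stage only enqueues metadata and never waits for tracking.
boxes = [(80, 0, 100, 20), (90, 0, 110, 20), (100, 0, 120, 20)]
for frame_no, box in enumerate(boxes):
    metadata_q.put((frame_no, box))
metadata_q.put(SENTINEL)
worker.join()

events = []
while not events_q.empty():
    events.append(events_q.get_nowait())
print(events)  # [{'frame': 1, 'event': 'crossed_line'}]
```

The trade-off is visible even in this toy version: the producer stays fast, but events appear only once a worker gets to the queued metadata, so latency depends on how many workers are running and how much is queued.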

Version 3: Triton Inference Server

While previous versions performed well, Fyma found that they still couldn’t run enough cameras on each GPU. Each video stream on their platform had an individual copy of whatever model it was using. If they could reduce the memory footprint of a single camera, it would be possible to squeeze a lot more out of their GPU instances.

Fyma decided to rewrite the ffmpeg-related parts of their application. More specifically, the application now interfaces with ffmpeg libraries (libav) directly through custom Python bindings.

This allowed Fyma to connect their application to NVIDIA Triton Inference Server, which enabled sharing neural networks between camera streams. To keep the core of their object detection code the same, they moved their custom ffmpeg encoder code to a custom Triton backend.
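The benefit of sharing one model between streams can be illustrated with back-of-the-envelope arithmetic. The memory sizes below are hypothetical, picked only to make the calculation easy to follow; they are not measurements from Fyma's platform.

```python
def max_streams(gpu_mb: int, model_mb: int, per_stream_mb: int, shared_model: bool) -> int:
    """How many camera streams fit in GPU memory under each layout."""
    if shared_model:
        # One model copy serves every stream; each stream pays only for its buffers.
        return max(0, gpu_mb - model_mb) // per_stream_mb
    # Every stream carries its own copy of the model.
    return gpu_mb // (model_mb + per_stream_mb)


# Hypothetical sizes in megabytes.
per_stream_copy = max_streams(16_000, model_mb=1_000, per_stream_mb=250, shared_model=False)
one_shared_copy = max_streams(16_000, model_mb=1_000, per_stream_mb=250, shared_model=True)
print(per_stream_copy)  # 12
print(one_shared_copy)  # 60
```

Under these assumed numbers the model dominates the per-stream footprint, so removing the duplicate copies is what frees most of the memory.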

While this solved the memory issues, it at least tripled the complexity of Fyma’s application.

Version 4: DeepStream

The latest version of Fyma’s application is a complete rewrite based on GStreamer and NVIDIA DeepStream.

“A pipeline-based approach with accelerated DeepStream components is what really kicked us into gear,” Kivistik said. “Also, the joy of throwing all the previous C-based stuff into the recycle bin while not compromising on performance, it’s really incredible. We took everything that DeepStream offers: decoding, encoding, inference, tracking and analytics. We were back to synchronous tracking with a steady CPU/GPU usage thanks to nvtracker.”

This meant events were now arriving in their database in almost real time. Previously, this data would be delayed up to a few hours, depending on how many workers were present and the general “visual” load (how many objects the whole platform was seeing).

Fyma’s current implementation runs a master process for each GPU instance. This master process in turn runs a GStreamer pipeline for each video stream added to the platform. Memory overhead for each camera is low since everything runs in a single process.
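To give a feel for the pipeline-based approach, the sketch below builds a textual pipeline description per camera, in the style accepted by gst-launch-1.0 or Gst.parse_launch. The element and property names are DeepStream plugins as the editor understands them; the ordering, the resolution, every file path, and the camera URIs are hypothetical, and the description has not been run against a DeepStream installation.

```python
def pipeline_description(uri: str, infer_config: str, tracker_lib: str,
                         analytics_config: str) -> str:
    """Build a gst-launch style description for one camera stream."""
    # uridecodebin decodes the stream and feeds the muxer's first sink pad.
    source = f"uridecodebin uri={uri} ! mux.sink_0"
    chain = " ! ".join([
        "nvstreammux name=mux batch-size=1 width=1280 height=720",  # batching
        f"nvinfer config-file-path={infer_config}",                 # detection
        f"nvtracker ll-lib-file={tracker_lib}",                     # tracking
        f"nvdsanalytics config-file={analytics_config}",            # analytics
        "fakesink",  # no rendering; results travel as metadata
    ])
    return f"{source} {chain}"


# Hypothetical cameras handled by one master process on one GPU instance.
cameras = {
    "entrance": "rtsp://camera.example/entrance",
    "aisle-3": "rtsp://camera.example/aisle-3",
}
descriptions = {
    name: pipeline_description(uri, "detector.txt", "libtracker.so", "analytics.txt")
    for name, uri in cameras.items()
}
print(descriptions["entrance"])
```

Swapping the detection model in a layout like this means pointing nvinfer at a different configuration file, which matches the extensibility described in the article.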

Regarding end-to-end performance (decoding, inference, tracking, and analytics), Fyma is achieving frame rates up to 10x higher (around 500 fps for a single video stream), with accuracy improved by up to 2-3x in comparison to their very first implementation. Fyma was also able to implement DeepStream in less than two months.

“I think we can finally say that we now have simplicity with a codebase that is not that large, and extensibility since we can easily switch out models and change the video pipeline and performance,” Kivistik said.

“Using DeepStream really is a no-brainer for every software developer or data scientist who wants to create production-grade computer vision applications.”

Summary

Using NVIDIA DeepStream, Fyma was able to unlock the power of its AI models and increase the performance of its vision AI applications while speeding up development time. If you would like to do the same and supercharge your development, visit the DeepStream SDK product page and DeepStream Getting Started.


About the Authors

About Alvin Clark
Alvin Clark is a product marketing manager working on DeepStream. Alvin started his career as a design engineer before moving on to technical sales and marketing. He has worked with customers across multiple industries on applications ranging from satellite systems and surgical robots to deep-sea submersibles. Alvin holds an engineering degree from the University of California, San Diego, and is currently pursuing a master's degree at Georgia Tech.


About Kaarel Kivistik
Kaarel is heading the software department at Fyma. With over 10 years of experience in software development, he is well versed in multiple languages and environments. He has designed and engineered the way Fyma's platform works.
