A relatively new company, Fyma turns video into data: more specifically, movement data in physical space. The Fyma platform consumes customers’ live video streams all day, every day, and produces movement events (someone walking through a doorway or down a store aisle, for example).
One of the early lessons they learned is that their video-processing pipeline has to be simple, extensible, and performant all at the same time. With limited development resources, in the beginning they could achieve only one of the three. NVIDIA DeepStream has recently unlocked the ability to have all three simultaneously by shortening development times, increasing performance, and building on proven software components such as GStreamer.
Challenges with live video streaming
Fyma is focused on consuming live video streams to ease implementation for their customers. Customers can be hesitant to implement sensors or any additional hardware on their premises, as they have already invested in security cameras. Since these cameras can be anywhere, Fyma can provide different object detection models to maximize accuracy in different environments.
Consuming live video streams is challenging in multiple aspects:
- Cameras sometimes produce broken video (presentation/decoding timestamps jump, reported framerate is wrong)
- Network issues cause video streams to freeze, stutter, jump, go offline
- CPU/memory load distribution and planning isn’t straightforward
- Live video stream is infinite
The infinite nature of live video streams means that Fyma’s platform must perform computer vision at least as quickly as frames arrive. Basically, the whole pipeline must work in real time. Otherwise, frames would accumulate endlessly.
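The arithmetic behind this constraint is simple. As a sketch with hypothetical numbers (a camera delivering 25 fps into a pipeline that can only process 20 fps), queuing rather than dropping frames leads to unbounded growth:

```python
# Hypothetical figures, for illustration only: frames arrive faster
# than the pipeline can process them.
arrival_fps = 25
processing_fps = 20

# If surplus frames are queued instead of dropped, the backlog grows
# linearly and never shrinks.
backlog_per_second = arrival_fps - processing_fps
backlog_after_one_hour = backlog_per_second * 3600
print(backlog_after_one_hour)  # 18000 frames queued after one hour
```

This is why every stage of the pipeline, not just inference, has to keep up with the frame rate of the slowest-case stream.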
Luckily, object detection has steadily improved in the last few years in terms of speed and accuracy. This means being able to detect objects from more than 1,000 images per second with mAP over 90%. Such advancements have enabled Fyma to provide computer vision at scale at a reasonable price to their customers.
Providing physical space analytics using computer vision (especially in real time) involves a lot more than just object detection. According to Kaarel Kivistik, Head of Software Development at Fyma, “To actually make something out of these objects we need to track them between frames and use some kind of component to analyze the behavior as well. Considering that each customer can choose their own model, set up their own analytics, and generate reports from gathered data, a simple video processing pipeline becomes a behemoth of a platform.”
Version 1: Hello world
Fyma began by coupling OpenCV and ffmpeg to a very simple Python application. Nothing was hardware-accelerated except their neural network. They were using YOLOv3 and Darknet at the time. Performance was poor, around 50-60 frames per second, despite their use of an AWS g4dn.xlarge instance with an NVIDIA Tesla T4 GPU (which they continue to use). The application functioned like this:
- OpenCV for capturing the video
- Darknet with Python bindings to detect objects
- Homemade IoU-based multi-object tracker
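Fyma’s tracker itself isn’t public, but the idea behind an IoU-based multi-object tracker is straightforward: each new detection joins the existing track whose last bounding box overlaps it most, or starts a new track. A minimal sketch (all names and the 0.3 threshold are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

class IouTracker:
    """Greedy frame-to-frame matcher: no appearance features,
    just bounding-box overlap between consecutive frames."""
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.tracks = {}   # track id -> last seen box
        self.next_id = 0

    def update(self, detections):
        assigned = {}
        unmatched = set(self.tracks)
        for box in detections:
            best_id, best_iou = None, self.threshold
            for tid in unmatched:
                score = iou(self.tracks[tid], box)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:          # no overlap: start a new track
                best_id = self.next_id
                self.next_id += 1
            else:
                unmatched.discard(best_id)
            self.tracks[best_id] = box
            assigned[best_id] = box
        for tid in unmatched:            # track lost: drop it
            del self.tracks[tid]
        return assigned
```

The weakness of this approach, which motivated the later move to DeepSORT, is that a track is lost the moment overlap drops below the threshold, for example during occlusion or a skipped frame.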
While the implementation was fairly simple, it was not enough to scale. The poor performance was caused by three factors:
- Software video decoding
- Copying decoded video frames between processes and between CPU/GPU memory
- Software encoding the output while drawing detections on it
They worked to improve the first version with hardware video decoding and encoding. At the time, that didn’t increase overall speed by much since they still copied decoded frames from GPU to CPU memory and then back to GPU memory.
Version 2: Custom ffmpeg encoder
A real breakthrough in terms of speed came with a custom ffmpeg encoder, which was basically a wrapper around Darknet turning video frames into detected objects. Frame rates increased tenfold since they were now decoding on hardware without copying video frames between host and device memory.
But that speedup came at a cost: part of their application was now written in C and carried the added complexity of ffmpeg and its intricate build system. Still, their new component didn’t need much changing and proved to be quite reliable.
One downside to this system was that they were now constrained to using Darknet.
Version 2.1: DeepSORT
To improve object tracking accuracy, Fyma replaced the homemade IoU-based tracker with DeepSORT. The results were good, but they needed to change their custom encoder to output, in addition to bounding boxes, the visual appearance features of objects that DeepSORT requires for tracking.
Bringing in DeepSORT improved accuracy, but created another problem: depending on the video content, it sometimes used a lot of CPU memory. To mitigate this problem, the team resorted to “asynchronous tracking.” Essentially a worker-based approach, it involved each worker consuming metadata consisting of bounding boxes and producing events about object movement. While this resolved the problem of uneven CPU usage, once again it made the overall architecture more complex.
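The worker idea can be sketched in a few lines. In this hypothetical example (the event schema and the line-crossing rule are illustrative, not Fyma’s actual analytics), a worker drains a queue of per-frame bounding-box metadata and emits a movement event whenever a tracked object’s center crosses a virtual line:

```python
from queue import Queue

LINE_X = 50  # illustrative virtual line at x = 50

def centre_x(box):
    x1, _, x2, _ = box
    return (x1 + x2) / 2

def worker(metadata_queue):
    """Consume {track_id: box} frames; emit (track_id, 'crossed')
    whenever a track's centre changes sides of the line."""
    last_side = {}
    events = []
    while not metadata_queue.empty():
        frame = metadata_queue.get()
        for tid, box in frame.items():
            side = centre_x(box) >= LINE_X
            if tid in last_side and side != last_side[tid]:
                events.append((tid, "crossed"))
            last_side[tid] = side
    return events

q = Queue()
q.put({7: (10, 0, 20, 10)})   # centre x = 15, left of the line
q.put({7: (60, 0, 70, 10)})   # centre x = 65, right of the line
print(worker(q))              # [(7, 'crossed')]
```

Decoupling the analytics from the video pipeline like this evens out CPU load, but it also means events are produced whenever a worker gets to them, not when the frame was seen, which is the latency problem the DeepStream rewrite later eliminated.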
Version 3: Triton Inference Server
While previous versions performed well, Fyma found that they still couldn’t run enough cameras on each GPU. Each video stream on their platform had an individual copy of whatever model it was using. If they could reduce the memory footprint of a single camera, it would be possible to squeeze a lot more out of their GPU instances.
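Back-of-the-envelope arithmetic shows why sharing a model matters. The figures below are made up for illustration (they are not Fyma’s measurements): with one copy of the weights per stream, the model dominates per-camera memory; with a shared copy, only the per-stream buffers scale with camera count.

```python
# Illustrative figures only -- not Fyma's actual numbers.
model_mb = 500        # one copy of the detector's weights
per_stream_mb = 50    # decode buffers and working memory per camera
gpu_mb = 16000        # e.g. a 16 GB NVIDIA T4

# One model copy per stream (versions 1-2):
streams_dedicated = gpu_mb // (model_mb + per_stream_mb)

# One shared model served to all streams (version 3):
streams_shared = (gpu_mb - model_mb) // per_stream_mb

print(streams_dedicated, streams_shared)  # 29 vs 310 streams per GPU
```

Even with generous assumptions, the dedicated-copy layout caps cameras per GPU at a fraction of what a shared model allows, which is exactly the motivation for bringing in an inference server.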
Fyma decided to rewrite the ffmpeg-related parts of their application. More specifically, the application now interfaces with ffmpeg libraries (libav) directly through custom Python bindings.
This allowed Fyma to connect their application to NVIDIA Triton Inference Server which enabled sharing neural networks between camera streams. To keep the core of their object detection code the same, they moved their custom ffmpeg encoder code to a custom Triton backend.
While this solved the memory issues, it increased the complexity of Fyma’s application by at least three times.
Version 4: DeepStream
The latest version of Fyma’s application is a complete rewrite based on GStreamer and NVIDIA DeepStream.
“A pipeline-based approach with accelerated DeepStream components is what really kicked us into gear,” Kivistik said. “Also, the joy of throwing all the previous C-based stuff into the recycle bin while not compromising on performance, it’s really incredible. We took everything that DeepStream offers: decoding, encoding, inference, tracking and analytics. We were back to synchronous tracking with a steady CPU/GPU usage thanks to nvtracker.”
This meant events were now arriving in their database in almost real time. Previously, this data would be delayed up to a few hours, depending on how many workers were present and the general “visual” load (how many objects the whole platform was seeing).
Fyma’s current implementation runs a master process for each GPU instance. This master process in turn runs a GStreamer pipeline for each video stream added to the platform. Memory overhead for each camera is low since everything runs in a single process.
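A pipeline of the kind described, written in gst-launch form, might look roughly like the following. This is a sketch only: the element names (uridecodebin, nvstreammux, nvinfer, nvtracker, nvdsanalytics) are stock GStreamer/DeepStream plugins, but the URI, properties, and config file paths are placeholders, not Fyma’s actual configuration.

```shell
# Illustrative single-stream DeepStream pipeline:
# decode -> batch -> inference -> tracking -> analytics
gst-launch-1.0 \
  uridecodebin uri=rtsp://camera.example/stream ! \
  m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! \
  nvinfer config-file-path=detector_config.txt ! \
  nvtracker ll-lib-file=libnvds_nvmultiobjecttracker.so ! \
  nvdsanalytics config-file=analytics_config.txt ! \
  fakesink
```

In production, one such pipeline per camera runs inside the shared master process, with nvstreammux batching frames so the model is shared across streams rather than duplicated.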
Regarding end-to-end performance (decoding, inference, tracking, analytics) Fyma is achieving frame rates up to 10x faster (around 500 fps for a single video stream) with accuracy improved up to 2-3x in comparison to their very first implementation. And Fyma was able to implement DeepStream in less than two months.
“I think we can finally say that we now have simplicity, with a codebase that is not that large; extensibility, since we can easily switch out models and change the video pipeline; and performance,” Kivistik said.
“Using DeepStream really is a no-brainer for every software developer or data scientist who wants to create production-grade computer vision applications.”
Using NVIDIA DeepStream, Fyma was able to unlock the power of its AI models and increase the performance of its vision AI applications while speeding up development time. If you would like to do the same and supercharge your development, visit the DeepStream SDK product page and DeepStream Getting Started.