Computer Vision / Video Analytics

Increasing Throughput and Reducing Costs for AI-Based Computer Vision with CV-CUDA

May 04, 2023

By Sagar Singh, Sandeep Hiremath, Naveen Kumar Rai and Dhaval Suthar

Discuss (3)

AI-Generated Summary

Dislike

CV-CUDA is an open-source library that enables efficient cloud-scale AI computer vision pipelines by providing GPU-accelerated computer vision and image-processing kernels.
Using CV-CUDA, along with NVIDIA Video Processing Framework (VPF) for decoding/encoding acceleration and TensorRT for inference optimization, can achieve up to 50x end-to-end throughput improvement with four NVIDIA L4 GPUs.
CV-CUDA achieves high performance through techniques like pre-allocated memory pooling, asynchronous operations, kernel fusion, and memory access efficiency, resulting in significant cost and energy savings for cloud-scale AI computer vision workloads.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Real-time cloud-scale applications that involve AI-based computer vision are growing rapidly. The use cases include image understanding, content creation, content moderation, mapping, recommender systems, and video conferencing.

However, the compute cost of these workloads is growing too, driven by demand for increased sophistication in the processing. The shift from still images to video is also now becoming the primary component of consumer Internet traffic. Given these trends, there is a critical need for building high-performance yet cost-effective computer vision workloads.

AI-based computer vision pipelines typically involve data pre- and post-processing steps around the AI inference model, which can be 50–80% of the entire workload. Common operators in these steps include the following:

Resizing
Cropping
Normalizing
Denoising
Tensor conversions

While developers may use NVIDIA GPUs to significantly accelerate the AI model inference in their pipelines, pre– and post-processing are still commonly implemented with CPU-based libraries. This leads to bottlenecks in the performance of the entire AI pipeline. The decoding and encoding processes that are typically part of an AI image or video processing pipeline can also be bottlenecked on the CPU, affecting the overall performance.

CV-CUDA optimization

CV-CUDA is an open-source library that enables you to build efficient cloud-scale AI computer vision pipelines. The library provides a specialized set of GPU-accelerated computer vision and image-processing kernels as standalone operators to easily implement highly efficient pre- and post-processing steps of the AI pipeline.

CV-CUDA can be used in a variety of common computer vision pipelines, such as image classification, object detection, segmentation, and image generation. For more information, see the keynote announcement at NVIDIA GTC Fall 2022.

In this post, we demonstrate the benefit of using CV-CUDA to enable end-to-end GPU acceleration for a typical AI computer vision workload achieving ~5x to up to 50x speedup in overall throughput. This can result in significant annual cloud cost savings on the order of hundreds of millions of USD, and hundreds of GWh of annual savings in energy consumption in data centers.

CPU bottlenecks addressed by GPU acceleration

CV-CUDA provides highly optimized, GPU-accelerated kernels as standalone operators for computer vision processing. These kernels can efficiently implement pre- and post-processing pipelines, resulting in significant throughput improvements.

Encode and decode operations can also be potential bottlenecks in the pipeline. With the optimized NVIDIA Video Processing Framework (VPF), you can efficiently optimize and run them as well. VPF is an open-source library from NVIDIA with Python bindings to C++ libraries. It provides full hardware acceleration for video decoding and encoding on GPUs.

To accelerate the entire end-to-end AI pipeline on the GPU (Figure 2), use CV-CUDA, along with VPF for decoding/encoding acceleration and TensorRT for further inference optimization. You can enable up to 50x end-to-end throughput improvement with four NVIDIA L4 GPUs, compared to CPU-based implementations in a typical pipeline.

The degree of improvement depends on factors such as the complexity of the inference DNN, the pre– and post-processing steps required, and the hardware. With multi-GPU nodes, you can expect the speedup factors to scale linearly for a given pipeline.

How CV-CUDA achieves high performance

CV-CUDA uses the power of GPU to achieve high performance:

Pre-allocated memory pooling to avoid duplicate allocation of GPU memory in the inference stage
Asynchronous operations
Kernel fusion-implementing combinations of operators using one GPU kernel to minimize unnecessary data transfer and kernel launch latency
Memory access efficiency by vectorizing global memory access and using fast shared memory
Computation efficiency, fast math, warp reduce/block reduction

Case study: End-to-end acceleration of a video-segmentation pipeline

Video-based segmentation is a common AI technique that segments pixels in a video frame based on their attributes. For example, in a virtual background or background blurring applications in video conferencing, it segments the foreground person or object and the background.

In this study, we discuss the performance evaluation of an AI video-segmentation pipeline deployed in the cloud using the NVIDIA T4 Tensor Core GPU instance on AWS, with a specific focus on compute cost optimization. The CPU attached to the instance is Intel Xeon Platinum 8362.

When the entire end-to-end AI pipeline is executed on GPU, you can anticipate significant cost savings. We then discuss the impact of this throughput performance speedup of the same workload on data-center energy consumption.

To conduct the experiment, we compared the GPU with CV-CUDA AI pipeline with the CPU with OpenCV implementation of the same pipeline, assuming that the inference workload in both cases runs on the GPU. Specifically, we deployed a ResNet-101 video-segmentation model pipeline (Figure 2), to perform AI background blurring.

In this case, we measured the latency for different stages and maximum throughput across the entire end-to-end pipeline. The pipeline encompasses multiple stages (Figure 4):

Video decoding
Preprocessing using operations like Downscale, Normalize, and Reformat
Inference using PyTorch
Post-processing using operations like Reformat, Upscale, BilateralFilter, Composition, and Blur
Video encoding

For both the CPU and GPU pipelines, we assumed that the inference workload ran on the GPU using PyTorch and TensorRT, respectively.

The conventional pipeline was built with OpenCV and PyTorch (GPU), implemented in Python, as this is a typical pattern for customers. The input video has a resolution of 1080p and consists of 474 frames with a batch size of 1. In this pipeline, the GPU is only used for inference, thanks to PyTorch, while the rest of the process is CPU-based:

The frames are decoded using OpenCV/ffmpeg.
The decoded images are pre-processed using OpenCV and fed into a PyTorch-powered DNN to detect which pixels belong to a cat, generating a mask.
In the post-processing stage, the output mask of the previous stage is composited with the original image and its blurred version resulting in the cat in the foreground and the background blurred.

For the GPU-based pipeline (Figure 4), we implemented the pre– and post-processing stages with optimized operators from CV-CUDA library, and the inference using the NVIDIA TensorRT library. We also accelerated the decoding and encoding parts of the pipeline on GPU using VPF.

Figure 5 shows that the end-to-end time for a one-frame batch was reduced from 132 ms to approximately 10 ms, demonstrating the impressive latency reduction achieved by the GPU pipeline. By employing a single NVIDIA T4 GPU, the CV-CUDA pipeline was ~13x faster compared to the CPU pipeline.

This result was obtained for a single process working on a single video. By deploying multiple processes to handle multiple videos concurrently, these optimizations enable higher throughput with the same hardware, resulting in significant cost and energy savings.

To better demonstrate the benefits that CV-CUDA delivers, we executed the GPU pipeline on different instances (one T4 GPU, one L4 GPU, four T4 GPUs, and four L4 GPUs).

The newly introduced NVIDIA L4 Tensor Core GPU, powered by the NVIDIA Ada Lovelace architecture, delivers low-cost, energy-efficient acceleration for video, AI, visual computing, graphics, and virtualization.

In Figure 6, the end-to-end throughput speedup on a single T4 GPU is ~5x compared to the CPU baseline, and the speedup is further improved on the new L4 GPU to ~12x. With multiple GPU instances, the performance almost scales linearly, for instance, ~19x and ~48x on four T4 GPUs and four L4 GPUs, respectively.

For computing annual cloud cost and energy consumption, we assumed a typical video workload of 500 video hours uploaded every minute to a video streaming platform. For the annual cloud cost, we only took T4 GPUs into consideration (L4 GPUs will be available in the future) and assumed one-year reserved pricing on Amazon EC2 G4 Instances.

Given this, the annual cost for this example video workload on a single T4 GPU would be ~1/5th of the CPU pipeline. This predicts typical cloud cost-saving estimates on the order of hundreds of millions of USD for such a workload.

For data centers, in addition to the costs associated with the hardware required to handle such huge workloads, energy efficiency is critical to lower energy costs and environmental impact.

In Figure 6 (right), the annual energy consumption (in GWh) is computed for the same video workload on a server with the respective hardware based on their average hourly power consumption. The energy consumption on a single L4 system is about ~1/12th of the CPU server. For a workload like the example video (500 hrs of video per minute), the annual energy savings estimates are on the order of hundreds of GWh.

These energy savings are significant as this is equivalent to avoiding greenhouse gas emissions from tens of thousands of passenger vehicles driven in a year, each driving ~11K miles per year.

CV-CUDA Beta v0.3.0 features

Now that you have seen the benefits of using CV-CUDA to accelerate AI computer vision workloads, here are some of the key features:

Open source: Apache 2.0 licensed open-source software on GitHub.
Supported operators: CV-CUDA provides 30+ specialized operators commonly used in the pre- and post-processing steps of AI computer vision workloads. These stateless, standalone operators are easily plugged in to existing custom processing frameworks. The common operators include ConvertTo, Custom crop, Normalize, PadStack, Reformat, and Resize. For more information, see the full list in the CV-CUDA Developer Guide.
New operators: CV-CUDA Beta v0.3.0 provides new operators like Remap, Find Contours, Non-Max Suppression, Thresholding, and Adaptive Thresholding.
Custom backend for NVIDIA Triton: You can now integrate CV-CUDA into a custom backend when building computer vision pipelines, with a sample application.
Multiple language APIs: CV-CUDA includes APIs for C/C++, and Python.
Framework interfaces: Easy-to-use and zero-copy interfaces to existing DL frameworks such as PyTorch and TensorFlow.
Batch support: All CV-CUDA operators are supported, so that you can achieve higher GPU utilization and better performance.
Uniform and variable shape batch support: CV-CUDA accepts tensors that have the same or different dimensions.
Sample applications: End-to-end accelerated image classification, object detection, and video segmentation sample applications.
Single-line PIP installation.
Installation, getting started, and API reference guides.

CV-CUDA is available for download

The open beta release (v0.3.0) of CV-CUDA is now available on GitHub.

CV-CUDA enables accelerating complex AI computer vision workloads in the cloud using optimized image and video processing kernels. The Python-friendly library can be easily integrated into existing pipelines with zero-copy interfaces to common deep learning frameworks like PyTorch and TensorFlow.

CV-CUDA, along with VPF and TensorRT, further optimizes the end-end AI workloads to achieve significant cost and energy savings. This makes it suitable for cloud-scale use cases:

For more information, see the following resources:

CV-CUDA documentation
NVIDIA Announces Microsoft, Tencent, Baidu Adopting CV-CUDA for Computer Vision AI
Runway Optimizes AI Image and Video Generation Tools Using CV-CUDA (video)
Overcome Pre- and Post-Processing Bottlenecks in AI-Based Imaging and Computer Vision Pipelines with CV-CUDA (video, free registration required)
Build AI-Based HD Maps for Autonomous Vehicles (video, free registration required)
Advance AI Applications with Custom GPU-Powered Plugins for NVIDIA DeepStream (video, free registration required)
FastDeploy: Full-Scene, High-Performance AI Deployment Tool (Presented by Baidu) (video, free registration required)
Accelerate Modern Video Applications in the Cloud using Video Codec SDK, CV-CUDA, and TensorRT on GPU (video in Mandarin, free registration required)

Using or considering CV-CUDA? Engage the product team for support. We’d love to hear from you.

Discuss (3)

About the Authors

About Sagar Singh
Sagar Singh is a data scientist at NVIDIA. He focuses on computer vision and natural language applications across consumer internet service providers. Before NVIDIA, Sagar obtained his Ph.D. from the Colorado School of Mines, Golden, and his Integrated M.S. from IIT Roorkee, where he researched high-performance computing and deep learning.

View all posts by Sagar Singh

About Sandeep Hiremath
Sandeep Hiremath is a lead technical product manager for computer vision at NVIDIA. He is an experienced product leader who specializes in the field of computer vision, machine learning and embedded systems. At NVIDIA, he is leading the product vision and strategy for a group of computer vision and image processing solutions for developers in automotive, healthcare, robotics, and research. In his previous role at MathWorks, his multiple customer-facing roles over more than a decade involved customer success, evangelism, and marketing solutions for MATLAB users in academia, robotics, and embedded vision.

View all posts by Sandeep Hiremath

About Naveen Kumar Rai
Naveen Kumar Rai is a senior software engineer at NVIDIA. He focuses on building computer vision deep learning applications. He has an M.Sc. in computer science from Stony Brook University, NY and a B.Tech from Indian Institute of Technology Guwahati, India in electronics and communication engineering.

View all posts by Naveen Kumar Rai

About Dhaval Suthar
Dhaval Suthar is a senior software engineer at NVIDIA. He focuses on training production-ready deep learning models using PyTorch and builds accelerated deep learning and computer vision pipelines. He has an M.S. in computer science from UCLA, where he was involved in large-scale simulation of behaviors of animal-agents using genetic algorithms and artificial neural networks.

View all posts by Dhaval Suthar