
VPF: Hardware-Accelerated Video Processing Framework in Python

Hardware-accelerated video encoding began with the Kepler generation of NVIDIA GPUs, and all GPUs since the Fermi generation support hardware-accelerated video decoding through the NVIDIA Video Codec SDK.

While the SDK offers great performance and flexibility, it requires knowledge of C/C++. Another option is to use third-party libraries and applications such as FFmpeg or GStreamer, which again require C/C++ expertise to build and customize for each user.

However, hardware-accelerated video features can be useful to a broader audience. The intent of VPF (Video Processing Framework) is to provide a simple yet powerful tool for using NVIDIA GPUs when working with video in Python. VPF builds on the NVIDIA Video Codec SDK for flexibility and performance, and gives developers the ease of use inherent to Python.

VPF is a set of C++ libraries and Python bindings that provide full hardware acceleration for video processing tasks such as decoding, encoding, transcoding, and GPU-accelerated color space and pixel format conversions. VPF is CMake-based, open-source, cross-platform software released under the Apache 2 license. It relies on the FFmpeg library for (de)muxing and on the pybind11 project for building the Python bindings.

VPF exports its C++ video processing classes into the PyNvCodec Python module. To illustrate the ease of use, let’s start with a quick code snippet that shows how to do fully hardware-accelerated video transcoding on a GPU without copying raw frames between Host and Device:

 

import PyNvCodec as nvc

gpuID = 0
encFile = "big_buck_bunny_1080p_h264.mov"
xcodeFile = open("big_buck_bunny_1080p.h264", "wb")

nvDec = nvc.PyNvDecoder(encFile, gpuID)
nvEnc = nvc.PyNvEncoder({'preset': 'hq', 'codec': 'h264', 's': '1920x1080'}, gpuID)

while True:
    rawSurface = nvDec.DecodeSingleSurface()
    # The decoder returns a Surface with a zero device pointer once the input file is over
    if not rawSurface.GetCudaDevicePtr():
        break

    encFrame = nvEnc.EncodeSingleSurface(rawSurface)
    # The encoder may return an empty array while it is still buffering frames
    if encFrame.size:
        frameByteArray = bytearray(encFrame)
        xcodeFile.write(frameByteArray)

# Encoder is asynchronous, so we need to flush it
encFrames = nvEnc.Flush()
for encFrame in encFrames:
    encByteArray = bytearray(encFrame)
    xcodeFile.write(encByteArray)

xcodeFile.close()

 

Despite its simple design, VPF demonstrates good performance. The transcoding sample shown above is enough to saturate the NVENC unit on an RTX 5000 GPU.

The Big Buck Bunny sequence contains 14,315 frames and can be transcoded within 32 seconds, which gives ~447 fps. This is achieved without advanced techniques such as a producer-consumer pattern, in which a decoder and an encoder run in separate threads and share a queue of decoded frames. Since all the transcoding is done on the GPU, there’s no noticeable CPU load.
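The producer-consumer pattern mentioned above could be sketched roughly as follows. This is only an illustration built on the Python standard library: `run_pipeline` and the three callables it takes are hypothetical stand-ins, not part of VPF, so the sketch runs without a GPU.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the decoded stream


def run_pipeline(decode_next, encode, flush, max_queued=4):
    """Decode in one thread and encode in another, linked by a bounded queue.

    decode_next() returns a frame, or None when the input is exhausted.
    encode(frame) returns a packet, or None while the encoder is buffering.
    flush() returns the list of packets still held by the encoder.
    All three callables are hypothetical stand-ins, not VPF API.
    """
    frames = queue.Queue(maxsize=max_queued)
    packets = []

    def producer():
        while True:
            frame = decode_next()
            if frame is None:
                frames.put(_END)
                return
            frames.put(frame)  # blocks while the queue is full

    def consumer():
        while True:
            frame = frames.get()
            if frame is _END:
                packets.extend(flush())
                return
            packet = encode(frame)
            if packet is not None:
                packets.append(packet)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return packets
```

One caveat if this were adapted to real VPF objects: a decoder may reuse a previously returned Surface on its next call, so a queued Surface would have to be protected from being overwritten before the encoder consumes it.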

The core of VPF is the PyNvDecoder and PyNvEncoder classes, which are Python bindings to the NVIDIA Video Codec SDK. VPF operates on two major data types:

  • NumPy arrays for CPU-side data
  • User-transparent Surface class which represents GPU-side data

 

Because allocating GPU-side memory objects is complex and heavily influences performance, VPF class methods that return a Surface own it and may reuse a previously returned Surface on the next call. In contrast, VPF class methods that return a NumPy array return a new instance every time they are called. Move constructors are used for this to avoid memory copy overhead.
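The practical consequence of this ownership rule can be shown with a toy model. The `ReusingSource` class below is hypothetical and uses plain Python buffers instead of GPU memory; it only mimics the two behaviors described above (a reused buffer versus a fresh copy per call).

```python
class ReusingSource:
    """Toy stand-in for a VPF class that owns the buffer it returns."""

    def __init__(self, size):
        self._buffer = bytearray(size)  # allocated once, reused on every call
        self._counter = 0

    def next_surface(self):
        # Overwrites and returns the same object, like a reused Surface.
        self._counter += 1
        self._buffer[:] = bytes([self._counter]) * len(self._buffer)
        return self._buffer

    def next_frame(self):
        # Returns an independent copy each time, like a new NumPy array.
        return bytes(self.next_surface())
```

In this model, a reference kept from an earlier `next_surface` call sees the newest data after the next call, whereas values returned by `next_frame` stay unchanged. Code that needs to keep a decoded Surface across calls should therefore copy or consume it first.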

 

For the sake of simplicity, both the PyNvDecoder and PyNvEncoder classes support only the NV12 pixel format. Other pixel formats are supported through a set of color space and pixel format conversion classes. All conversions are GPU-accelerated and done in VRAM for better performance.
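For readers unfamiliar with NV12: it is a 4:2:0 format with a full-resolution luma (Y) plane followed by a half-height plane of interleaved U and V samples. The helper below is a small sketch of the resulting buffer sizes. The name `nv12_layout` is made up for this example, and it assumes 8-bit samples and tightly packed rows, which a real GPU allocation may not guarantee.

```python
def nv12_layout(width, height):
    """Return (luma_bytes, chroma_bytes, total_bytes) for an 8-bit NV12 frame.

    Assumes even dimensions and no row padding (pitch == width).
    """
    if width % 2 or height % 2:
        raise ValueError("NV12 needs even width and height")
    luma = width * height            # full-resolution Y plane
    chroma = width * (height // 2)   # interleaved U/V plane, half the rows
    return luma, chroma, luma + chroma
```

For a 1920x1080 frame this gives 2,073,600 luma bytes plus 1,036,800 chroma bytes, or 1.5 bytes per pixel overall.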

 

PyNvDecoder class has five main methods:

DecodeSingleSurface

Decodes a single frame from the input video and returns a Surface with the decoded pixels. The next time this method is called, the previously returned Surface may be reused. If no frame was decoded, the returned Surface’s GetCudaDevicePtr method returns zero.

DecodeSingleFrame

Decodes a single frame from the input video and returns a NumPy array with the decoded pixels. Each call returns a new NumPy array instance. If no frame was decoded, an empty NumPy array is returned. This operation performs a Device-to-Host memory copy.

Width

Returns decoded frame width.

Height

Returns decoded frame height.

PixelFormat

Returns decoded frame pixel format.

DecodeSingleSurface and DecodeSingleFrame calls can be mixed freely; doing so does not break the decoder’s internal state. The decoder class supports the H.264 and H.265 codecs.
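A CPU-side decoding loop built on DecodeSingleFrame might look like the sketch below. The `decode_all_frames` helper is an illustration, not part of VPF; it relies only on the behavior described above (an empty array signals that no frame was decoded). `FakeDecoder` is a hypothetical stub that lets the loop be tried without a GPU.

```python
from types import SimpleNamespace


def decode_all_frames(decoder):
    """Yield frames until DecodeSingleFrame returns an empty array."""
    while True:
        frame = decoder.DecodeSingleFrame()
        if not frame.size:  # empty array: nothing was decoded
            return
        yield frame


class FakeDecoder:
    """Hypothetical stub that mimics the end-of-stream behavior only."""

    def __init__(self, frame_count):
        self._left = frame_count

    def DecodeSingleFrame(self):
        if self._left == 0:
            return SimpleNamespace(size=0)
        self._left -= 1
        return SimpleNamespace(size=16)
```

With the real module, the stub would be replaced by a decoder created as in the transcoding sample, for example `nvc.PyNvDecoder(encFile, gpuID)`.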

 

PyNvEncoder class has six methods:

EncodeSingleSurface

Takes an NV12 Surface with raw pixels, encodes it, and returns the elementary video bitstream as a NumPy array. The encoder is asynchronous, so this method may return an empty array for the first few calls (depending on encoder settings), which is not an error.

EncodeSingleFrame

Takes a NumPy array with raw pixels, encodes it, and returns the elementary video bitstream as a NumPy array. The encoder is asynchronous, so this method may return an empty array for the first few calls (depending on encoder settings), which is not an error.

Flush

Flushes the encoder. It does not return until all raw frames in the encoder’s queue are encoded, and then returns a list of NumPy arrays with the elementary stream bytes.

Width

Returns encoded frame width.

Height

Returns encoded frame height.

PixelFormat

Returns encoded frame pixel format.

Mixing EncodeSingleSurface and EncodeSingleFrame calls does not break the encoder’s internal state. PyNvEncoder can also take an input frame of arbitrary resolution and resize it on the GPU on the fly before the actual encoding. The encoder class supports the H.264 and H.265 codecs. Because the encoder is asynchronous, call the Flush method at the end of the encoding session to drain the encoder’s frame queue.
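Why the final flush matters can be seen in a toy model of an asynchronous encoder. `DelayedEncoder` and `encode_all` below are hypothetical and do no real encoding; the fixed delay simply imitates an encoder that holds a few frames before emitting output.

```python
from collections import deque


class DelayedEncoder:
    """Toy model of an asynchronous encoder with a fixed internal delay."""

    def __init__(self, delay):
        self._delay = delay
        self._pending = deque()

    def encode(self, frame):
        # Queue the frame; emit output only once the queue exceeds the delay.
        self._pending.append(frame)
        if len(self._pending) > self._delay:
            return b"pkt:" + self._pending.popleft()
        return b""  # nothing ready yet, which is not an error

    def flush(self):
        # Drain everything that is still queued.
        out = [b"pkt:" + frame for frame in self._pending]
        self._pending.clear()
        return out


def encode_all(encoder, frames):
    """Encode every frame, skipping empty results, then flush the tail."""
    packets = []
    for frame in frames:
        packet = encoder.encode(frame)
        if packet:
            packets.append(packet)
    packets.extend(encoder.flush())
    return packets
```

Without the `flush` call at the end of `encode_all`, the last `delay` frames would never reach the output, which is the same failure mode as skipping Flush on a real encoder.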

 

Below is a list of supported encoder parameters:

Parameter     Type             Meaning
------------  ---------------  -------------------------------------------------------------
profile       string           Encoding profile. Possible values for h264: baseline, main, high. Possible values for hevc: main.
lookahead     integer          Size of look-ahead.
vbvinit       integer + unit   VBV initial delay in bits; the unit can be 1, K, or M.
bitrate       integer + unit   Average bit rate; the unit can be 1, K, or M.
fps           integer          Frame rate.
preset        string           Encoding preset. Possible values: default, hp, hq, bd, ll, ll_hp, ll_hq, lossless, lossless_hp.
constqp       integer          QP value for constqp rate control mode.
qmin          integer          Min QP value.
qmax          integer          Max QP value.
cq            integer          Target constant quality level for VBR mode. Possible values: [0,51], 0=auto.
initqp        integer          Initial QP value.
temporalaq    (no value)       Enable temporal AQ.
vbvbufsize    integer + unit   VBV buffer size in bits; the unit can be 1, K, or M.
bf            integer          Number of consecutive B-frames.
rc            string           Rate control mode. Possible values: constqp, vbr, cbr, cbr_lowdelay_hq, cbr_hq, vbr_hq.
aq            integer          Enable spatial AQ and set its strength. Possible values: [0,15], 0=auto.
maxbitrate    integer + unit   Max bit rate; the unit can be 1, K, or M.
gop           integer          Length of GOP (Group of Pictures).
codec         string           Video codec. Possible values: h264, hevc.
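As an illustration of how these parameters might be combined, the sketch below builds a settings dictionary in the same style as the transcoding sample, with all values written as strings (whether plain integers are also accepted is not stated here). The `parse_with_unit` helper is hypothetical and only shows one plausible reading of the "integer + unit" type, assuming K means 1,000 and M means 1,000,000.

```python
def parse_with_unit(value):
    """Convert strings such as '500', '800K' or '10M' to an integer.

    Assumption: K means 1,000 and M means 1,000,000. This is a guess at
    the convention, so check the VPF sources before relying on it.
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = value.strip()
    if text and text[-1].upper() in multipliers:
        return int(text[:-1]) * multipliers[text[-1].upper()]
    return int(text)


# Example settings using only parameter names from the table, plus the
# 's' (frame size) key that appears in the transcoding sample.
encoder_settings = {
    "codec": "hevc",
    "preset": "hq",
    "profile": "main",
    "s": "1920x1080",
    "rc": "vbr_hq",
    "bitrate": "10M",
    "maxbitrate": "15M",
    "gop": "30",
    "bf": "2",
}
```

A dictionary like this would be passed as the first argument to `nvc.PyNvEncoder`, as in the sample at the top of the article.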

 

HardwareSurface class is a wrapper around CUdeviceptr:

GetCudaDevicePtr

Returns CUdeviceptr handle to CUDA memory object.

 

For memory transfers between Host and Device, there are two classes: PyFrameUploader and PySurfaceDownloader.

 

PyFrameUploader is used for uploading a NumPy array to GPU. It has only one method:

UploadSingleFrame

Uploads a NumPy array to the GPU and returns a handle to the uploaded Surface. The next time this method is called, the previously returned Surface may be reused.

 

PySurfaceDownloader is used to download a Surface from GPU. It also has only one method:

DownloadSingleSurface

Downloads a GPU-side Surface into a CPU-side NumPy array. Each call returns a new NumPy array instance.

 

Finally, there’s a PySurfaceConverter class which is used for GPU-accelerated color space and pixel format conversion. Below is the list of supported conversions:

  • YUV420 to NV12 
  • NV12 to YUV420
  • NV12 to RGB

PySurfaceConverter has one method:

Execute

Performs conversion on the GPU, returns handle to Surface with output format. The next time a user calls this method, previously returned Surface may be reused.
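To make the NV12 to YUV420 conversion concrete, here is a CPU-side reference of the memory rearrangement involved. It is not how VPF performs the conversion (that happens on the GPU), and `nv12_to_yuv420` is a made-up name. The sketch assumes 8-bit samples, tightly packed rows, and that YUV420 means the planar layout with separate U and V planes.

```python
def nv12_to_yuv420(nv12, width, height):
    """Split a tightly packed 8-bit NV12 buffer into planar Y, U, V bytes."""
    luma_size = width * height
    expected = luma_size * 3 // 2
    if len(nv12) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(nv12)}")
    y = bytes(nv12[:luma_size])   # Y plane is identical in both layouts
    uv = nv12[luma_size:]         # interleaved U0 V0 U1 V1 ...
    u = bytes(uv[0::2])           # even positions hold U samples
    v = bytes(uv[1::2])           # odd positions hold V samples
    return y, u, v
```

The luma plane is copied unchanged; only the chroma samples are de-interleaved, which is why this conversion is cheap compared with a color space change such as NV12 to RGB.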

 

VPF provides developers with a simple yet powerful Python tool for fully hardware-accelerated video encoding, decoding, and processing. Thanks to the C++ code underneath the Python bindings, it lets you achieve high GPU utilization in a few dozen lines of code. Decoded video frames are exposed either as NumPy arrays or as CUDA device pointers, which simplifies interaction with other code and extension with new features. VPF does not impose any restrictions beyond those of the NVIDIA Video Codec SDK and allows you to fully utilize the potential of NVIDIA professional-grade GPUs.
