Leveraging the Hardware JPEG Decoder and NVIDIA nvJPEG Library on NVIDIA A100 GPUs

According to surveys, people collectively produce an estimated 1.2 trillion images, captured by either a phone or a digital camera. Storing these images, especially in high-resolution raw format, consumes a large amount of storage.

JPEG refers to the Joint Photographic Experts Group, which celebrated its 25th birthday in 2017. The JPEG standard specifies the codec, which defines how an image is compressed into a bitstream of bytes and is decompressed back to an image.

The main purpose of the JPEG codec is to minimize the file size of photographic images. JPEG is a lossy compression format, which means that it doesn’t store the full pixel data of the original image. One of JPEG’s advantages is that it lets you fine-tune the amount of compression. Used correctly, this yields good image quality at the smallest reasonable file size.

The key components of JPEG compression are as follows:

- Color space conversion separates the luminance (Y) and chrominance (Cb, Cr) components. Downsampling Cb and Cr reduces file size with almost unnoticeable loss of quality, because human perception is less sensitive to these components. This step is not part of the core standard but is defined as part of the JFIF format.
- Block-based discrete cosine transform (DCT) compacts the data into the lower frequencies.
- Quantization rounds the coefficients for high-frequency details. Losing these details is usually acceptable because the human eye cannot easily distinguish high-frequency content.
- Progressive encoding lets you preview a low-quality version of the whole image after only part of its bitstream has been decoded.

The following images (Figure 1) demonstrate the quality loss from JPEG compression. The original butterfly image is in BMP format (512×512, 24-bit, 769 KB, no compression). The same image is then shown in JPEG format with a quality setting of 50%, 4:2:0 subsampling, 24-bit, at a size of 33 KB.

*Figure 1a. Original butterfly image (no compression, Size 512×512, 24-bit), 769 KB.*

*Figure 1b. Compressed butterfly image (quality compression coefficient 50%, subsampling 4:2:0, 24-bit), 33 KB.*
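The sizes quoted for Figure 1 can be sanity-checked with a little arithmetic. The following is a minimal sketch for illustration only: the function names are hypothetical, and the 769 KB and 33 KB values are simply the ones given in the captions.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for uncompressed pixel data (no file header or row padding).
std::size_t rawImageBytes(std::size_t width, std::size_t height, std::size_t bitsPerPixel) {
    assert(bitsPerPixel % 8 == 0);
    return width * height * (bitsPerPixel / 8);
}

// Ratio of original size to compressed size; larger means stronger compression.
double compressionRatio(double originalSize, double compressedSize) {
    assert(compressedSize > 0.0);
    return originalSize / compressedSize;
}
```

For the 512×512, 24-bit butterfly image, `rawImageBytes(512, 512, 24)` gives 786,432 bytes (768 KiB) of pixel data, which is in line with the quoted BMP size once a file header is added. `compressionRatio(769.0, 33.0)` is a little over 23, so the JPEG in Figure 1b is roughly 23 times smaller than the original.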

How JPEG works

Figure 2 shows one of the JPEG encoder’s common configurations.

*Figure 2. Diagram of the JPEG encoding process employing a parallel utilization of GPU CUDA software and CPU.*

First, JPEG encoding starts with an RGB color image.

The second step converts the image to the YCbCr color space, with Y representing luminance (brightness) and the Cb and Cr channels representing the chroma (blue and red projections). The Cb and Cr channels are then downsampled by a predetermined factor, usually two or three. This downsampling is the first stage of compression.
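To make the color conversion and downsampling step concrete, here is a minimal CPU sketch. It assumes the full-range conversion constants from the JFIF specification and a simple 2×2 averaging filter for the chroma planes. The helper names are hypothetical, and this is not how nvJPEG implements the step on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct YCbCr { std::uint8_t y, cb, cr; };

// Round to nearest and clamp to the 8-bit range.
static std::uint8_t toByte(double v) {
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, std::lround(v))));
}

// Full-range JFIF conversion from 8-bit RGB to 8-bit YCbCr.
YCbCr rgbToYCbCr(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const double y  =         0.299    * r + 0.587    * g + 0.114    * b;
    const double cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5      * b;
    const double cr = 128.0 + 0.5      * r - 0.418688 * g - 0.081312 * b;
    return {toByte(y), toByte(cb), toByte(cr)};
}

// Halve a chroma plane in both directions by averaging each 2x2 block.
// Width and height are assumed to be even to keep the sketch short.
std::vector<std::uint8_t> downsample2x2(const std::vector<std::uint8_t>& plane,
                                        int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(plane.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out((width / 2) * (height / 2));
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            const int sum = plane[y * width + x] + plane[y * width + x + 1] +
                            plane[(y + 1) * width + x] + plane[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + x / 2] = static_cast<std::uint8_t>((sum + 2) / 4);
        }
    }
    return out;
}
```

Any gray pixel maps to Cb = Cr = 128, which is why the chroma planes of low-saturation images compress so well. Halving both chroma planes in each direction removes half of the total sample count before any transform coding takes place.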

In the next stage, each channel is split into 8×8 blocks and the DCT, a transform into frequency space similar to a Fourier transform, is computed for each block. The DCT itself is lossless and reversible, transforming an 8×8 spatial block into 64 frequency coefficients.
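The reversibility of the DCT can be checked with a direct, unoptimized implementation. The sketch below uses the 8×8 DCT-II normalization from the JPEG standard and works in double precision, so the round trip is exact only up to floating-point error. Real codecs use fast factorizations and level-shift the pixel values by 128 first. The names here are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Block = std::array<std::array<double, 8>, 8>;

// Normalization factor: 1/sqrt(2) for the zero-frequency term, 1 otherwise.
static double scale(int k) {
    assert(k >= 0 && k < 8);
    return k == 0 ? 1.0 / std::sqrt(2.0) : 1.0;
}

// Forward 2D DCT-II of one 8x8 block of samples.
Block forwardDct(const Block& pixels) {
    const double pi = std::acos(-1.0);
    Block coeffs{};
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int x = 0; x < 8; ++x)
                for (int y = 0; y < 8; ++y)
                    sum += pixels[x][y]
                         * std::cos((2 * x + 1) * u * pi / 16.0)
                         * std::cos((2 * y + 1) * v * pi / 16.0);
            coeffs[u][v] = 0.25 * scale(u) * scale(v) * sum;
        }
    return coeffs;
}

// Inverse transform: rebuilds the 8x8 samples from the 64 coefficients.
Block inverseDct(const Block& coeffs) {
    const double pi = std::acos(-1.0);
    Block pixels{};
    for (int x = 0; x < 8; ++x)
        for (int y = 0; y < 8; ++y) {
            double sum = 0.0;
            for (int u = 0; u < 8; ++u)
                for (int v = 0; v < 8; ++v)
                    sum += scale(u) * scale(v) * coeffs[u][v]
                         * std::cos((2 * x + 1) * u * pi / 16.0)
                         * std::cos((2 * y + 1) * v * pi / 16.0);
            pixels[x][y] = 0.25 * sum;
        }
    return pixels;
}
```

A flat block with every sample equal to 100 transforms to a single nonzero coefficient, `coeffs[0][0]` of about 800, with all other coefficients at zero. This is the energy compaction that makes the later quantization step effective on smooth image regions.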

The DCT coefficients are then quantized, a process which is lossy and comprises the second compression stage. Quantization is controlled by the JPEG quality parameter, with lower quality settings corresponding to more severe compression and resulting in smaller files.

Quantization thresholds are specific to each spatial frequency and have been carefully designed. Less compression is applied to low frequencies than to high frequencies, as the human eye is more sensitive to subtle errors over broad areas than to changes in the magnitude of high-frequency signals.
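As an illustration of how a quality setting can drive quantization, the sketch below scales the example luminance table from Annex K of the JPEG standard using the convention popularized by the IJG libjpeg library. That scaling rule is an assumption made for this example, since the standard does not define a quality parameter and other encoders, nvJPEG included, may map quality to tables differently.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

// Example luminance quantization table from Annex K of the JPEG standard (row-major).
static const std::array<int, 64> kBaseLuma = {{
    16, 11, 10, 16,  24,  40,  51,  61,
    12, 12, 14, 19,  26,  58,  60,  55,
    14, 13, 16, 24,  40,  57,  69,  56,
    14, 17, 22, 29,  51,  87,  80,  62,
    18, 22, 37, 56,  68, 109, 103,  77,
    24, 35, 55, 64,  81, 104, 113,  92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103,  99}};

// Scale the base table for a quality in [1, 100] (libjpeg-style convention).
// Lower quality gives larger step sizes and therefore coarser rounding.
std::array<int, 64> scaledTable(int quality) {
    assert(quality >= 1 && quality <= 100);
    const int scale = quality < 50 ? 5000 / quality : 200 - 2 * quality;
    std::array<int, 64> table{};
    for (int i = 0; i < 64; ++i)
        table[i] = std::min(255, std::max(1, (kBaseLuma[i] * scale + 50) / 100));
    return table;
}

// Quantization divides a DCT coefficient by its step size and rounds (lossy).
int quantize(double coeff, int step) {
    assert(step > 0);
    return static_cast<int>(std::lround(coeff / step));
}

// De-quantization multiplies back; the rounding error is not recoverable.
double dequantize(int level, int step) {
    return static_cast<double>(level) * step;
}
```

With quality 50 the table is unchanged, so a DC coefficient of 100 is stored as level 6 and reconstructed as 96. The step sizes grow toward the bottom-right (high-frequency) corner of the table, so small high-frequency coefficients round to zero.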

The final stage is where the quantized DCT coefficients are compressed losslessly with Huffman encoding and stored in a JPEG file, like the image.jpg shown in Figure 2.

Figure 3 shows the JPEG decoding process on NVIDIA GPUs.

*Figure 3. The JPEG decoding process employs a parallel utilization of GPU CUDA and software. A hybrid (CPU/GPU) approach for Huffman decoding overcomes the serial process stall.*

The JPEG decoding process starts with compressed JPEG bitstreams and extracts header information.

Next comes Huffman decoding, which is an inherently serial step because the DCT coefficients are decoded one at a time from the bitstream.
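The serial nature of this step follows from the variable-length codes: the position of a codeword in the bitstream is known only once the previous codeword has been decoded. The toy decoder below illustrates this with a made-up prefix code stored as text. It is not a real JPEG Huffman table, and it does not handle the run-length and extra-bit fields that follow each symbol in an actual JPEG bitstream.

```cpp
#include <cassert>
#include <map>
#include <string>

// Decode a string of '0'/'1' characters using a prefix code.
std::string decodePrefixCode(const std::string& bits,
                             const std::map<std::string, char>& codes) {
    std::string out;
    std::string current;
    for (char bit : bits) {
        assert(bit == '0' || bit == '1');
        current.push_back(bit);
        auto it = codes.find(current);
        if (it != codes.end()) {   // a complete codeword has been read
            out.push_back(it->second);
            current.clear();       // only now is the start of the next codeword known
        }
    }
    assert(current.empty());       // leftover bits would mean a truncated codeword
    return out;
}
```

With the hypothetical code `{"0": 'a', "10": 'b', "110": 'c', "111": 'd'}`, the bit string `0101100111` decodes to `abcad`. There is no way to jump to the fourth symbol without first decoding the three before it.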

The next step handles the de-quantization and inverse DCT into 8×8 blocks.

The final steps upsample the chroma channels, convert from YCbCr, and produce the decoded RGB image.
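A minimal CPU sketch of this last stage is shown below. It assumes the full-range JFIF conversion constants and nearest-neighbor chroma upsampling for 4:2:0 data. Production decoders may use smoother interpolation filters, and the helper names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Rgb { std::uint8_t r, g, b; };

// Round to nearest and clamp to the 8-bit range.
static std::uint8_t clampToByte(double v) {
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, std::lround(v))));
}

// Full-range JFIF conversion from 8-bit YCbCr back to 8-bit RGB.
Rgb yCbCrToRgb(std::uint8_t y, std::uint8_t cb, std::uint8_t cr) {
    const double cbCentered = cb - 128.0;
    const double crCentered = cr - 128.0;
    const double r = y + 1.402 * crCentered;
    const double g = y - 0.344136 * cbCentered - 0.714136 * crCentered;
    const double b = y + 1.772 * cbCentered;
    return {clampToByte(r), clampToByte(g), clampToByte(b)};
}

// Index of the chroma sample that covers luma pixel (x, y) when chroma was
// halved in both directions (4:2:0). This is nearest-neighbor upsampling.
int chromaIndex420(int x, int y, int lumaWidth) {
    assert(x >= 0 && y >= 0 && x < lumaWidth);
    const int chromaWidth = (lumaWidth + 1) / 2;
    return (y / 2) * chromaWidth + (x / 2);
}
```

Each chroma sample is reused for a 2×2 group of luma pixels, so `chromaIndex420(2, 2, 4)` and `chromaIndex420(3, 3, 4)` both return index 3.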

NVIDIA has accelerated the JPEG codec with the nvJPEG library built on CUDA technology. We developed a complete parallel implementation of the JPEG algorithm. The typical GPU-accelerated parts of the JPEG encoder and decoder workflows are shown in Figures 2 and 3.

New JPEG hardware decoder

Recently, we introduced the NVIDIA A100 GPU, which has a dedicated hardware JPEG decoder. Previously, there was no such hardware unit on datacenter GPUs, and JPEG decoding was a pure software CUDA solution that used both CPU and GPU.

Now, the hardware decoder runs simultaneously with the rest of the GPU, which can be doing various computational tasks such as image classification, object detection, and image segmentation. This provides a substantial throughput increase in more than one way, with 4-8x JPEG decode speedup compared to the NVIDIA Tesla V100.

The hardware decoder is exposed through the nvJPEG library, which is part of the CUDA Toolkit.

nvJPEG library overview

nvJPEG is a GPU-accelerated library for JPEG encoding and decoding. Together with NVIDIA DALI, a data augmentation and image loading library, it can speed up deep learning training of image classification models by accelerating the decoding and augmentation of data. The A100 includes a 5-core hardware JPEG decode engine, and nvJPEG uses this hardware backend for batched processing of JPEG images.

*Figure 4. The JPEG hardware decoding process employs a parallel utilization of hardware decoder and GPU CUDA software. The HW decoder is independent of the CUDA SMs so that software GPU decoders can be used simultaneously.*

When you select the hardware decoder with the nvjpegCreateEx init function, nvJPEG accelerates baseline JPEG decoding and supports various color conversion formats (for example, YUV 420, 422, and 444). As shown in Figure 4, this results in up to 20x faster image decoding compared to CPU-only processing. DALI users benefit from this hardware acceleration directly, because DALI abstracts nvJPEG.

The nvJPEG library supports the following operations:

- nvJPEG Encoding
- nvJPEG Transcoding
- nvJPEG Decoding (includes hardware decoder support on A100)

The library supports the following JPEG options:

- Baseline and progressive JPEG encoding and decoding (the A100 hardware decoder supports baseline decoding only)
- 8 bits per pixel
- Huffman bitstream decoding
- Up to four channel JPEG bitstreams
- 8- and 16-bit quantization tables
- The following chroma subsampling for the three color channels Y, Cb, Cr (Y, U, V):
  - 4:4:4
  - 4:2:2
  - 4:2:0
  - 4:4:0
  - 4:1:1
  - 4:1:0
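Each subsampling mode determines how large the Cb and Cr planes are relative to the Y plane. The following sketch computes the chroma plane size for the modes listed above. It is a simplified model that rounds up partial samples and ignores MCU padding, and the function name and string-based interface are hypothetical rather than part of the nvJPEG API.

```cpp
#include <cassert>
#include <string>

struct PlaneSize { int width, height; };

// Chroma plane size for a given luma plane size and subsampling mode.
// Dimensions round up so that partial samples at the edges are kept.
PlaneSize chromaPlaneSize(int lumaWidth, int lumaHeight, const std::string& subsampling) {
    assert(lumaWidth > 0 && lumaHeight > 0);
    int h = 1, v = 1;  // horizontal and vertical reduction factors
    if      (subsampling == "4:4:4") { h = 1; v = 1; }
    else if (subsampling == "4:2:2") { h = 2; v = 1; }
    else if (subsampling == "4:2:0") { h = 2; v = 2; }
    else if (subsampling == "4:4:0") { h = 1; v = 2; }
    else if (subsampling == "4:1:1") { h = 4; v = 1; }
    else if (subsampling == "4:1:0") { h = 4; v = 2; }
    else assert(false && "unknown subsampling mode");
    return {(lumaWidth + h - 1) / h, (lumaHeight + v - 1) / v};
}
```

For example, a 640 x 426 image with 4:2:0 subsampling has 320 x 213 chroma planes, which matches the per-channel sizes that the decoder sample prints later in this post.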

The library has the following features:

- Hybrid decoding using both the CPU and the GPU.
- Input to the library is in the host memory, and the output is in the GPU memory.
- Single image and batched image decoding.
- User-provided memory manager for the device and pinned host memory allocations.

Performance numbers

For the performance graphs in this section, we used the following test setup and GPU/CPU hardware:

- NVIDIA V100 GPU:
  - CPU: E5-2698 v4 @ 2 GHz, 3.6 GHz Turbo (Broadwell), HT on
  - GPU: Tesla V100-SXM2-16GB (GV100), 1 × 16160 MiB, 1 × 80 SM, GPU video clock 1312
  - Batch 128 and single thread
- NVIDIA A100 GPU:
  - CPU: Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on
  - GPU: A100-SXM4-40GB (GA100), 1 × 40557 MiB, 1 × 108 SM, GPU video clock 1095
  - Batch 128 and single thread
- CPU:
  - CPU: Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on
  - TurboJPEG decode for CPU testing
- Image dataset:
  - 2K FHD = 1920 x 1080
  - 4K UHD = 3840 x 2160
- CUDA Toolkit 11.0
- CUDA driver r450.24

The next two charts show the decoding speedup as a result of the hardware JPEG decoder.

*Figure 5. Graph showing the speedup achieved by hardware decode on A100 over the CUDA hybrid decode on V100.*

*Figure 6. The number of CPU threads required by the hybrid decoder on V100 to keep up with hardware decoder throughput on A100.*

By offloading the decoding to hardware, you are freeing up valuable CPU cycles that can be put to better use.

Figure 7 shows the encoding speedup.

*Figure 7a. JPEG baseline encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).*

*Figure 7b. JPEG progressive encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).*

Image decoding example

Here’s an image decoding example using the nvJPEG library. This example shows the use of the hardware decoder on the A100 GPU and the backend fallback for other NVIDIA GPUs.

//
// The following code example shows how to use the nvJPEG library for JPEG image decoding.
//
// Libraries used
// nvJPEG decoding

#include <iostream>
#include <nvjpeg.h>

int main()
{
    ...
    // custom allocators for device memory and pinned host memory
    nvjpegDevAllocator_t dev_allocator = {&dev_malloc, &dev_free};
    nvjpegPinnedAllocator_t pinned_allocator = {&host_malloc, &host_free};

    // try to create the nvJPEG handle with the A100 hardware decoder backend
    nvjpegStatus_t status = nvjpegCreateEx(NVJPEG_BACKEND_HARDWARE,
                                           &dev_allocator,
                                           &pinned_allocator,
                                           NVJPEG_FLAGS_DEFAULT,
                                           &params.nvjpeg_handle);

    params.hw_decode_available = true;
    if (status == NVJPEG_STATUS_ARCH_MISMATCH) {
        std::cout << "Hardware Decoder not supported. Falling back to default backend" << std::endl;
        // no hardware decoder on this GPU: select the default (GPU software) backend
        nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, &dev_allocator,
                       &pinned_allocator, NVJPEG_FLAGS_DEFAULT,
                       &params.nvjpeg_handle);
        params.hw_decode_available = false;
    }

    // create JPEG decoder state
    nvjpegJpegStateCreate(params.nvjpeg_handle, &params.nvjpeg_state);

    // extract bitstream metadata to figure out whether a bitstream can be decoded
    nvjpegJpegStreamParseHeader(params.nvjpeg_handle, (const unsigned char *)img_data[i].data(), img_len[i], params.jpeg_streams[0]);

    // decode the batch of images
    nvjpegDecodeBatched(params.nvjpeg_handle, params.nvjpeg_state,
                        batched_bitstreams.data(),
                        batched_bitstreams_size.data(),
                        batched_output.data(), params.stream);

    ...
}

$ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
$ cd CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/
$ mkdir build
$ cd build
$ cmake ..
$ make

# Running nvJPEG decoder
$ ./nvjpegDecoder -i ../input_images/ -o ~/tmp

Decoding images in directory: ../input_images/, total 12, batchsize 1
Processing: ../input_images/cat_baseline.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file:/tmp/cat_baseline.bmp
Processing: ../input_images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img8.bmp
Processing: ../input_images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img5.bmp
Processing: ../input_images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img7.bmp
Processing: ../input_images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: /tmp/img2.bmp
Processing: ../input_images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img4.bmp
Processing: ../input_images/cat.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file:/tmp/cat.bmp
Processing: ../input_images/cat_grayscale.jpg
Image is 1 channels.
Channel #0 size: 64 x 64
Grayscale JPEG 
Done writing decoded image to file:/tmp/cat_grayscale.bmp
Processing: ../input_images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: /tmp/img1.bmp
Processing: ../input_images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img3.bmp
Processing: ../input_images/img9.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img9.bmp
Processing: ../input_images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img6.bmp
Total decoding time: 14.8286
Avg decoding time per image: 1.23571
Avg images per sec: 0.809248
Avg decoding time per batch: 1.23571

Image resizing example

This image resizing and watermarking example generates scaled versions of images at the size that a client requests. Figure 8 shows the typical image resizing and watermarking workflow.

*Figure 8. Image resizing and watermarking pipeline employing a parallel utilization of GPU software and CUDA.*

The following code example shows how to resize images and watermark them with a logo image.

//
// The following code example shows how to resize images and watermark them with a logo image.
//
// Libraries used
// nvJPEG decoding, NPP Resize, NPP watermarking, nvJPEG encoding

#include <npp.h>
#include <nvjpeg.h>

int main()
{
    ...
    // nvJPEG decoder
    nReturnCode = nvjpegDecode(nvjpeg_handle, nvjpeg_decoder_state, dpImage, nSize, oformat, &imgDesc, NULL);

    // NPP image resize (decoded input image)
    st = nppiResize_8u_C3R_Ctx(imgDesc.channel[0], imgDesc.pitch[0], srcSize,
         srcRoi, imgResize.channel[0], imgResize.pitch[0], dstSize, dstRoi,
         NPPI_INTER_LANCZOS, nppStreamCtx);

    // NPP image resize (watermark logo, scaled to the same destination size)
    st = nppiResize_8u_C3R_Ctx(imgDescW.channel[0], imgDescW.pitch[0], srcSizeW,
         srcRoiW, imgResizeW.channel[0], imgResizeW.pitch[0], dstSize, dstRoi,
         NPPI_INTER_LANCZOS, nppStreamCtx);

    // Alpha blending watermarking (result is written back into imgResize)
    st = nppiAlphaCompC_8u_C3R_Ctx(imgResize.channel[0], imgResize.pitch[0],
         255, imgResizeW.channel[0], imgResizeW.pitch[0], ALPHA_BLEND,
         imgResize.channel[0], imgResize.pitch[0], dstSize, NPPI_OP_ALPHA_PLUS,
         nppStreamCtx);

    // nvJPEG encoding
    nvjpegEncodeImage(nvjpeg_handle, nvjpeg_encoder_state, nvjpeg_encode_params,
         &imgResize, iformat, dstSize.width, dstSize.height, NULL);

    ...
}

$ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
$ cd CUDALibrarySamples/nvJPEG/Image-Resize-WaterMark/
$ mkdir build
$ cd build
$ cmake ..
$ make

# Running Image resizer and watermarking
$ ./imageResizeWatermark -i ../input_images/ -o resize_images -q 85 -rw 512 -rh 512
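For readers who want to reason about the watermarking step without a GPU, the following CPU sketch shows a weighted additive blend. It assumes that the `NPPI_OP_ALPHA_PLUS` operation with constant alphas computes `alphaA * A + alphaB * B` with saturation. That is an assumption about NPP semantics that should be checked against the NPP documentation, and the rounding behavior here may differ from the library. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Weighted additive blend of one 8-bit sample: alphaA * a + alphaB * b,
// with alphas given in [0, 255] and the result saturated to 255.
std::uint8_t blendPlus(std::uint8_t a, std::uint8_t alphaA,
                       std::uint8_t b, std::uint8_t alphaB) {
    const double v = (alphaA / 255.0) * a + (alphaB / 255.0) * b;
    return static_cast<std::uint8_t>(std::min(255L, std::lround(v)));
}

// Blend a logo onto an image in place. Both buffers hold interleaved 8-bit
// samples and must have the same size. The image keeps full weight (255).
void watermark(std::vector<std::uint8_t>& image,
               const std::vector<std::uint8_t>& logo,
               std::uint8_t logoAlpha) {
    assert(image.size() == logo.size());
    for (std::size_t i = 0; i < image.size(); ++i)
        image[i] = blendPlus(image[i], 255, logo[i], logoAlpha);
}
```

With a logo alpha of 51 (20% of 255), an image sample of 100 and a logo sample of 50 blend to 110, while bright samples saturate at 255. A small alpha therefore keeps the logo visible without washing out the underlying image.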