Building autonomous robots requires robust, low-latency visual perception for depth, obstacle recognition, localization, and navigation in dynamic environments. These capabilities demand heavy compute. NVIDIA Jetson platforms offer powerful GPUs for deep learning, but increasing AI complexity and the need for real-time performance can lead to GPU oversubscription. Relying solely on the GPU for all perception tasks can result in bottlenecks, increased power consumption, and thermal challenges, especially in power-sensitive and thermally constrained environments common in mobile robotics.
The NVIDIA Jetson platform addresses these challenges by combining powerful GPUs with dedicated hardware accelerators. Jetson devices like NVIDIA Jetson AGX Orin and NVIDIA Jetson Thor house specialty hardware accelerators designed to execute image processing and computer-vision tasks with high efficiency. That frees up the GPU for more demanding deep-learning workloads. The NVIDIA Vision Programming Interface (VPI) unlocks the full potential of these diverse hardware accelerators.
In this blog, we explore the benefits of using these accelerators and explain how developers can use VPI to unlock the full potential of the Jetson platform. As an example, we walk through the development of a low-latency, low-power perception application for stereo disparity using these accelerators. We start with a single stereo camera pipeline, then move on to a multi-stream pipeline with eight stereo cameras running at 30 FPS on Thor T5000, about 10x faster than Orin AGX 64 GB.
Before we jump into development, let’s quickly look over what accelerators are available on the Jetson platform, their benefits, what applications they can unlock, and how VPI can help.
What accelerators does Jetson offer beyond the GPU?
Jetson devices have powerful GPUs for deep learning, but increasing AI complexity demands better GPU cycle management. Jetson offers specialized engines for computer vision (CV) workloads. While the GPU is powerful and flexible, these engines, when combined with the GPU, offer significant computational advantages. VPI simplifies access to these engines, making experimentation and load-balancing easy.

Let’s look at each accelerator closely to understand its purpose and benefits.
Programmable Vision Accelerator (PVA):
The PVA is a programmable digital signal processing (DSP) engine with a 1024-bit single instruction, multiple data (SIMD) unit and local memory with flexible direct memory access (DMA), optimized for vision and image processing with high performance per watt. It runs asynchronously alongside the CPU, GPU, and other accelerators, and is available on all Jetson SKUs except NVIDIA Jetson Nano.
Through VPI, developers can access ready‑to‑use algorithms like AprilTag detection, object tracking, and stereo disparity estimation. For custom implementation of algorithms, the PVA SDK, now available to Jetson developers, provides C/C++ APIs and tools for developing vision algorithms directly on the PVA.
Optical Flow Accelerator (OFA):
The OFA is a fixed-function hardware accelerator for computing optical flow and stereo disparity from stereo camera pairs. The OFA operates in two modes: in stereo disparity mode, it estimates a disparity map by processing rectified left and right views from a camera pair; in optical flow mode, it estimates 2D motion vectors between two frames.
Video and Image Compositor (VIC):
The VIC is a fixed-function, power-efficient hardware accelerator in Jetson devices that is specialized for low-level image processing tasks such as rescaling, remapping, warping, color space conversion, and noise reduction.
What use cases benefit from these accelerators?
Below are some scenarios where developers may consider going beyond the GPU for their specific application needs:
- GPU-oversubscribed applications: As a best practice, developers should prioritize deep learning (DL) workloads for the GPU and offload computer-vision tasks to the PVA, OFA, or VIC using VPI. For example, DeepStream’s Multi-Object Tracker can run 12 video streams on Orin AGX with the GPU alone, but by load balancing with the PVA it can support 16 streams.
- Power‑sensitive applications: In use cases like sentry mode or activity monitoring, offloading most computation to low‑power accelerators (PVA, OFA, VIC) can provide maximum efficiency.
- Industrial applications with thermal limits: In high‑heat environments, distributing workloads across all accelerators reduces throttling and helps maintain latency and throughput within thermal budgets.
How to use VPI to unlock all the accelerators
VPI provides a unified and flexible framework that gives developers access to accelerators seamlessly on platforms ranging from Jetson modules to workstations or PCs with discrete GPUs.
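For example, here is a minimal sketch, using a placeholder NumPy frame rather than real camera data, of how a single VPI call can be retargeted to a different engine just by changing its backend argument. It reuses the format-conversion call shown later in this post; actual format support per backend should be checked against the VPI documentation.
import vpi
import numpy as np
# Placeholder frame standing in for an RGB camera image
frame = vpi.asimage(np.zeros((600, 960, 3), dtype=np.uint8))
# The same conversion call, dispatched to two different engines by changing one argument
gray_vic = frame.convert(vpi.Format.Y8_ER_BL, backend=vpi.Backend.VIC)    # offloaded to the VIC
gray_cuda = frame.convert(vpi.Format.Y8_ER_BL, backend=vpi.Backend.CUDA)  # same call on the GPU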
Now let’s look at an example that brings it all together.
Example: stereo vision pipeline
Modern robotics stacks often rely on passive stereo systems for 3D perception of the surrounding world. As a result, computing stereo disparity maps is an essential step toward building a complex perception stack. Here we look at a sample pipeline a developer can use to produce stereo disparity and confidence maps, and show how to construct a low-latency, energy-efficient pipeline with all the accelerators available via VPI.

- Preprocessing on CPU: The preprocessing step can run on a CPU as it only happens once. This step computes a rectification map that will be used for correcting lens distortion from the stereo camera frames.
- Remap on VIC: This step undistorts and aligns camera frames using a precomputed rectification map, ensuring both optical axes are level and parallel. VPI supports polynomial and fisheye distortion models and lets developers define custom warp maps. See the Remap documentation for details.
- Stereo disparity on OFA: The rectified image pairs are inputs for the semi-global matching (SGM) algorithm. In practice, SGM alone can be noisy and produce erroneous disparity values. A confidence map can be created to improve the result by discarding disparity estimates that correspond to low confidence values. For more details on SGM and the supported parameters, refer to stereo disparity documentation.
- Confidence map on PVA: VPI supports three confidence map modes: ABSOLUTE, RELATIVE, and INFERENCE. ABSOLUTE and RELATIVE require two OFA passes (left/right disparity) plus a PVA cross‑check, while INFERENCE uses a single OFA pass followed by a CNN on PVA (two convolution + two non-linear activation layers). Skipping confidence computation is fastest but produces noisy disparity maps, whereas RELATIVE and INFERENCE improve both disparity quality and confidence.
VPI’s unified memory architecture eliminates unnecessary data copies across engines, and its asynchronous stream/event model lets developers schedule workloads and sync points in advance. Hardware‑managed scheduling enables parallel execution across engines, freeing the CPU and hiding latency with an efficient streaming pipeline.
Building a high-performance stereo disparity pipeline using VPI
Getting started with Python APIs
This tutorial walks through a basic stereo disparity pipeline without remap using the VPI Python API.
Prerequisites:
- An NVIDIA Jetson device (e.g., Jetson AGX Thor)
- VPI installed via NVIDIA SDK Manager or apt
- Python libraries: vpi, numpy, Pillow, opencv-python
In this tutorial, we will:
- Load left and right stereo images
- Convert their format for processing
- Synchronize the streams to ensure data is ready
- Execute the stereo disparity algorithm
- Post-process the output and save the result
Setup and initialization
The first step is to import the necessary libraries and create VPIStream objects. A VPIStream acts as a command queue, allowing you to submit tasks for asynchronous execution. We will use two streams to demonstrate parallel processing.
import vpi
import numpy as np
from PIL import Image
from argparse import ArgumentParser
# Parse the paths of the left and right input images (used below as args.left and args.right)
parser = ArgumentParser()
parser.add_argument('left', help='rectified left image')
parser.add_argument('right', help='rectified right image')
args = parser.parse_args()
# Create two streams for parallel processing
streamLeft = vpi.Stream()
streamRight = vpi.Stream()
The streamLeft will handle the left image, and streamRight will handle the right image.
Loading and converting images
VPI’s Python API can work directly with NumPy arrays. We load the images using Pillow and then wrap them with VPI’s asimage function. Next, we convert the images to a format suitable for the stereo disparity algorithm. For this example, we’ll convert from RGBA8 to Y8_ER_BL (8-bit grayscale, block-linear format).
# Load images and wrap them in VPI images
left_img = np.asarray(Image.open(args.left))
right_img = np.asarray(Image.open(args.right))
left = vpi.asimage(left_img)
right = vpi.asimage(right_img)
# Convert images to Y8_ER_BL format in parallel on different backends
left = left.convert(vpi.Format.Y8_ER_BL, scale=1, stream=streamLeft, backend=vpi.Backend.VIC)
right = right.convert(vpi.Format.Y8_ER_BL, scale=1, stream=streamRight, backend=vpi.Backend.CUDA)
The left image conversion is submitted to the VIC backend via streamLeft, while the right image conversion is submitted to the NVIDIA CUDA backend on streamRight. This allows the two operations to run in parallel on different hardware units, which is a key advantage of VPI.
Synchronizing and executing stereo disparity
Before we can perform stereo disparity, we must ensure that both images are ready. We use streamLeft.sync() to block the main thread until the left image conversion is complete. Then, we can submit the vpi.stereodisp operation on streamRight.
# Synchronize streamLeft to ensure the left image is ready
streamLeft.sync()
# Submit the stereo disparity operation on streamRight
disparityS16 = vpi.stereodisp(left, right, backend=vpi.Backend.OFA|vpi.Backend.PVA|vpi.Backend.VIC, stream=streamRight)
The stereo disparity algorithm is executed on a combination of VPI backends (OFA, PVA, VIC) to take advantage of specialized hardware. The result is a disparity map in S16 format, representing the horizontal shift between corresponding pixels in the two images.
Post-processing and visualization
The raw disparity map needs to be post-processed for visualization. The disparity values, which are in Q10.5 fixed-point format, are scaled to a 0-255 range and saved.
# Post-process the disparity map
# Convert Q10.5 fixed point to U8 and scale for visualization
disparityU8 = disparityS16.convert(vpi.Format.U8, scale=255.0/(32*128), stream=streamRight, backend=vpi.Backend.CUDA)
# Wait for the work on streamRight to finish before accessing the result
streamRight.sync()
# Make the image accessible on the CPU
disparityU8 = disparityU8.cpu()
# Save with Pillow
d_pil = Image.fromarray(disparityU8)
d_pil.save('./disparity.png')
This final step converts the raw data into a human-readable image, where grayscale represents depth.
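As a follow-up, here is a short sketch of turning the disparity map into metric depth using the pinhole stereo relation depth = focal length × baseline / disparity. The focal length and baseline below are hypothetical placeholders; substitute your own stereo calibration values.
# Hypothetical calibration values for illustration only
focal_px = 700.0    # focal length in pixels (assumed)
baseline_m = 0.12   # stereo baseline in meters (assumed)
# Disparity is Q10.5 fixed point, so divide by 32 to get pixel units
streamRight.sync()  # ensure the disparity result is ready (harmless if already synchronized)
disp_px = np.asarray(disparityS16.cpu(), dtype=np.float32) / 32.0
# Depth is only defined where disparity is positive
depth_m = np.zeros_like(disp_px)
valid = disp_px > 0
depth_m[valid] = (focal_px * baseline_m) / disp_px[valid]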
Multi-stream disparity pipeline using C++ APIs
Advanced robotics applications need high throughput, which VPI enables through parallel multi-streaming. By combining streamlined APIs with efficient use of hardware accelerators, VPI lets developers build fast, reliable vision pipelines, similar to those powering Boston Dynamics’ next-gen robots.
VPI uses VPIStream objects, which are first-in-first-out (FIFO) command queues for submitting tasks to a backend asynchronously. This allows for parallel execution of operations on different hardware units (asynchronous streams).
For maximum performance in mission-critical applications, VPI’s C++ API is ideal.
The following code snippets are from a C++ benchmark that demonstrates how to build and run a multi-stream stereo disparity pipeline. The SimpleMultiStreamBenchmark C++ app showcases this by pre‑generating synthetic NV12_BL images to avoid runtime overhead, then running multiple streams in parallel and measuring frames per second (FPS) throughput. It also supports saving inputs and disparity/confidence maps for debugging. This example pre-generates data to simulate a high-speed, real-time workload.
Setting up resources, object declaration, and initialization
We first declare and initialize all of the objects VPI requires to run this pipeline per stream. This includes creating streams, input/output images, and stereo payloads. Since we will feed images of type NV12_BL to the stereo algorithm, we allocate images of that type, as well as Y8_ER images for the intermediate format conversion.
int totalIterations = itersPerStream * numStreams;
std::vector<VPIImage> leftInputs(numStreams), rightInputs(numStreams), confidences(numStreams), leftTmps(numStreams), rightTmps(numStreams);
std::vector<VPIImage> leftOuts(numStreams), rightOuts(numStreams), disparities(numStreams);
std::vector<VPIPayload> stereoPayloads(numStreams);
std::vector<VPIStream> streamsLeft(numStreams), streamsRight(numStreams);
std::vector<VPIEvent> events(numStreams);
int width = cvImageLeft.cols;
int height = cvImageLeft.rows;
int vic_pva_ofa = VPI_BACKEND_VIC | VPI_BACKEND_OFA | VPI_BACKEND_PVA;
VPIStereoDisparityEstimatorCreationParams stereoPayloadParams;
VPIStereoDisparityEstimatorParams stereoParams;
CHECK_STATUS(vpiInitStereoDisparityEstimatorCreationParams(&stereoPayloadParams));
CHECK_STATUS(vpiInitStereoDisparityEstimatorParams(&stereoParams));
stereoPayloadParams.maxDisparity = 128;
stereoParams.maxDisparity = 128;
stereoParams.confidenceType = VPI_STEREO_CONFIDENCE_RELATIVE;
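// Create per-stream resources: wrapped input images, streams, intermediate and output images, stereo payloads, and events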
for (int i = 0; i < numStreams; i++)
{
CHECK_STATUS(vpiImageCreateWrapperOpenCVMat(cvImageLeft, 0, &leftInputs[i]));
CHECK_STATUS(vpiImageCreateWrapperOpenCVMat(cvImageRight, 0, &rightInputs[i]));
CHECK_STATUS(vpiStreamCreate(0, &streamsLeft[i]));
CHECK_STATUS(vpiStreamCreate(0, &streamsRight[i]));
CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_Y8_ER, 0, &leftTmps[i]));
CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_NV12_BL, 0, &leftOuts[i]));
CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_Y8_ER, 0, &rightTmps[i]));
CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_NV12_BL, 0, &rightOuts[i]));
CHECK_STATUS(vpiCreateStereoDisparityEstimator(vic_pva_ofa, width, height, VPI_IMAGE_FORMAT_NV12_BL,
&stereoPayloadParams, &stereoPayloads[i]));
CHECK_STATUS(vpiEventCreate(0, &events[i]));
}
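// Allocate disparity and confidence outputs: one per iteration when saving results, otherwise one per stream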
int outCount = saveOutput ? (numStreams * itersPerStream) : numStreams;
disparities.resize(outCount);
confidences.resize(outCount);
for (int i = 0; i < outCount; i++)
{
CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_S16, 0, &disparities[i]));
CHECK_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_U16, 0, &confidences[i]));
}
Converting image format
We use VPI’s C API to submit the image conversion operations for each stream, converting the inputs to NV12_BL to mimic frames coming from a camera.
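// Per stream: convert the wrapped input to Y8_ER on the CPU, then to block-linear NV12_BL on the VIC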
for (int i = 0; i < numStreams; i++)
{
CHECK_STATUS(vpiSubmitConvertImageFormat(streamsLeft[i], VPI_BACKEND_CPU, leftInputs[i], leftTmps[i], NULL));
CHECK_STATUS(vpiSubmitConvertImageFormat(streamsLeft[i], VPI_BACKEND_VIC, leftTmps[i], leftOuts[i], NULL));
CHECK_STATUS(vpiEventRecord(events[i], streamsLeft[i]));
CHECK_STATUS(vpiSubmitConvertImageFormat(streamsRight[i], VPI_BACKEND_CPU, rightInputs[i], rightTmps[i], NULL));
CHECK_STATUS(vpiSubmitConvertImageFormat(streamsRight[i], VPI_BACKEND_VIC, rightTmps[i], rightOuts[i], NULL));
CHECK_STATUS(vpiStreamWaitEvent(streamsRight[i], events[i]));
}
for (int i = 0; i < numStreams; i++)
{
CHECK_STATUS(vpiStreamSync(streamsLeft[i]));
CHECK_STATUS(vpiStreamSync(streamsRight[i]));
}
As before, we submit the operations to different hardware (CPU and VIC) on the two separate streams. The formats are inferred from the types of the input/output images. This time, we also record a VPIEvent after the left stream’s conversion operations. A VPIEvent is a VPI object that lets one stream wait for another stream to complete all of the operations submitted up to the time of recording. This allows us to make the right stream wait on the left stream’s conversions without blocking the calling (main) thread, so multiple left and right streams can operate in parallel.
Synchronizing and executing stereo disparity
We use VPI’s C API to submit the stereo disparity operations, and we benchmark them using std::chrono.
auto benchmarkStart = std::chrono::high_resolution_clock::now();
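// Submit disparity estimation for every iteration and stream; submissions are asynchronous and do not block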
for (int iter = 0; iter < itersPerStream; iter++)
{
for (int i = 0; i < numStreams; i++)
{
int dispIdx = saveOutput ? (i * itersPerStream + iter) : i;
CHECK_STATUS(vpiSubmitStereoDisparityEstimator(streamsRight[i], vic_pva_ofa, stereoPayloads[i], leftOuts[i],
rightOuts[i], disparities[dispIdx], confidences[dispIdx],
&stereoParams));
}
}
// ====================
// End Benchmarking
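// Drain all streams so every submitted frame completes before the timer stops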
for (int i = 0; i < numStreams; i++)
{
CHECK_STATUS(vpiStreamSync(streamsRight[i]));
}
auto benchmarkEnd = std::chrono::high_resolution_clock::now();
As before, we submit our operation with a confidence map and get a resulting disparity map. We explicitly sync the streams only after submitting to all of them, so the calling thread is never blocked at submission time, and then stop the benchmarking timer, which captures the time taken to run stereo disparity across all streams and iterations.
Post-processing and cleanup
We use VPI’s C API and OpenCV interoperability to post-process the results. We optionally save the disparity and confidence maps for inspection, and then clean up all VPI objects after the loop.
// ====================
// Save Outputs
if (saveOutput)
{
for (int i = 0; i < numStreams * itersPerStream; i++)
{
VPIImageData dispData, confData;
cv::Mat cvDisparity, cvDisparityColor, cvConfidence, cvMask;
CHECK_STATUS(
vpiImageLockData(disparities[i], VPI_LOCK_READ, VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &dispData));
vpiImageDataExportOpenCVMat(dispData, &cvDisparity);
cvDisparity.convertTo(cvDisparity, CV_8UC1, 255.0 / (32 * stereoParams.maxDisparity), 0);
applyColorMap(cvDisparity, cvDisparityColor, cv::COLORMAP_JET);
CHECK_STATUS(vpiImageUnlock(disparities[i]));
std::ostringstream fpStream;
fpStream << "stream_" << i / itersPerStream << "_iter_" << i % itersPerStream << "_disparity.png";
imwrite(fpStream.str(), cvDisparityColor);
// Confidence output (U16 -> scale to 8-bit and save)
CHECK_STATUS(
vpiImageLockData(confidences[i], VPI_LOCK_READ, VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &confData));
vpiImageDataExportOpenCVMat(confData, &cvConfidence);
cvConfidence.convertTo(cvConfidence, CV_8UC1, 255.0 / 65535.0, 0);
CHECK_STATUS(vpiImageUnlock(confidences[i]));
std::ostringstream fpStreamConf;
fpStreamConf << "stream_" << i / itersPerStream << "_iter_" << i % itersPerStream << "_confidence.png";
imwrite(fpStreamConf.str(), cvConfidence);
}
}
// ====================
// Clean Up VPI Objects
for (int i = 0; i < numStreams; i++)
{
CHECK_STATUS(vpiStreamSync(streamsLeft[i]));
CHECK_STATUS(vpiStreamSync(streamsRight[i]));
vpiStreamDestroy(streamsLeft[i]);
vpiStreamDestroy(streamsRight[i]);
vpiImageDestroy(rightInputs[i]);
vpiImageDestroy(leftInputs[i]);
vpiImageDestroy(leftTmps[i]);
vpiImageDestroy(leftOuts[i]);
vpiImageDestroy(rightTmps[i]);
vpiImageDestroy(rightOuts[i]);
vpiPayloadDestroy(stereoPayloads[i]);
vpiEventDestroy(events[i]);
}
// Destroy all disparity and confidence images
for (int i = 0; i < (int)disparities.size(); i++)
{
vpiImageDestroy(disparities[i]);
}
for (int i = 0; i < (int)confidences.size(); i++)
{
vpiImageDestroy(confidences[i]);
}
Collect benchmarking results
We can now collect and display our benchmarking results.
// Elapsed time in microseconds, computed from the chrono timestamps recorded above
double totalTime = std::chrono::duration_cast<std::chrono::microseconds>(benchmarkEnd - benchmarkStart).count();
double totalTimeSeconds = totalTime / 1000000.0;
double avgTimePerFrame = totalTimeSeconds / totalIterations;
double throughputFPS = totalIterations / totalTimeSeconds;
std::cout << "\n" << std::string(70, '=') << std::endl;
std::cout << "SIMPLE MULTI-STREAM RESULTS" << std::endl;
std::cout << std::string(70, '=') << std::endl;
std::cout << "Input: RGB8 -> Y8_ER -> NV12_BL" << std::endl;
std::cout << "Total time: " << totalTimeSeconds << " seconds" << std::endl;
std::cout << "Avg time per frame: " << (avgTimePerFrame * 1000) << " ms" << std::endl;
std::cout << "THROUGHPUT: " << throughputFPS << " FPS" << std::endl;
std::cout << std::string(70, '=') << std::endl;
Review results
Given an image resolution of 960×600 and a maximum disparity of 128, this solution achieves 30 FPS with eight simultaneous streams running stereo disparity estimation, including confidence maps, on Thor T5000 without any load on the GPU. This is about 10x faster than on Orin AGX 64 GB. The power setting is MAX_N in both cases. Performance is shown in Table 1.
| Number of streams | Orin AGX 64 GB (FPS) | Jetson Thor T5000 (FPS) | Speed-up ratio |
| --- | --- | --- | --- |
| 1 | 22 | 122 | 5.5 |
| 2 | 12 | 111 | 9.5 |
| 4 | 6 | 58 | 9.7 |
| 8 | 3 | 29 | 9.7 |
Table 1. Stereo disparity full pipeline (RELATIVE mode, resolution 960×600, max disparity 128): frame rate on Orin AGX 64 GB vs. Jetson Thor T5000
How Boston Dynamics uses VPI
As heavy users of the Jetson platform, Boston Dynamics relies on the Vision Programming Interface (VPI) to accelerate its perception pipeline.
VPI enables seamless access to Jetson’s specialized hardware accelerators, offering a suite of optimized vision algorithms such as AprilTag detection and SGM stereo disparity, feature detectors and trackers like ORB, Harris corners, and pyramidal LK, and OFA-powered optical flow. These are core to Boston Dynamics’ perception stack, supporting both prototype testing and system optimization through load balancing. By adopting VPI, engineers can quickly adapt to hardware updates and shorten time-to-value.
Takeaways
Advancements in the hardware capabilities of the Jetson Thor platform, together with libraries like VPI, empower developers to design efficient, low-latency solutions for edge-based robotics.
By utilizing the unique features of each available accelerator on Jetson, robotics companies such as Boston Dynamics can achieve sophisticated vision processing that is both efficient and scalable, a key step in making intelligent, autonomous robots a reality in various real-world applications.
To get started with building your own CV applications on Jetson, check out the following:
- Access VPI through JetPack SDK
- VPI documentation and sample applications
- Learn more about PVA SDK for building custom algorithms targeting the PVA engine