Developer Tools & Techniques

Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM

Large language models (LLMs) and multimodal reasoning systems are rapidly expanding beyond the data center. Automotive and robotics developers increasingly want to run conversational AI agents, multimodal perception, and high-level planning directly on the vehicle or robot – where latency, reliability, and the ability to operate offline matter most.

While many existing LLM and vision language model (VLM) inference frameworks focus on data center needs such as managing large volumes of concurrent user requests and maximizing throughput across them, embedded inference requires a dedicated, tailored solution.

This post introduces NVIDIA TensorRT Edge-LLM, a new, open source C++ framework for LLM and VLM inference that addresses the emerging need for high-performance edge inference. Edge-LLM is purpose-built for real-time applications on the embedded automotive and robotics platforms NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. The framework is provided as open source on GitHub for the NVIDIA JetPack 7.1 release.

TensorRT Edge-LLM has minimal dependencies, enabling deployment in production edge applications. Its lean, lightweight design, with a clear focus on embedded-specific capabilities, minimizes the framework's resource footprint.

In addition, TensorRT Edge-LLM's advanced features, such as EAGLE-3 speculative decoding, NVFP4 quantization support, and chunked prefill, provide cutting-edge performance for demanding real-time use cases.
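To give a rough intuition for one of these features, the following sketch illustrates the idea behind block-scaled 4-bit formats such as NVFP4: weights are grouped into small blocks, each block carries a shared scale, and every value is rounded to one of the few representable FP4 magnitudes. This is a simplified, standalone illustration only; the actual block size, scale encoding, and rounding are handled by the TensorRT quantization toolchain rather than by hand-written code like this.

// Simplified illustration of block-scaled FP4 (E2M1) quantization, NVFP4-style.
// Hypothetical helper code: real NVFP4 uses 16-element blocks with FP8 block
// scales and is produced by the quantization tools, not written by hand.
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

// Representable non-negative FP4 (E2M1) magnitudes.
static const std::array<float, 8> kFp4Levels = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};

struct QuantizedBlock {
  float scale;               // shared per-block scale (stored as FP8 in the real format)
  std::vector<float> codes;  // quantized values, each +/- one of kFp4Levels
};

QuantizedBlock quantizeBlock(const std::vector<float>& block) {
  float maxAbs = 0.f;
  for (float v : block) maxAbs = std::max(maxAbs, std::fabs(v));
  QuantizedBlock q;
  q.scale = (maxAbs > 0.f) ? maxAbs / 6.f : 1.f;  // 6 is the largest FP4 magnitude
  for (float v : block) {
    float target = std::fabs(v) / q.scale;
    float best = kFp4Levels[0];
    for (float level : kFp4Levels)  // round to the nearest representable level
      if (std::fabs(level - target) < std::fabs(best - target)) best = level;
    q.codes.push_back(std::copysign(best, v));
  }
  return q;
}

int main() {
  std::vector<float> weights = {0.02f, -0.7f, 1.3f, 0.25f, -2.4f, 0.9f, 3.1f, -0.1f};
  QuantizedBlock q = quantizeBlock(weights);
  for (size_t i = 0; i < weights.size(); ++i)
    std::printf("%+.2f -> %+.2f\n", weights[i], q.codes[i] * q.scale);  // dequantized value
  return 0;
}

Because the scale is stored per small block rather than per tensor, outliers in one block do not degrade the precision of the remaining weights, which is what makes 4-bit storage practical for LLM weights on memory-constrained edge devices.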

A green bar chart shows TensorRT Edge-LLM performance for newer Qwen3 LLM and VLM models. Configurations where speculative decoding is enabled show substantially better performance.
Figure 1. TensorRT Edge-LLM shows compelling performance using Qwen3 with speculative decoding
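The gains from speculative decoding shown in Figure 1 come from a draft-and-verify pattern: a small draft model proposes several tokens cheaply, and the target model checks all of them in a single engine call, so multiple tokens can be accepted per expensive forward pass. The following is a framework-agnostic, greedy-decoding sketch of that pattern; the model callbacks and the speculativeDecode function are illustrative placeholders rather than the TensorRT Edge-LLM API, and EAGLE-3 additionally conditions its draft head on hidden features from the target model.

// Minimal greedy draft-and-verify sketch (not the TensorRT Edge-LLM API).
#include <cstdint>
#include <functional>
#include <vector>

using Token = int32_t;
// Draft model: returns its greedy next token for the given context.
using DraftModel = std::function<Token(const std::vector<Token>&)>;
// Target model: scores context + proposal in ONE call and returns its greedy
// choice after the context and after each proposed token (proposal size + 1 entries).
using TargetModel = std::function<std::vector<Token>(
    const std::vector<Token>&, const std::vector<Token>&)>;

std::vector<Token> speculativeDecode(const TargetModel& target, const DraftModel& draft,
                                     std::vector<Token> tokens, int draftLen,
                                     int maxNewTokens, Token eos) {
  int generated = 0;
  while (generated < maxNewTokens) {
    // 1. Draft phase: the cheap model proposes draftLen tokens autoregressively.
    std::vector<Token> proposal;
    std::vector<Token> ctx = tokens;
    for (int i = 0; i < draftLen; ++i) {
      Token t = draft(ctx);
      proposal.push_back(t);
      ctx.push_back(t);
    }
    // 2. Verify phase: one target-model call covers every proposed position at once.
    std::vector<Token> verdict = target(tokens, proposal);
    // 3. Accept the longest prefix the target agrees with, then take the target's
    //    own token at the first disagreement (or its bonus token if all matched).
    size_t accepted = 0;
    while (accepted < proposal.size() && verdict[accepted] == proposal[accepted]) ++accepted;
    for (size_t i = 0; i <= accepted; ++i) {
      Token t = (i < accepted) ? proposal[i] : verdict[accepted];
      tokens.push_back(t);
      ++generated;
      if (t == eos || generated >= maxNewTokens) return tokens;
    }
  }
  return tokens;
}

In the best case, each iteration accepts draftLen + 1 tokens for the cost of a single target-model call, which is where the per-token latency improvement in Figure 1 comes from.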

LLM and VLM inference for real-time edge use cases

Edge LLM and VLM inference workloads are defined by the following characteristics:

  • Requests from a single user or a few users
  • Low batch sizes, usually spread across cameras
  • Production deployments for mission-critical applications
  • Offline operation without updates

As a consequence, robotics and automotive real-time applications come with specific requirements, including:

  • Minimal and predictable latency
  • Minimal disk, memory, and compute requirements
  • Compliance with production standards
  • High robustness and reliability

TensorRT Edge-LLM is designed to fulfill and prioritize these embedded-specific needs to provide a strong foundation for embedded LLM and VLM inference.

Rapid adoption of TensorRT Edge-LLM for automotive use cases

Partners including Bosch, ThunderSoft, and MediaTek are already leveraging TensorRT Edge-LLM as a foundation for their in-car AI products and are showcasing their technology at CES 2026.

Bosch is developing the Bosch AI-powered Cockpit in collaboration with Microsoft and NVIDIA. It features an innovative in-car AI assistant capable of natural voice interactions. The solution uses embedded automatic speech recognition (ASR) and text-to-speech (TTS) AI models in conjunction with LLM inference through TensorRT Edge-LLM for a powerful onboard AI that cooperates with larger, cloud-based AI models through a sophisticated orchestrator.

ThunderSoft integrates TensorRT Edge-LLM into its upcoming AIBOX platform, based on NVIDIA DRIVE AGX Orin, to enable responsive, on-device LLM and multimodal inference inside the vehicle. By combining the ThunderSoft automotive software stack with the TensorRT Edge-LLM lightweight C++ runtime and optimized decoding path, the AIBOX delivers low-latency conversational and cockpit-assist experiences within strict power and memory limits. 

MediaTek builds on top of TensorRT Edge-LLM for its CX1 SoC, which enables cutting-edge cabin AI and HMI applications. TensorRT Edge-LLM accelerates both LLM and VLM inference for a wide range of use cases, including driver and cabin activity monitoring. MediaTek contributes to the development of TensorRT Edge-LLM with new embedded-specific inference methods.

With the launch of TensorRT Edge-LLM, these LLM and VLM inference capabilities are now available for the NVIDIA Jetson ecosystem as the foundation for robotics technology.

TensorRT Edge-LLM under the hood

TensorRT Edge-LLM is designed to provide an end-to-end workflow for LLM and VLM inference. It spans three stages:

  • Exporting Hugging Face models to ONNX
  • Building optimized NVIDIA TensorRT engines for the target hardware
  • Running inference on the target hardware
A flowchart showing how on x86 host computers, Hugging Face models are the input of the Python Export Pipeline, which produces ONNX models as an output. On the target, these ONNX models are used by the Engine Builder to build TensorRT Engines. These engines are then used by the LLM Runtime to produce inference results for users’ applications.
Figure 2. TensorRT Edge-LLM workflow with key components

The Python export pipeline converts Hugging Face models to ONNX format with support for quantization, LoRA adapters, and EAGLE-3 speculative decoding (Figure 3).

A flowchart showing the quantization and export tools the TensorRT Edge-LLM Python Export Pipeline provides for different Hugging Face models. For base/vanilla models, quantize-llm, export-llm, and insert-lora are provided: export-llm generates the Base ONNX model, while insert-lora generates the LoRA-enabled ONNX model. For LoRA weights, the process-LoRA tool provides SafeTensors. For EAGLE draft models, quantize-draft and export-draft create the EAGLE Draft ONNX model. For Vision Transformers, the export-visual tool takes care of both quantization and export to provide an ONNX model as output.
Figure 3. TensorRT Edge-LLM Python export pipeline stages and tools

The engine builder builds TensorRT engines optimized specifically for the embedded target hardware (Figure 4).

A flowchart showing how ONNX Models and Export Configs are processed by the TensorRT Edge-LLM Engine Builder. Depending on whether the model is an LLM or VLM, the TensorRT Edge-LLM LLM Builder or VIT Builder will be used.
Figure 4. TensorRT Edge-LLM engine builder workflow

The C++ runtime is responsible for LLM and VLM inference on the target hardware. It uses the TensorRT engines for the decoding loop that defines autoregressive models: iterative token generation based on the input and previously generated tokens. User applications interface with this runtime to solve LLM and VLM workloads.

A flowchart showing the prefill phase and decode phase of TensorRT Edge-LLM C++ Runtime. Based on the tokenized input prompt, the TRT engine runs and provides logits across possible output tokens. The KV cache is then generated and a first token is chosen through sampling. The runtime then enters the decoding phase, where the TensorRT engine is used to generate the next logits, followed by a KV cache update and token sampling. Then it is checked whether the stop condition (EOS token) is met; if not, the loop continues with another TRT engine call; if so, the generated sequence is returned.
Figure 5. Prefill and decode phases of TensorRT Edge-LLM C++ Runtime
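The loop in Figure 5 can be summarized in a few lines of code. The sketch below uses a placeholder engine callback instead of the actual TensorRT engine and runtime classes, and it also shows where the chunked prefill feature mentioned earlier fits in: the prompt is fed through the engine in bounded chunks while the KV cache is built up, keeping the latency and memory of each engine call predictable.

// Conceptual prefill + decode loop with a placeholder engine (not the real runtime API).
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

using Token = int32_t;
// Placeholder for one engine execution: consumes new tokens, updates its KV cache
// internally, and returns the greedy next token after the last input position.
using EngineStep = std::function<Token(const std::vector<Token>& newTokens)>;

std::vector<Token> generate(const EngineStep& engine, const std::vector<Token>& prompt,
                            int maxNewTokens, Token eos, size_t prefillChunk = 512) {
  // Prefill phase: process the prompt in chunks to bound per-call latency
  // (chunked prefill); the KV cache grows with every chunk.
  Token next = eos;
  for (size_t pos = 0; pos < prompt.size(); pos += prefillChunk) {
    size_t end = std::min(prompt.size(), pos + prefillChunk);
    std::vector<Token> chunk(prompt.begin() + pos, prompt.begin() + end);
    next = engine(chunk);  // last chunk yields the first generated token
  }
  // Decode phase: feed one token at a time until EOS or the token budget is reached.
  std::vector<Token> output;
  while (static_cast<int>(output.size()) < maxNewTokens) {
    output.push_back(next);
    if (next == eos) break;
    next = engine({next});  // KV cache already holds all prior context
  }
  return output;
}

A real application would sample from the returned logits, manage the KV cache explicitly, and stream tokens back to the caller as they are generated; the actual interfaces are described in the runtime documentation.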

For a more detailed explanation of the components, see the TensorRT Edge-LLM documentation.

Get started with TensorRT Edge-LLM

Ready to get started with LLM and VLM inference on your Jetson AGX Thor DevKit?

1. Download the JetPack 7.1 release.

2. Clone the JetPack 7.1 release branch of the NVIDIA/TensorRT-Edge-LLM GitHub repo:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git

3. Check the TensorRT Edge-LLM Quick Start Guide for detailed instructions on getting out-of-the-box supported models from Hugging Face, converting them to ONNX, building TensorRT engines for your Jetson AGX Thor platform, and running them with the C++ runtime.

4. Explore the TensorRT Edge-LLM examples to learn more about features and capabilities.

5. See the TensorRT Edge-LLM Customization Guide to adapt TensorRT Edge-LLM to your own needs.

For NVIDIA DRIVE AGX Thor users, TensorRT Edge-LLM is part of the NVIDIA DriveOS release package. Upcoming DriveOS releases will leverage the GitHub repo.

As LLMs and VLMs move rapidly to the edge, TensorRT Edge-LLM provides a clean, reliable path from Hugging Face models to real-time, production-grade execution on NVIDIA automotive and robotics platforms.

Explore the workflow, test your models, and begin building the next generation of intelligent on-device applications. To learn more, visit the NVIDIA/TensorRT-Edge-LLM GitHub repo.

Acknowledgments 

Thank you to Michael Ferry, Nicky Liu, Martin Chi, Ruocheng Jia, Charl Li, Maggie Hu, Krishna Sai Chemudupati, Frederik Kaster, Xiang Guo, Yuan Yao, Vincent Wang, Levi Chen, Chen Fu, Le An, Josh Park, Xinru Zhang, Chengming Zhao, Sunny Gai, Ajinkya Rasani, Zhijia Liu, Ever Wong, Wenting Jiang, Jonas Li, Po-Han Huang, Brant Zhao, Yiheng Zhang, and Ashwin Nanjappa for your contributions to and support of TensorRT Edge-LLM. 
