Large language models (LLMs) and multimodal reasoning systems are rapidly expanding beyond the data center. Automotive and robotics developers increasingly want to run conversational AI agents, multimodal perception, and high-level planning directly on the vehicle or robot – where latency, reliability, and the ability to operate offline matter most.
While many existing LLM and vision language model (VLM) inference frameworks focus on data center needs such as managing large volumes of concurrent user requests and maximizing throughput across them, embedded inference requires a dedicated, tailored solution.
This post introduces NVIDIA TensorRT Edge-LLM, a new open source C++ framework for LLM and VLM inference that addresses the emerging need for high-performance edge inference. Edge-LLM is purpose-built for real-time applications on the embedded automotive and robotics platforms NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. The framework is available as open source on GitHub for the NVIDIA JetPack 7.1 release.
TensorRT Edge-LLM has minimal dependencies, enabling deployment in production edge applications. Its lean, lightweight design, with a clear focus on embedded-specific capabilities, minimizes the framework's resource footprint.
In addition, advanced TensorRT Edge-LLM features such as EAGLE-3 speculative decoding, NVFP4 quantization support, and chunked prefill provide cutting-edge performance for demanding real-time use cases.
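For intuition on why speculative decoding helps latency-bound, small-batch workloads, the following is a toy C++ sketch of the generic draft-and-verify pattern with greedy decoding. The model stand-ins and function names are illustrative assumptions, not the TensorRT Edge-LLM API, and EAGLE-3 differs in how draft tokens are produced; the core idea is that several tokens can be accepted per full-model pass.

```cpp
// Toy, self-contained illustration of draft-and-verify speculative decoding
// (greedy variant). The "models" are stand-in lambdas, not real networks,
// and this is not the TensorRT Edge-LLM API.
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

using Token = int32_t;
using Tokens = std::vector<Token>;
// Draft model: cheaply proposes k candidate tokens following the context.
using DraftModel = std::function<Tokens(const Tokens&, int)>;
// Target model: given the context plus k drafted tokens, returns its greedy
// choice for each drafted position plus one bonus token (k + 1 tokens total).
using TargetModel = std::function<Tokens(const Tokens&, const Tokens&)>;

// One speculative step: draft, verify in a single target pass, and accept the
// longest prefix on which draft and target agree.
Tokens speculativeStep(Tokens context, int k,
                       const DraftModel& draft, const TargetModel& target)
{
    Tokens proposed = draft(context, k);
    Tokens verified = target(context, proposed);

    size_t accepted = 0;
    while (accepted < proposed.size() && proposed[accepted] == verified[accepted])
        ++accepted;

    // Accepted tokens plus the target's token at the first mismatch (or its
    // bonus token if everything matched): up to k + 1 tokens per target pass.
    context.insert(context.end(), verified.begin(), verified.begin() + accepted + 1);
    return context;
}

int main()
{
    // Toy stand-ins: the "target" always continues with 103, 104, 105, ...;
    // the "draft" guesses the same continuation but gets the third token wrong.
    TargetModel target = [](const Tokens& ctx, const Tokens& proposed) {
        Tokens out;
        for (size_t i = 0; i <= proposed.size(); ++i)
            out.push_back(static_cast<Token>(100 + ctx.size() + i));
        return out;
    };
    DraftModel draft = [](const Tokens& ctx, int k) {
        Tokens out;
        for (int i = 0; i < k; ++i)
            out.push_back(i < 2 ? static_cast<Token>(100 + ctx.size() + i) : -1);
        return out;
    };

    Tokens context = {100, 101, 102};
    Tokens extended = speculativeStep(context, /*k=*/4, draft, target);
    // Three tokens (103, 104, 105) were generated from a single target pass.
    for (size_t i = context.size(); i < extended.size(); ++i)
        std::cout << extended[i] << " ";
    std::cout << std::endl;
    return 0;
}
```

With a well-matched draft, multiple tokens are accepted per full-model pass, which directly reduces per-token latency at batch size 1, exactly the regime edge applications run in.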

LLM and VLM inference for real-time edge use cases
Edge LLM and VLM inference workloads are defined by the following characteristics:
- Requests from few users or a single user
- Low batch sizes, usually spread across cameras
- Production deployments for mission-critical applications
- Offline operation without updates
As a consequence, robotics and automotive real-time applications come with specific requirements, including:
- Minimal and predictable latency
- Minimal disk, memory, and compute requirements
- Compliance with production standards
- High robustness and reliability
TensorRT Edge-LLM is designed to fulfill and prioritize these embedded-specific needs to provide a strong foundation for embedded LLM and VLM inference.
Rapid adoption of TensorRT Edge-LLM for automotive use cases
Partners, including Bosch, ThunderSoft, and MediaTek, are already leveraging TensorRT Edge-LLM as a foundation for their in-car AI products and are showcasing their technology at CES 2026.
The Bosch AI-powered Cockpit, developed in collaboration with Microsoft and NVIDIA, features an innovative in-car AI assistant capable of natural voice interactions. The solution uses embedded automatic speech recognition (ASR) and text-to-speech (TTS) AI models in conjunction with LLM inference through TensorRT Edge-LLM for a powerful onboard AI that cooperates with larger, cloud-based AI models through a sophisticated orchestrator.
ThunderSoft integrates TensorRT Edge-LLM into its upcoming AIBOX platform, based on NVIDIA DRIVE AGX Orin, to enable responsive, on-device LLM and multimodal inference inside the vehicle. By combining the ThunderSoft automotive software stack with the TensorRT Edge-LLM lightweight C++ runtime and optimized decoding path, the AIBOX delivers low-latency conversational and cockpit-assist experiences within strict power and memory limits.
MediaTek builds on TensorRT Edge-LLM for its CX1 SoC, which enables cutting-edge cabin AI and HMI applications. TensorRT Edge-LLM accelerates both LLM and VLM inference for a wide range of use cases, including driver and cabin activity monitoring. MediaTek contributes to the development of TensorRT Edge-LLM with new embedded-specific inference methods.
With the launch of TensorRT Edge-LLM, these LLM and VLM inference capabilities are now available for the NVIDIA Jetson ecosystem as the foundation for robotics technology.
TensorRT Edge-LLM under the hood
TensorRT Edge-LLM is designed to provide an end-to-end workflow for LLM and VLM inference. It spans three stages:
- Exporting Hugging Face models to ONNX
- Building optimized NVIDIA TensorRT engines for the target hardware
- Running inference on the target hardware

The Python export pipeline converts Hugging Face models to ONNX format with support for quantization, LoRA adapters, and EAGLE-3 speculative decoding (Figure 3).

The engine builder builds TensorRT engines optimized specifically for the embedded target hardware (Figure 4).
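As a rough illustration of this stage, the following minimal sketch uses the standard TensorRT C++ API to build and serialize an engine from an exported ONNX file. The file names are placeholders; the TensorRT Edge-LLM engine builder automates this step and adds LLM-specific configuration not shown here.

```cpp
// Minimal sketch: parse an ONNX model and build a serialized TensorRT engine
// with the standard TensorRT C++ API (not the Edge-LLM engine builder itself).
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main()
{
    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(
        nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(
        builder->createNetworkV2(0));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));

    // Parse the ONNX model produced by the export pipeline (placeholder path).
    if (!parser->parseFromFile("model.onnx",
            static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
    {
        std::cerr << "Failed to parse ONNX model" << std::endl;
        return 1;
    }

    // Build a serialized engine optimized for the GPU this code runs on.
    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(
        builder->createBuilderConfig());
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));

    // Save the plan file so the runtime can load it without rebuilding.
    std::ofstream out("model.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()),
              static_cast<std::streamsize>(serialized->size()));
    return 0;
}
```

TensorRT engines are optimized for a specific GPU and TensorRT version, which is why this stage targets the embedded device that will serve inference rather than a generic build machine.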

The C++ runtime is responsible for LLM and VLM inference on the target hardware. It makes use of the TensorRT engines for the decoding loop that defines autoregressive models: iterative token generation based on the input and previously generated tokens. User applications interface with this runtime to solve LLM and VLM workloads.
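The sketch below illustrates this decoding loop conceptually: a single prefill pass over the prompt followed by one engine execution per generated token. The ILlmRuntime interface and its methods are hypothetical placeholders for illustration, not the actual TensorRT Edge-LLM API; see the repository examples for the real entry points.

```cpp
// Conceptual sketch of the autoregressive decoding loop a runtime implements
// on top of prebuilt TensorRT engines. ILlmRuntime is an illustrative
// assumption, not the TensorRT Edge-LLM interface.
#include <cstdint>
#include <string>
#include <vector>

using Token = int32_t;

// Hypothetical runtime interface wrapping the prebuilt TensorRT engines.
struct ILlmRuntime
{
    virtual ~ILlmRuntime() = default;
    virtual std::vector<Token> tokenize(const std::string& prompt) = 0;
    virtual void prefill(const std::vector<Token>& promptTokens) = 0;  // Fills the KV cache.
    virtual Token decodeNext(Token lastToken) = 0;                     // One engine execution.
    virtual std::string detokenize(const std::vector<Token>& tokens) = 0;
    virtual Token eosToken() const = 0;
};

// Generate up to maxNewTokens tokens: one prefill pass over the prompt, then
// one decode step per output token, each conditioned on the prompt and all
// previously generated tokens (state lives in the runtime's KV cache).
std::string generate(ILlmRuntime& runtime, const std::string& prompt, int maxNewTokens)
{
    std::vector<Token> promptTokens = runtime.tokenize(prompt);
    runtime.prefill(promptTokens);

    std::vector<Token> generated;
    Token next = promptTokens.back();
    for (int i = 0; i < maxNewTokens; ++i)
    {
        next = runtime.decodeNext(next);
        if (next == runtime.eosToken())
            break;
        generated.push_back(next);
    }
    return runtime.detokenize(generated);
}
```

A user application would obtain a concrete runtime from the framework and drive a loop like this per request; the actual interfaces and features are documented in the TensorRT Edge-LLM examples.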

For a more detailed explanation of the components, see the TensorRT Edge-LLM documentation.
Get started with TensorRT Edge-LLM
Ready to get started with LLM and VLM inference on your Jetson AGX Thor DevKit?
1. Download the JetPack 7.1 release.
2. Clone the JetPack 7.1 release branch of the NVIDIA/TensorRT-Edge-LLM GitHub repo:
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
3. Check the TensorRT Edge-LLM Quick Start Guide for detailed instructions on getting out-of-the-box supported models from Hugging Face, converting them to ONNX, building TensorRT engines for your Jetson AGX Thor platform, and running them with the C++ runtime.
4. Explore the TensorRT Edge-LLM examples to learn more about features and capabilities.
5. See the TensorRT Edge-LLM Customization Guide to adapt TensorRT Edge-LLM to your own needs.
For NVIDIA DRIVE AGX Thor users, TensorRT Edge-LLM is part of the NVIDIA DriveOS release package. Upcoming DriveOS releases will leverage the GitHub repo.
As LLMs and VLMs move rapidly to the edge, TensorRT Edge-LLM provides a clean, reliable path from Hugging Face models to real-time, production-grade execution on NVIDIA automotive and robotics platforms.
Explore the workflow, test your models, and begin building the next generation of intelligent on-device applications. To learn more, visit the NVIDIA/TensorRT-Edge-LLM GitHub repo.
Acknowledgments
Thank you to Michael Ferry, Nicky Liu, Martin Chi, Ruocheng Jia, Charl Li, Maggie Hu, Krishna Sai Chemudupati, Frederik Kaster, Xiang Guo, Yuan Yao, Vincent Wang, Levi Chen, Chen Fu, Le An, Josh Park, Xinru Zhang, Chengming Zhao, Sunny Gai, Ajinkya Rasani, Zhijia Liu, Ever Wong, Wenting Jiang, Jonas Li, Po-Han Huang, Brant Zhao, Yiheng Zhang, and Ashwin Nanjappa for your contributions to and support of TensorRT Edge-LLM.