Developer Tools & Techniques

Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics

NVIDIA TensorRT Edge‑LLM introduces support for MoEs, Cosmos Reason 2, and Qwen3-TTS/ASR on NVIDIA Jetson and NVIDIA DRIVE

Mar 12, 2026

By Lin Chai, Luxiao Zheng, Fan Shi, Maximilien Breughe and Michael Ferry

Discuss (0)

AI-Generated Summary

Dislike

The latest release of NVIDIA TensorRT Edge-LLM introduces advanced support for mixture of experts (MoE), hybrid reasoning architectures, and the NVIDIA Nemotron family on embedded platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor, enabling high-fidelity, low-latency autonomous machine intelligence within strict power constraints.
Native multimodal interaction is achieved through optimized Qwen3-TTS and Qwen3-ASR models, allowing end-to-end, low-latency voice dialogue with a Thinker-Talker framework, and Cosmos Reason 2 enables advanced spatio-temporal reasoning, 3D localization, and long-context processing for humanoid robotics and embodied agents at the edge.
NVIDIA Alpamayo integration supports end-to-end trajectory planning in autonomous vehicles, employing flow matching trajectory decoding, explainable decision-making with multicamera context, and FP8-accelerated Vision Transformers, marking a shift from modular stacks to production-ready, reasoning-based VLA models.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Physical AI is rapidly evolving, from next-generation software-defined autonomous vehicles (AVs) to humanoid robots. The challenge is no longer how to run a large language model (LLM), but how to enable high-fidelity reasoning, real-time multimodal interaction, and trajectory planning within strict power and latency envelopes.

NVIDIA TensorRT Edge-LLM, a high-performance C++ inference runtime for LLMs and vision language models (VLMs) on embedded platforms, is designed to overcome these challenges.

As explained in this post, the latest TensorRT Edge-LLM release delivers a significant expansion in fundamental capabilities for NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor platforms. It introduces advanced edge architectures, including mixture of experts (MoE), the NVIDIA Cosmos Reason 2 open planning model for physical AI, and Qwen3-TTS and Qwen-ASR models for embedded speech processing. Building on these foundational pillars, the release also offers optimized support for the NVIDIA Nemotron family of open models. This provides developers with the essential runtime to build the next generation of autonomous machines.

Efficient reasoning at scale

Running massive models on embedded hardware requires a rethink of compute efficiency. The latest release of TensorRT Edge-LLM fully enables MoE support at the edge, specifically optimizing models like Qwen3 MoE. By activating only a subset of expert parameters per token, MoE architectures enable edge devices to access the reasoning capabilities of a massive model while maintaining the inference latency and active compute footprint of a much smaller one.

This architectural shift is critical for deploying high-fidelity reasoning on edge platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. As a developer, you can drastically scale up the intelligence of your autonomous systems without exceeding the strict power and latency limits required for real-time, mission-critical operations.

Unlock hybrid reasoning at the edge

TensorRT Edge-LLM is a specialized runtime to fully support NVIDIA Nemotron 2 Nano. This enables a new class of System 2 reasoning directly on embedded chipsets, including NVIDIA DRIVE Thor and Jetson Thor.

For developers building advanced in-cabin AI assistants or robotic dialogue agents, deploying highly capable language models at the edge presents a significant memory and latency challenge. Nemotron 2 Nano addresses this challenge fundamentally by utilizing a novel Hybrid Mamba-2-Transformer architecture. This significantly reduces the memory footprint from KV cache storage with Mamba State Space architectures while maintaining high-fidelity precision from attention layers.

TensorRT Edge-LLM bridges the deployment gap by providing optimized kernels that accelerate these specific hybrid layers. This enables developers to use the model’s massive context window for complex edge retrieval-augmented generation (RAG) pipelines or agentic workflows while maintaining a strict, production-viable device memory footprint.

By enabling dynamic “thinking” at the edge with TensorRT Edge-LLM, developers can leverage a model’s ability to shift seamlessly between deep reasoning and immediate conversational action. This is a critical capability for advanced in-cabin assistants and robotic agents that must reason through complex user queries one moment and provide conversational responses the next.

Deep reasoning mode (/think): TensorRT Edge-LLM efficiently handles the expanded token generation required for chain of thought (CoT) processing. By using the /think system prompt, the runtime enables the model to think through complex logic, achieving a remarkable 97.8% on MATH500—before outputting a decision.
Conversational reflex mode (/no_think): For latency-critical voice interactions where the user expects an immediate reply, developers can issue a /no_think command. TensorRT Edge-LLM optimizes this path to bypass reasoning traces, delivering immediate, intelligent responsiveness required for seamless conversational AI and agile on-device agents.

By supporting this hybrid architecture, TensorRT Edge-LLM enables compact, production-ready VLMs and LLMs to serve as both reasoned assistants and low-latency conversational agents, significantly reducing the memory constraints of physical AI.

Real-time multimodal interaction at the edge

TensorRT Edge-LLM now offers support for Qwen3-TTS and Qwen3-ASR, a native multimodal model with Thinker-Talker architecture capable of voice interaction. Unlike traditional pipelines that cascade ASR, LLM, and TTS models, adding latency at every hop, Qwen3-TTS/ASR handles end-to-end speech processing.

By optimizing both the Thinker and Talker components, TensorRT Edge-LLM enables low-latency, natural voice synthesis directly on the chip:

Thinker: TensorRT Edge-LLM accelerates the reasoning core, allowing the model to process complex driver queries and environment context to generate intelligent, reasoned responses.
Talker: TensorRT Edge-LLM complements the reasoning engine by delivering low latency, natural voice synthesis (TTS) directly on the chip.

In the case of AVs, this allows for seamless, interruptible conversations between the driver and the vehicle.

Equipping humanoid robotics with physical common sense

For humanoid robots and advanced vision agents, understanding the real world requires more than just identifying objects; it requires an intuitive grasp of physics and time. To meet this need, TensorRT Edge-LLM now supports Cosmos Reason 2, an open, customizable reasoning VLM purpose-built for physical AI and robotics.

Cosmos Reason 2 empowers embodied agents to reason like humans by using prior knowledge, physical common sense, and chain-of-thought capabilities to understand world dynamics without human annotations. With TensorRT Edge-LLM optimized, low-latency runtime, robots at the edge can efficiently leverage Cosmos Reason 2 as a primary planning model to reason through their next steps.

Key capabilities of Cosmos Reason 2 accelerated by TensorRT Edge-LLM include:

Advanced spatio-temporal reasoning: Enhanced physical AI reasoning with improved timestamp precision and a deep understanding of space, time, and fundamental physics.
3D localization and explanation: The ability to not only detect objects but also provide 2D and 3D point localization, bounding-box coordinates, and contextual reasoning explanations for its labels.
Massive context processing: Support for an improved long-context window of up to 256K input tokens, allowing edge agents to ingest extensive environmental and historical data.

By supporting Cosmos Reason 2, TensorRT Edge-LLM ensures that next-generation robots can continuously evaluate complex, long-tail physical scenarios and safely plan their actions in real time.

Advancing autonomous driving with end-to-end trajectory planning

Among the most significant shifts in autonomous production is the move from traditional modular stacks to end-to-end VLA models. NVIDIA Alpamayo is a family of open AI models, simulation frameworks, and physical AI datasets designed to accelerate the development of safe, transparent, and reasoning-based AVs.

Stay tuned for the forthcoming Alpamayo 1 workflow, a distillation recipe that brings System 2 rational thinking to the edge. Alpamayo 1 represents a leap forward from standard VLMs. It is not just describing a scene; it is planning a precise trajectory through it. The architecture utilizes a Cosmos Reason Backbone (distilled) to generate a chain of causation (reasoning trace) before outputting actions.

Key features of the Alpamayo integration in TensorRT Edge-LLM include:

Flow matching trajectory decoding: Moving beyond simple regression, flow matching is used to generate diverse, high-fidelity future trajectories.
History and context: The model tokenizes two-second historical trajectories and multicamera inputs, processing them through a Qwen3-VL backbone to output explainable driving decisions. For example, “Nudge to the left to increase clearance.”
Performance: On DRIVE Thor, Alpamayo 1 achieves production-viable latencies, using FP8 acceleration for the Vision Transformer (ViT) components.

Get started with TensorRT Edge-LLM for physical AI

TensorRT Edge-LLM serves as the go-to-open-source, pure C++ inference runtime designed specifically for the mission-critical needs of automotive and robotics. It eliminates Python dependencies for deployment, ensuring predictable memory footprints.

From deploying the efficient expert routing of Qwen3 MoE today, to preparing for the future distilled reasoning of Alpamayo 1, NVIDIA provides the essential runtime to build the next generation of autonomous machines.

To get started, explore the new features, including the Alpamayo and MoE examples, in the updated TensorRT Edge-LLM GitHub repo or through the latest NVIDIA DriveOS releases.

Discuss (0)

About the Authors

About Lin Chai
Lin Chai is a senior product manager at NVIDIA, leading TensorRT and TensorRT Edge-LLM, NVIDIA’s AI inference platforms for deep learning across datacenter and embedded platforms. Drawing on her background in autonomous driving and automotive OEMs, she is inspired to build production-grade inference systems that deliver best-in-class performance for deep learning workloads across data center, edge, and physical AI applications—enabling systems that perceive, reason, and act in the real world.

View all posts by Lin Chai

About Luxiao Zheng
Luxiao Zheng is a senior systems software engineer at NVIDIA. He works on the TensorRT general performance team with a specialization in Large Language Model inference workflow. He works on end-to-end LLM software development, performance measurements, analysis and improvements for x86_64 and aarch64 platforms. Luxiao holds a M.S. in Computer Science, a B.S. in Computer Science and a B.S. in Chemical Engineering from Washington University in St. Louis.

View all posts by Luxiao Zheng

About Fan Shi
Fan Shi is a senior system software engineer on the NVIDIA TensorRT team, specializing in the efficient deployment of advanced AI models on edge platforms. His work focuses on optimizing performance and usability in deep learning inference. Fan holds an M.S. in computational data science from Carnegie Mellon University and a B.S. in statistics and computer science from the University of Illinois.

View all posts by Fan Shi

About Maximilien Breughe
Maximilien Breughe is an engineering leader and software engineer at NVIDIA, where he works on AI inference systems and edge AI technologies. He has a background in deep learning libraries and performance engineering, and holds a PhD in Computer Architecture focused on performance simulation techniques. Maximilien is especially interested in building practical, high-performance AI systems that bridge research and real-world deployment.

View all posts by Maximilien Breughe

About Michael Ferry
Michael Ferry is a software engineering manager on the NVIDIA TensorRT team, where he leads the TensorRT Edge-LLM, Automotive Safety, and New Platforms teams. His work centers on optimized, reliable AI inference for safety-critical robotics and automotive edge systems. Before joining NVIDIA in 2018, Michael created and led several floating-point-focused verification tools at Intel. He holds a PhD in Mathematics, specializing in numerical optimization, from the University of California, San Diego.

View all posts by Michael Ferry