Streamline LLM Deployment for Autonomous Vehicle Applications with NVIDIA DriveOS LLM SDK

Large language models (LLMs) have shown remarkable generalization capabilities in natural language processing (NLP). They are used in a wide range of applications, including translation, digital assistants, recommendation systems, context analysis, code generation, cybersecurity, and more. In automotive applications, there is growing demand for LLM-based solutions for both autonomous driving and in-cabin features. Deploying LLMs and vision language models (VLMs) on automotive platforms, which are typically resource-constrained, has become a critical challenge. 

This post introduces the NVIDIA DriveOS LLM SDK, a library designed to optimize the inference of state-of-the-art LLMs and VLMs on the DRIVE AGX platform for autonomous vehicles. It is a lightweight toolkit built on top of the NVIDIA TensorRT inference engine, incorporating LLM-specific optimizations such as custom attention kernels and quantization techniques to deploy LLMs on automotive platforms. 

The toolkit provides easy-to-use C++ libraries and example code to export models, build TensorRT engines, perform inference, and benchmark LLMs in a complete end-to-end workflow. This post walks you through the key components of the SDK, its supported models, and the deployment workflow. 

Key components of the NVIDIA DriveOS LLM SDK

The DriveOS LLM SDK includes several key components that enable efficient LLM inference and deployment on automotive platforms:

  • Plugin library: LLMs require specialized plugins for advanced capabilities and optimized performance. The DriveOS LLM SDK includes these custom plugins, along with a set of kernels to handle context-dependent components such as rotary positional embedding, multihead attention, and KV-cache management. AttentionPlugin also supports dynamic batch sizes and dynamic input sequence lengths.
  • Tokenizer/detokenizer: The SDK offers an efficient tokenizer/detokenizer for LLM inference, following the Llama-style byte pair encoding (BPE) tokenizer with regex matching. This module converts multimodal user inputs (text or images, for example) into a stream of tokens, enabling seamless integration across different data types.
  • Sampler: The sampler is crucial for tasks like text generation, translation, and dialogue, as it controls how the model selects tokens during inference. The DriveOS LLM SDK implements a CUDA-based sampler that optimizes this process. To balance inference efficiency and output diversity, the sampler uses a single-beam sampling approach with a Top-K option. This method delivers quick yet reasonably diverse output without the computational expense of exploring multiple beams, which matters for automotive applications where latency and efficiency are critical (a minimal sketch follows this list). 
  • Decoder: During LLM inference, the decoder module generates text or sequences by iteratively producing tokens based on the model’s predictions. The DriveOS LLM SDK provides a flexible decoding loop that supports static batch sizes, padded input sequences, and generation towards the longest sequence in the batch.
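
To illustrate the kind of token selection the sampler performs, below is a minimal Python sketch of single-beam Top-K sampling over one logits vector. It is an illustration of the general technique only, not the SDK's CUDA sampler or its API.

import numpy as np

def sample_top_k(logits: np.ndarray, k: int = 50, temperature: float = 1.0) -> int:
    """Single-beam Top-K sampling over one logits vector (illustrative only)."""
    # Keep only the k highest-scoring token IDs.
    top_ids = np.argpartition(logits, -k)[-k:]
    scaled = logits[top_ids] / temperature
    # Convert the retained logits into a probability distribution (softmax).
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Draw a single token: one beam keeps latency low, Top-K adds diversity.
    return int(np.random.choice(top_ids, p=probs))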

Together, these components enable flexible, lightweight, high-performance LLM deployment and customization on multiple NVIDIA DRIVE platforms (Figure 1).

Figure 1. DriveOS LLM SDK major components and architecture plan 

Supported models, precision formats, and platforms

The DriveOS LLM SDK supports a range of state-of-the-art LLMs on DRIVE platforms, including NVIDIA DRIVE AGX Orin and NVIDIA DRIVE AGX Thor. As a preview feature, the SDK can also run on x86 systems, which is useful for development purposes. Currently supported models include the following, with additional models expected in the future:

  • Llama 3 8B Instruct
  • Llama 3.1 8B
  • Llama 3.2 3B
  • Qwen2.5 7B Instruct
  • Qwen2 7B Instruct
  • Qwen2 VL

The SDK supports multiple precision formats to enable large LLMs on different platforms, including FP16, FP8, NVFP4, and INT4. For INT4 (W4A16) precision, model weights are quantized to INT4 using the AWQ recipe, with computations performed in FP16. This approach significantly reduces memory usage. The SDK also supports FP8 (W8A8) precision with TensorRT versions greater than 10.4, and NVFP4 precision with TensorRT versions greater than 10.8, on NVIDIA DRIVE AGX Thor platforms. 

These precisions can further reduce the memory footprint during LLM inference while enhancing kernel performance. In this configuration, weights and GEMM operations are in FP8 or NVFP4 format, while LayerNorm, KV cache, LM head, and attention layers remain in FP16. Overall, the DriveOS LLM SDK is designed to efficiently support a wide range of LLMs with multimodal inputs and a variety of precision formats across multiple platforms.
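
As a rough illustration of the W4A16 idea described above (a simplified sketch, not the exact AWQ recipe the SDK applies), the following Python snippet quantizes a group of weights to a symmetric 4-bit range with a per-group scale and dequantizes them back to FP16 for compute.

import numpy as np

def quantize_w4a16_group(weights: np.ndarray):
    """Toy symmetric per-group INT4 weight quantization; activations stay in FP16."""
    # Symmetric 4-bit integers span [-8, 7]; derive one scale per weight group.
    scale = max(float(np.abs(weights).max()) / 7.0, 1e-8)
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # packed as INT4 in practice
    return q, np.float16(scale)

def dequantize_to_fp16(q: np.ndarray, scale: np.float16) -> np.ndarray:
    # Weights are expanded back to FP16 just before the GEMM runs.
    return q.astype(np.float16) * scale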

Workflow for LLM deployment

LLM deployment is typically a complex process that requires substantial engineering effort, particularly on edge devices. The DriveOS LLM SDK simplifies deploying LLMs on the DRIVE platform by streamlining the workflow into two straightforward steps: exporting the ONNX model and building the TensorRT engine (Figure 2). This process closely mirrors the standard procedure for deploying deep learning models with TensorRT.

Figure 2. Steps to deploy an LLM using DriveOS LLM SDK

Quantization plays a crucial role in optimizing LLM deployment, particularly for resource-constrained platforms. It can significantly improve the efficiency and scalability of LLMs. The DriveOS LLM SDK addresses this need by offering multiple quantization options during the ONNX model export phase, which can be easily invoked with one command:

python3 llm_export.py --torch_dir $TORCH_DIR --dtype [fp16|fp8|int4] --output_dir $ONNX_DIR

This command converts an LLM from Hugging Face format into an ONNX model with the specified quantized precision. This step should be performed on x86 data center GPUs to avoid out-of-memory (OOM) issues. 

After the model has been exported to ONNX, the llm_build binary can be used to create the corresponding TensorRT engine. The build process is agnostic to the specific model or precision, as the IO interface remains standardized across all ONNX models. Build the engine on the DRIVE platform with the following command:

./build/examples/llm/llm_build --onnxPath=model.onnx --enginePath=model.engine --batchSize=B --maxInputLen=N --maxSeqLen=M

The SDK also includes a cross-compilation build system that enables compiling AArch64 targets on x86 machines. This functionality accelerates deployment and simplifies feature verification on edge compute platforms.

In addition to its user-friendly deployment process, the DriveOS LLM SDK provides a variety of C++ code examples for end-to-end LLM inference, performance benchmarking, and live chat implementations. These examples enable developers to evaluate the accuracy and performance of different models on DRIVE platforms, using static batch sizes and input/output sequence lengths, or to customize their own applications. 

To run an LLM chatbot with the C++ code provided by the SDK, use the following sample command:

./build/examples/llm/llm_chat --tokenizerPath=llama-v3-8b-instruct-hf/ --enginePath=llama3_fp16.engine --maxLength=64

The overall inference pipeline for this command is shown in Figure 3, with the DriveOS LLM SDK-related components in blue blocks.

Figure 3. Pipeline using DriveOS LLM SDK for inference
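
For a sense of how these blocks fit together, here is a schematic Python sketch of the Figure 3 loop. The tokenizer, engine, and sampler objects are hypothetical placeholders standing in for the SDK's C++ components, not its actual interfaces.

def generate(prompt, tokenizer, engine, sampler, max_length=64):
    """Schematic decode loop mirroring Figure 3 (placeholder objects, not SDK APIs)."""
    tokens = tokenizer.encode(prompt)          # text -> token IDs
    for _ in range(max_length):
        logits = engine.infer(tokens)          # TensorRT engine with attention/KV-cache plugins
        next_token = sampler.sample(logits)    # single-beam Top-K selection
        tokens.append(next_token)
        if next_token == tokenizer.eos_id:     # stop at end-of-sequence
            break
    return tokenizer.decode(tokens)            # token IDs -> text output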

Multimodal LLM deployment

Unlike traditional LLMs, language models used in automotive applications often require multimodal inputs, such as camera images and text. The DriveOS LLM SDK addresses these needs by providing specialized inference pipelines and modules designed for state-of-the-art VLMs. 

Currently, the SDK supports the Qwen2 VL model, with a C++ implementation of the image preprocessor, based on the official Qwen2 VL GitHub repository. This module efficiently loads images, resizes them, divides them into small patches (with merging), normalizes pixel values, and stores the patches in a temporal format that aligns with the language model.
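
To make those steps concrete, the following is a simplified Python sketch of a normalize-and-patchify flow; it assumes the image has already been resized, and the patch size, target resolution, and normalization constants are placeholder assumptions, so the SDK's C++ preprocessor (and the exact Qwen2 VL recipe) differ in detail.

import numpy as np

def preprocess_image(image: np.ndarray, size: int = 448, patch: int = 14) -> np.ndarray:
    """Toy normalize-and-patchify flow for a vision encoder (illustrative only)."""
    # Assume the image was already resized to (size, size, 3) with uint8 values.
    img = image.astype(np.float32) / 255.0
    # Per-channel normalization; these mean/std values are placeholders.
    mean = np.array([0.48, 0.46, 0.41], dtype=np.float32)
    std = np.array([0.27, 0.26, 0.28], dtype=np.float32)
    img = (img - mean) / std
    # Split into non-overlapping patches and flatten each patch into one vector.
    n = size // patch
    patches = img.reshape(n, patch, n, patch, 3).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n * n, patch * patch * 3)   # (num_patches, patch_dim)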

To deploy a multimodal LLM, the vision encoder and the language model must each be exported and built into separate engines. To streamline this process, the DriveOS LLM SDK offers Python scripts and C++ utilities that simplify the TensorRT engine build with standardized steps.

Summary

The NVIDIA DriveOS LLM SDK streamlines the deployment of LLMs and VLMs on the DRIVE platform. By leveraging the powerful NVIDIA TensorRT inference engine along with LLM-specific optimization techniques such as quantization, cutting-edge LLMs and VLMs can be deployed with ease on the DRIVE platform. This SDK serves as a foundation for deploying powerful LLMs in production environments, ultimately enhancing the performance of AI-driven applications.

Learn more about NVIDIA DRIVE solutions for autonomous vehicles.
