Streamline LLM Deployment for Autonomous Vehicle Applications with NVIDIA DriveOS LLM SDK

Large language models (LLMs) have shown remarkable generalization capabilities in natural language processing (NLP). They are used in a wide range of applications, including translation, digital assistants, recommendation systems, context analysis, code generation, cybersecurity, and more. In automotive applications, there is growing demand for LLM-based solutions for both autonomous driving and in-cabin features. Deploying LLMs and vision language models (VLMs) on automotive platforms, which are typically resource-constrained, has become a critical challenge. 

This post introduces the NVIDIA DriveOS LLM SDK, a library designed to optimize the inference of state-of-the-art LLMs and VLMs on the DRIVE AGX platform for autonomous vehicles. It is a lightweight toolkit built on top of the NVIDIA TensorRT inference engine, incorporating LLM-specific optimizations such as custom attention kernels and quantization techniques to deploy LLMs on automotive platforms. 

The toolkit provides easy-to-use C++ libraries and example code to export models, build TensorRT engines, perform inference, and benchmark LLMs in a complete end-to-end workflow. This post walks you through the key components of the SDK, its supported models, and the deployment workflow. 

Key components of the NVIDIA DriveOS LLM SDK

The DriveOS LLM SDK includes several key components that enable efficient LLM inference and deployment on automotive platforms:

  • Plugin library: LLMs require specialized plugins for advanced capabilities and optimized performance. The DriveOS LLM SDK includes these custom plugins, along with a set of kernels to handle context-dependent components such as rotary positional embedding, multihead attention, and KV-cache management. AttentionPlugin also supports dynamic batch sizes and dynamic input sequence lengths.
  • Tokenizer/detokenizer: The SDK offers an efficient tokenizer/detokenizer for LLM inference, following the Llama-style byte pair encoding (BPE) tokenizer with regex matching. This module converts multimodal user inputs (text or images, for example) into a stream of tokens, enabling seamless integration across different data types.
  • Sampler: The sampler is crucial for tasks like text generation, translation, and dialogue, as it controls how the model selects tokens during inference. The DriveOS LLM SDK implements a CUDA-based sampler that optimizes this process. To balance inference efficiency and output diversity, the sampler uses a single-beam sampling approach with a Top-K option. This method delivers quick yet reasonably diverse output without the computational expense of exploring multiple beams, which matters for automotive applications where latency and efficiency are critical (a minimal sketch follows this list). 
  • Decoder: During LLM inference, the decoder module generates text or sequences by iteratively producing tokens based on the model’s predictions. The DriveOS LLM SDK provides a flexible decoding loop that supports static batch sizes, padded input sequences, and generation towards the longest sequence in the batch.
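
To illustrate the kind of token selection the sampler performs, below is a minimal Python sketch of single-beam Top-K sampling over one logits vector. It is an illustration of the general technique only, not the SDK's CUDA sampler or its API.

import numpy as np

def sample_top_k(logits: np.ndarray, k: int = 50, temperature: float = 1.0) -> int:
    """Single-beam Top-K sampling over one logits vector (illustrative only)."""
    # Keep only the k highest-scoring token IDs.
    top_ids = np.argpartition(logits, -k)[-k:]
    scaled = logits[top_ids] / temperature
    # Convert the retained logits into a probability distribution (softmax).
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Draw a single token: one beam keeps latency low, Top-K adds diversity.
    return int(np.random.choice(top_ids, p=probs))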

Together, these components enable flexible, lightweight, high-performance LLM deployment and customization on multiple NVIDIA DRIVE platforms (Figure 1).

Figure 1. DriveOS LLM SDK major components and architecture plan 

Supported models, precision formats, and platforms

The DriveOS LLM SDK supports a range of state-of-the-art LLMs on DRIVE platforms, including NVIDIA DRIVE AGX Orin and NVIDIA DRIVE AGX Thor. As a preview feature, the SDK can also run on x86 systems, which is useful for development purposes. Currently supported models include the following, with additional models expected in the future:

  • Llama 3 8B Instruct
  • Llama 3.1 8B
  • Llama 3.2 3B
  • Qwen2.5 7B Instruct
  • Qwen2 7B Instruct
  • Qwen2 VL

The SDK supports multiple precision formats to enable large LLMs on different platforms, including FP16, FP8, NVFP4, and INT4. For INT4 (W4A16) precision, model weights are quantized to INT4 using the AWQ recipe, with computations performed in FP16. This approach significantly reduces memory usage. The SDK also supports FP8 (W8A8) precision with TensorRT versions greater than 10.4, and NVFP4 precision with TensorRT versions greater than 10.8, on NVIDIA DRIVE AGX Thor platforms. 

These precisions can further reduce the memory footprint during LLM inference while enhancing kernel performance. In this configuration, weights and GEMM operations are in FP8 or NVFP4 format, while LayerNorm, KV cache, LM head, and attention layers remain in FP16. Overall, the DriveOS LLM SDK is designed to efficiently support a wide range of LLMs with multimodal inputs and a variety of precision formats across multiple platforms.
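
As a rough illustration of the W4A16 idea described above (a simplified sketch, not the exact AWQ recipe the SDK applies), the following Python snippet quantizes a group of weights to a symmetric 4-bit range with a per-group scale and dequantizes them back to FP16 for compute.

import numpy as np

def quantize_w4a16_group(weights: np.ndarray):
    """Toy symmetric per-group INT4 weight quantization; activations stay in FP16."""
    # Symmetric 4-bit integers span [-8, 7]; derive one scale per weight group.
    scale = max(float(np.abs(weights).max()) / 7.0, 1e-8)
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # packed as INT4 in practice
    return q, np.float16(scale)

def dequantize_to_fp16(q: np.ndarray, scale: np.float16) -> np.ndarray:
    # Weights are expanded back to FP16 just before the GEMM runs.
    return q.astype(np.float16) * scale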

Workflow for LLM deployment

LLM deployment is typically a complex process that requires substantial engineering effort, particularly on edge devices. The DriveOS LLM SDK simplifies deploying LLMs on the DRIVE platform by streamlining the workflow into two straightforward steps: exporting the ONNX model and building the TensorRT engine (Figure 2). This process closely mirrors the standard procedure for deploying deep learning models with TensorRT.

Figure 2. Steps to deploy an LLM using DriveOS LLM SDK

Quantization plays a crucial role in optimizing LLM deployment, particularly for resource-constrained platforms. It can significantly improve the efficiency and scalability of LLMs. The DriveOS LLM SDK addresses this need by offering multiple quantization options during the ONNX model export phase, which can be easily invoked with one command:

python3 llm_export.py --torch_dir $TORCH_DIR --dtype [fp16|fp8|int4] --output_dir $ONNX_DIR

This command converts an LLM from Hugging Face format into an ONNX model with the specified quantized precision. This step should be performed on x86 data center GPUs to avoid out-of-memory (OOM) issues. 

After the model has been exported to ONNX, the llm_build binary can be used to create the corresponding TensorRT engine. The build process is agnostic to the specific model or precision, as the IO interface remains standardized across all ONNX models. Build the engine on the DRIVE platform with the following command:

./build/examples/llm/llm_build --onnxPath=model.onnx --enginePath=model.engine --batchSize=B --maxInputLen=N --maxSeqLen=M

The SDK also includes a cross-compilation build system that enables compiling AArch64 targets on x86 machines. This functionality accelerates deployment and simplifies feature verification on edge compute platforms.

In addition to its user-friendly deployment process, the DriveOS LLM SDK provides a variety of C++ code examples for end-to-end LLM inference, performance benchmarking, and live chat implementations. These examples enable developers to evaluate the accuracy and performance of different models on DRIVE platforms, using static batch sizes and input/output sequence lengths, or to customize their own applications. 

To run an LLM chatbot with the C++ code provided by the SDK, use the following sample command:

./build/examples/llm/llm_chat --tokenizerPath=llama-v3-8b-instruct-hf/ --enginePath=llama3_fp16.engine --maxLength=64

The overall inference pipeline for this command is shown in Figure 3, with the DriveOS LLM SDK-related components in blue blocks.

Figure 3. Pipeline using DriveOS LLM SDK for inference
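
For a sense of how these blocks fit together, here is a schematic Python sketch of the Figure 3 loop. The tokenizer, engine, and sampler objects are hypothetical placeholders standing in for the SDK's C++ components, not its actual interfaces.

def generate(prompt, tokenizer, engine, sampler, max_length=64):
    """Schematic decode loop mirroring Figure 3 (placeholder objects, not SDK APIs)."""
    tokens = tokenizer.encode(prompt)          # text -> token IDs
    for _ in range(max_length):
        logits = engine.infer(tokens)          # TensorRT engine with attention/KV-cache plugins
        next_token = sampler.sample(logits)    # single-beam Top-K selection
        tokens.append(next_token)
        if next_token == tokenizer.eos_id:     # stop at end-of-sequence
            break
    return tokenizer.decode(tokens)            # token IDs -> text output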

Multimodal LLM deployment

Unlike traditional LLMs, language models used in automotive applications often require multimodal inputs, such as camera images and text. The DriveOS LLM SDK addresses these needs by providing specialized inference pipelines and modules designed for state-of-the-art VLMs. 

Currently, the SDK supports the Qwen2 VL model, with a C++ implementation of the image preprocessor, based on the official Qwen2 VL GitHub repository. This module efficiently loads images, resizes them, divides them into small patches (with merging), normalizes pixel values, and stores the patches in a temporal format that aligns with the language model.
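
To make those steps concrete, the following is a simplified Python sketch of a normalize-and-patchify flow; it assumes the image has already been resized, and the patch size, target resolution, and normalization constants are placeholder assumptions, so the SDK's C++ preprocessor (and the exact Qwen2 VL recipe) differ in detail.

import numpy as np

def preprocess_image(image: np.ndarray, size: int = 448, patch: int = 14) -> np.ndarray:
    """Toy normalize-and-patchify flow for a vision encoder (illustrative only)."""
    # Assume the image was already resized to (size, size, 3) with uint8 values.
    img = image.astype(np.float32) / 255.0
    # Per-channel normalization; these mean/std values are placeholders.
    mean = np.array([0.48, 0.46, 0.41], dtype=np.float32)
    std = np.array([0.27, 0.26, 0.28], dtype=np.float32)
    img = (img - mean) / std
    # Split into non-overlapping patches and flatten each patch into one vector.
    n = size // patch
    patches = img.reshape(n, patch, n, patch, 3).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n * n, patch * patch * 3)   # (num_patches, patch_dim)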

To deploy a multimodal LLM, the vision encoder and the language model must each be exported and built into separate engines. To streamline this process, the DriveOS LLM SDK offers Python scripts and C++ utilities that simplify the TensorRT engine build with standardized steps.

Summary

The NVIDIA DriveOS LLM SDK streamlines the deployment of LLMs and VLMs on the DRIVE platform. By leveraging the powerful NVIDIA TensorRT inference engine along with LLM-specific optimization techniques such as quantization, cutting-edge LLMs and VLMs can be deployed with ease on the DRIVE platform. This SDK serves as a foundation for deploying powerful LLMs in production environments, ultimately enhancing the performance of AI-driven applications.

Learn more about NVIDIA DRIVE solutions for autonomous vehicles.
