Accelerate AI With NVIDIA RTX PCs
NVIDIA RTX™ PCs accelerate your AI features for maximum performance and the lowest latency. NVIDIA provides broad support across all major AI inference backends to meet every developer's needs.
Overview of AI Inference Backends
Developers need to weigh several factors before choosing a deployment ecosystem and path for their application. Each inference backend maps model execution to hardware and provides its own model optimization tools and deployment mechanisms for efficient application integration, with the top options optimized for NVIDIA RTX GPUs. Achieving peak AI performance requires model optimization techniques such as quantization and pruning, while higher-level interfaces streamline application packaging, installation, and integration.
Who is it for?
For developers who want to deploy performant, cross-vendor apps on Windows.
Inferencing Backends
ONNX Runtime, in conjunction with the DirectML backend, is a cross-platform machine-learning model accelerator for Windows, allowing access to hardware-specific optimizations.
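As a minimal sketch of what this looks like in practice (assuming the onnxruntime-directml Python package and a hypothetical model.onnx with a single float32 image input), a session can request the DirectML execution provider with a CPU fallback:

```python
# Minimal sketch: run an ONNX model through ONNX Runtime's DirectML execution provider.
# "model.onnx" and its input shape are hypothetical placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # DirectML first, CPU fallback
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example NCHW image batch
outputs = session.run(None, {input_name: dummy_input})
print("Output shape:", outputs[0].shape)
```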
For AI Models—Get Started With DirectML AI Inferencing
For Generative AI—Get Started With ONNX Runtime GenAI Inferencing
Model Optimization
The Olive optimization toolkit offers quantization across CPUs, NPUs, and NVIDIA RTX GPUs, with easy integration into the ONNX Runtime and DirectML inferencing backend. You can also use TensorRT Model Optimizer to quantize ONNX models.
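As a simple illustration of post-training quantization for an ONNX model, the sketch below uses ONNX Runtime's built-in dynamic quantization utility as a stand-in for a full Olive or TensorRT Model Optimizer workflow; the file names are placeholders:

```python
# Minimal sketch of ONNX post-training (dynamic) quantization, using ONNX Runtime's
# quantization utility as a stand-in for an Olive pipeline. File names are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",   # original FP32 model
    model_output="model_int8.onnx",  # output with weights quantized to INT8
    weight_type=QuantType.QInt8,
)
```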
Get Started With Olive
Get Started With TensorRT Model Optimizer
Deployment Mechanisms
Packaging and deploying ONNX Runtime apps on PCs is simple. DirectML comes pre-installed in Windows. All you need to do is ship your model and, for LLMs, the ONNX Runtime GenAI SDK.
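As a rough sketch of what generation with the GenAI SDK looks like from Python (assuming a recent onnxruntime-genai release; method names have shifted between versions, and the model folder path is a placeholder):

```python
# Rough sketch of LLM generation with the ONNX Runtime GenAI SDK.
# Assumes a recent onnxruntime-genai release; the API has evolved across versions.
import onnxruntime_genai as og

model = og.Model("path/to/onnx-llm-folder")  # placeholder model directory
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is DirectML?"))

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```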
Get Started With an End-to-End Sample
Introduction to ONNX Runtime
Watch Video (8:12)
ONNXRuntime-GenAI Installation and Inference Walkthrough
Watch Video (6:00)
Who is it for?
For LLM developers who want wide reach with cross-vendor and cross-OS support.
Inferencing Backends
Llama.cpp enables LLM-only inference across a variety of devices and platforms through unified APIs. It requires minimal setup, delivers good performance, and ships as a lightweight package. Llama.cpp is developed and maintained by a large open-source community and supports a wide range of LLMs.
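One way to try this from Python is the community-maintained llama-cpp-python binding, which wraps Llama.cpp for in-process use; a minimal sketch (the GGUF path is a placeholder, and n_gpu_layers=-1 assumes a CUDA-enabled build):

```python
# Minimal sketch of in-process llama.cpp inference via the llama-cpp-python binding.
# The GGUF file path is a hypothetical placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU (requires a CUDA-enabled build)
    n_ctx=4096,       # context window size
)

result = llm("Q: What is an inference backend? A:", max_tokens=128, stop=["Q:"])
print(result["choices"][0]["text"])
```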
Get Started With Llama.cpp
Model Optimization
Llama.cpp natively offers an optimized model format, GGUF, which enables strong model performance and lightweight deployment. GGUF uses quantization techniques to reduce a model's size and compute requirements so it can run across a variety of platforms.
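A minimal sketch of producing a quantized GGUF, assuming a local llama.cpp build whose llama-quantize tool is on PATH (file names and the Q4_K_M quantization type are illustrative):

```python
# Minimal sketch: invoke llama.cpp's llama-quantize tool to produce a quantized GGUF.
# Assumes llama-quantize is on PATH; file names are hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "llama-quantize",
        "model-f16.gguf",     # full-precision GGUF converted from the original checkpoint
        "model-Q4_K_M.gguf",  # quantized output
        "Q4_K_M",             # 4-bit quantization type
    ],
    check=True,
)
```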
Get Started With Llama.cpp Model Quantization
Deployment Mechanisms
With Llama.cpp, you can deploy in an out-of-process format, with a server running on localhost; apps communicate with this server through a REST API. Popular tools for this include Cortex, Ollama, and LMStudio. For in-process execution, Llama.cpp must instead be linked into your app as a .lib or .dll.
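For example, an app talking to a local Ollama server could look like the sketch below (assuming Ollama is running on its default port 11434 and the named model has already been pulled):

```python
# Minimal sketch: call a local Ollama server over its REST API.
# Assumes Ollama is running on the default port and the "llama3.1" model has been pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Explain GGUF in one sentence.", "stream": False},
    timeout=120,
)
print(response.json()["response"])
```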
Get Started With Ollama
Get Started With LMStudio
Get Started With Cortex
Get Started With In-Process Execution
Who is it for?
For developers looking for the latest features and maximum performance on NVIDIA RTX GPUs.
Inferencing Backends
NVIDIA® TensorRT™ delivers maximum-performance deep learning inference on NVIDIA RTX GPUs, with GPU-specific TensorRT engines that extract every last bit of performance from the hardware.
Optimize Your Models
To optimize models within the TensorRT ecosystem, developers can use TensorRT Model Optimizer. This unified library offers state-of-the-art model optimization techniques, such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT to optimize inference speed on NVIDIA GPUs.
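A minimal sketch of post-training INT8 quantization with the nvidia-modelopt package (the toy model and random calibration batches are placeholders for a real network and representative data):

```python
# Minimal sketch: post-training INT8 quantization with TensorRT Model Optimizer (nvidia-modelopt).
# The toy model and random calibration batches are placeholders for real workloads.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
calib_batches = [torch.randn(32, 128, device="cuda") for _ in range(8)]

def forward_loop(m):
    # Calibration pass: run representative batches so activation ranges can be collected.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

quantized_model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```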
Get Started With TensorRT Model Optimizer
Deployment Mechanisms
Deploying TensorRT models requires three things: TensorRT, a TensorRT-optimized model, and a TensorRT engine.
TensorRT engines can be generated ahead of time or built within your app using timing caches.
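A minimal sketch of building an engine from an ONNX model while reusing a timing cache, assuming a recent TensorRT 10.x Python API (file names are placeholders):

```python
# Minimal sketch: build a TensorRT engine from ONNX and reuse a timing cache to speed up
# on-device engine generation. Assumes TensorRT 10.x; file names are placeholders.
import os
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Load a previously saved timing cache if one exists, otherwise start from an empty one.
cache_bytes = open("timing.cache", "rb").read() if os.path.exists("timing.cache") else b""
timing_cache = config.create_timing_cache(cache_bytes)
config.set_timing_cache(timing_cache, False)  # False: don't ignore device mismatches

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)

# Persist the (possibly updated) timing cache for the next build.
with open("timing.cache", "wb") as f:
    f.write(config.get_timing_cache().serialize())
```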
Who is it for?
For developers looking to experiment with and evaluate AI while maintaining cohesion with model training pipelines.
Inferencing Backends
PyTorch is a popular open-source machine learning library that offers cross-platform and cross-device inferencing options.
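A minimal, device-agnostic inference sketch (the toy model stands in for a real network), selecting CUDA when an NVIDIA GPU is available:

```python
# Minimal sketch: device-agnostic PyTorch inference with a toy model as a placeholder.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"  # use the RTX GPU when present

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device).eval()

with torch.no_grad():  # inference only, no gradient tracking
    batch = torch.randn(4, 128, device=device)
    logits = model(batch)

print(logits.shape)  # torch.Size([4, 10])
```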
Model Optimization
PyTorch offers several leading algorithms for model quantization, ranging from quantization-aware training (QAT) to post-training quantization (PTQ), as well as sparsity for in-framework model optimization.
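As one small example of in-framework post-training quantization, the sketch below applies dynamic weight quantization to Linear layers (this path targets CPU inference; the toy model is a placeholder):

```python
# Minimal sketch: post-training dynamic quantization of Linear layers in PyTorch.
# Dynamic quantization targets CPU inference; the toy model is a placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

quantized_model = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # quantize only Linear layers
    dtype=torch.qint8,  # 8-bit integer weights
)

with torch.no_grad():
    print(quantized_model(torch.randn(1, 128)).shape)
```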
Deployment Mechanisms
To serve models in production applications with PyTorch, developers often deploy in an out-of-process format. This requires building Python packages, generating model files, and standing up a localhost server. Frameworks such as TorchServe and Hugging Face Accelerate streamline this process.
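For instance, once a model archive is registered with a locally running TorchServe instance, an app can call its REST inference endpoint; a minimal sketch (the model name, payload file, and default port 8080 are assumptions about your setup):

```python
# Minimal sketch: query a locally running TorchServe instance over its REST inference API.
# Assumes a model named "my_model" is registered; the name, payload, and port are placeholders.
import requests

with open("example_input.json", "rb") as f:  # hypothetical request payload
    payload = f.read()

response = requests.post(
    "http://localhost:8080/predictions/my_model",
    data=payload,
    headers={"Content-Type": "application/json"},
    timeout=60,
)
print(response.json())
```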
Get Started With TorchServe
Get Started With Hugging Face Accelerate
Choosing an Inferencing Backend
| | ONNX Runtime With DirectML | TensorRT and TensorRT-LLM | Llama.cpp | PyTorch-CUDA |
|---|---|---|---|---|
| Performance | Faster | Fastest | Fast | Good |
| OS Support | Windows | Windows and Linux (TensorRT-LLM is Linux only) | Windows, Linux, and Mac | Windows and Linux |
| Hardware Support | Any GPU or CPU | NVIDIA RTX GPUs | Any GPU or CPU | Any GPU or CPU |
| Model Checkpoint Format | ONNX | TRT | GGUF or GGML | PyT |
| Installation Process | Pre-installed on Windows | Installation of Python packages required | Installation of Python packages required | Installation of Python packages required |
| LLM Support | ✔️ | ✔️ | ✔️ | ✔️ |
| CNN Support | ✔️ | ✔️ | - | ✔️ |
| Device-Specific Optimizations | Microsoft Olive | TensorRT Model Optimizer | Llama.cpp | - |
| Python | ✔️ | ✔️ | ✔️ | ✔️ |
| C/C++ | ✔️ | ✔️ | ✔️ | ✔️ |
| C#/.NET | ✔️ | - | ✔️ | - |
| JavaScript | ✔️ | - | ✔️ | - |
Latest NVIDIA News
Stay up to date on how to power your AI apps with NVIDIA RTX PCs.