Accelerate AI With NVIDIA RTX PCs

NVIDIA RTX™ PCs accelerate your AI features for maximum performance and lowest latency. NVIDIA offers broad support across all major AI inference backends to meet every developer’s needs.


Overview of AI Inference Backends

Developers need to consider several factors before choosing a deployment ecosystem and path for their application. An inference backend maps model execution to hardware, and the leading options are optimized for NVIDIA RTX GPUs. Each backend offers its own model optimization tools and deployment mechanisms for efficient application integration. Achieving peak AI performance requires model optimization techniques like quantization and pruning, while higher-level interfaces streamline application packaging, installation, and integration.

Who is it for?

For developers who want to deploy performant, cross-vendor apps across Windows OS.

Inferencing Backends

ONNX Runtime, in conjunction with the DirectML backend, is a cross-platform machine-learning model accelerator; on Windows, it gives applications access to hardware-specific optimizations across GPUs from any vendor.

For AI Models—Get Started With DirectML AI Inferencing
For Generative AI—Get Started With ONNX Runtime GenAI Inferencing
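
As a minimal sketch of what DirectML inferencing looks like from Python, the snippet below loads an ONNX model with the onnxruntime-directml package and runs it on the DirectML execution provider. The model path, input name, and input shape are placeholders for your own model.

    # Minimal ONNX Runtime + DirectML sketch (requires the onnxruntime-directml package).
    # "model.onnx" and the dummy input shape are placeholders for your own model.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(
        "model.onnx",
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # DirectML first, CPU fallback
    )

    input_name = session.get_inputs()[0].name
    dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

    outputs = session.run(None, {input_name: dummy_input})
    print(outputs[0].shape)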

Model Optimization

The Olive optimization toolkit offers quantization across CPUs, NPUs, and NVIDIA RTX GPUs, with easy integration into the ONNX Runtime with DirectML inferencing backend. You can also use TensorRT Model Optimizer to perform quantization for ONNX models.

Get Started With Olive
Get Started with TensorRT Model Optimizer
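
Olive is typically driven by a workflow configuration, and TensorRT Model Optimizer has its own ONNX quantization flow. As a simpler, hedged illustration of the underlying idea, the sketch below applies ONNX Runtime's built-in post-training dynamic quantization to shrink an ONNX model's weights to INT8; the file names are placeholders.

    # Post-training dynamic quantization with ONNX Runtime's built-in quantizer.
    # This illustrates the concept only; Olive and TensorRT Model Optimizer
    # provide more advanced, hardware-aware quantization recipes.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        "model_fp32.onnx",            # placeholder: original FP32 model
        "model_int8.onnx",            # placeholder: quantized output
        weight_type=QuantType.QInt8,  # quantize weights to 8-bit integers
    )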

Deployment Mechanisms

Packaging and deploying ONNX Runtime apps on PCs is simple. DirectML comes pre-installed in Windows. All you need to do is ship your model and, for LLMs, the ONNX Runtime GenAI SDK.

Get Started With an End-to-End Sample

Introduction to ONNX Runtime

Watch Video (8:12)

ONNXRuntime-GenAI Installation and Inference Walkthrough

Watch Video (6:00)

Who is it for?

For LLM developers who want wide reach with cross-vendor and cross-OS support.

Inferencing Backends

Llama.cpp enables LLM-only inference across a variety of devices and platforms with unified APIs. It requires minimal setup, delivers good performance, and ships as a lightweight package. Llama.cpp is developed and maintained by a large open-source community and supports a wide range of LLMs.

Get Started With Llama.cpp
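
One way to call Llama.cpp from an application is through the llama-cpp-python bindings. The sketch below assumes a GGUF model file on disk, offloads all layers to the GPU, and generates a short completion; the model path is a placeholder.

    # Minimal llama-cpp-python sketch. "model.Q4_K_M.gguf" is a placeholder path
    # to any GGUF model; n_gpu_layers=-1 offloads all layers to the GPU when the
    # bindings are built with GPU (e.g., CUDA) support.
    from llama_cpp import Llama

    llm = Llama(model_path="model.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=2048)

    result = llm("Q: What is an inference backend? A:", max_tokens=64, stop=["Q:"])
    print(result["choices"][0]["text"])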

Model Optimization

Llama.cpp natively uses an optimized model format, GGUF, which enables strong model performance and lightweight deployment. It relies on quantization techniques to reduce model size and computational requirements so models can run across a variety of platforms.

Get Started With Llama.cpp Model Quantization
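
As a rough sketch of the usual workflow (script and binary names have shifted between Llama.cpp releases, so treat these commands as representative), you first convert a Hugging Face checkpoint to GGUF and then quantize it with one of the built-in presets.

    # Convert a Hugging Face model directory to a full-precision GGUF file
    # (script name in recent llama.cpp releases; older releases use convert-hf-to-gguf.py).
    python convert_hf_to_gguf.py ./my-model-dir --outfile my-model-f16.gguf

    # Quantize to the 4-bit Q4_K_M preset (the binary was formerly named "quantize").
    ./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M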

Deployment Mechanisms

With Llama.cpp, you can deploy in an out-of-process format, with a server running on localhost that apps communicate with over a REST API. Popular tools for this include Cortex, Ollama, and LMStudio. For in-process execution, Llama.cpp is linked into the app as a static (.lib) or dynamic (.dll) library.

Get Started With Ollama
Get Started With LMStudio

Get Started With Cortex
Get Started With In-process Execution
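
For example, an app can talk to a local Ollama server (which wraps Llama.cpp) over its REST API on localhost. A minimal sketch, assuming Ollama is running on its default port and a model named "llama3" has already been pulled:

    # Out-of-process sketch: call a local Ollama server's REST API.
    # "llama3" is an example model name; substitute whatever you have installed.
    import requests

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Explain GGUF in one sentence.", "stream": False},
        timeout=120,
    )
    print(response.json()["response"])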

Who is it for?

For developers looking for the latest features and maximum performance on NVIDIA RTX GPUs.

Inferencing Backends

NVIDIA® TensorRT™ offers maximum-performance deep learning inference on NVIDIA RTX GPUs, with GPU-specific TensorRT engines that extract the last ounce of performance from the GPU.

Get Started With TensorRT
Get Started With TensorRT-LLM
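
As a minimal sketch of what TensorRT inferencing looks like from Python, the snippet below deserializes a prebuilt engine and inspects its input and output tensors. The engine path is a placeholder, and buffer allocation plus the actual execute call are omitted for brevity.

    # Load a prebuilt TensorRT engine and inspect its I/O tensors.
    # "model.engine" is a placeholder for an engine built for this specific GPU.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    context = engine.create_execution_context()

    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        print(name, engine.get_tensor_shape(name), engine.get_tensor_dtype(name))

    # From here, allocate device buffers (e.g., with cuda-python) and run
    # inference through the execution context.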

Optimize Your Models

To optimize models within the TensorRT ecosystem, developers can use TensorRT-Model Optimizer. This unified library offers state-of-the-art model optimization techniques, such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT to optimize inference speed on NVIDIA GPUs.

Get Started With TensorRT Model Optimizer
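
A minimal post-training quantization sketch with TensorRT Model Optimizer's PyTorch API, following the pattern in its documentation. The toy model and calibration data are placeholders, and preset config names can vary between releases.

    # Post-training INT8 quantization sketch with TensorRT Model Optimizer (modelopt).
    import torch
    import modelopt.torch.quantization as mtq

    # Toy stand-in model and calibration data; replace with your own network and loader.
    model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
    calib_loader = [torch.randn(8, 16) for _ in range(8)]

    def forward_loop(m):
        # Run calibration batches so the inserted quantizers can collect activation statistics.
        for batch in calib_loader:
            m(batch)

    # INT8_DEFAULT_CFG is one of modelopt's documented preset configs; names may differ by release.
    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
    # The quantized model can then be exported to ONNX and built into a TensorRT engine.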

Deployment Mechanisms

Deploying TensorRT models requires three things: TensorRT, a TensorRT-optimized model, and a TensorRT engine.
TensorRT engines can be generated ahead of time or built within your app, using timing caches to shorten build time.

Get Started With NVIDIA TensorRT Deployment
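
A hedged sketch of in-app engine generation with the TensorRT Python API: the snippet below builds an engine from an ONNX model and reuses a timing cache between builds to shorten engine-generation time. File paths are placeholders, and trtexec offers an equivalent command-line flow.

    # Build a TensorRT engine from an ONNX model, reusing a timing cache between runs.
    # "model.onnx", "model.engine", and "timing.cache" are placeholder paths.
    import os
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # TensorRT 10 default; pass the EXPLICIT_BATCH flag on 8.x
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)

    # Load an existing timing cache if present so tactic timing is not repeated on rebuilds.
    cache_data = open("timing.cache", "rb").read() if os.path.exists("timing.cache") else b""
    timing_cache = config.create_timing_cache(cache_data)
    config.set_timing_cache(timing_cache, ignore_mismatch=False)

    engine_bytes = builder.build_serialized_network(network, config)

    with open("model.engine", "wb") as f:
        f.write(engine_bytes)
    with open("timing.cache", "wb") as f:
        f.write(config.get_timing_cache().serialize())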

Who is it for?

For developers looking to experiment with and evaluate AI while maintaining cohesion with model training pipelines.

Inferencing Backends

PyTorch is a popular open-source machine learning library that offers cross-platform and cross-device inferencing options.

Get Started With PyTorch
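
A minimal sketch of PyTorch inference on an RTX GPU, using a toy model as a stand-in for any trained network:

    # Minimal PyTorch CUDA inference sketch; the tiny Sequential model is a placeholder.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
    model.to(device).eval()

    x = torch.randn(1, 128, device=device)
    with torch.inference_mode():  # disables autograd bookkeeping for faster inference
        logits = model(x)

    print(logits.shape)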

Model Optimization

PyTorch offers several leading algorithms for model quantization, ranging from quantization-aware training (QAT) to post-training quantization (PTQ), as well as sparsity for in-framework model optimization.

Get Started With torchao
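
A minimal post-training, weight-only quantization sketch using torchao's documented quantize_ API. The toy model is a placeholder, and API names can shift between torchao releases.

    # Weight-only INT8 post-training quantization with torchao on a CUDA device.
    # The tiny model is a placeholder; API names may vary across torchao releases.
    import torch
    from torchao.quantization import quantize_, int8_weight_only

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
    ).to(device="cuda", dtype=torch.bfloat16).eval()

    # Replace Linear weights with INT8 weight-only quantized versions in place.
    quantize_(model, int8_weight_only())

    with torch.inference_mode():
        out = model(torch.randn(1, 128, device="cuda", dtype=torch.bfloat16))
    print(out.shape)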

Deployment Mechanisms

To serve models in production applications, PyTorch developers often deploy in an out-of-process format. This typically involves building Python packages, generating model files, and standing up a localhost server, and it can be streamlined with frameworks such as TorchServe and HuggingFace Accelerate.

Get Started With torchserve
Get Started With HuggingFace Accelerate
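
Once a model archive is registered with a local TorchServe instance, the application side reduces to an HTTP call. A minimal sketch, assuming TorchServe's default inference port and a model registered under the hypothetical name "my_model":

    # Out-of-process inference sketch against a local TorchServe instance.
    # Assumes TorchServe is running on its default inference port (8080) and a model
    # archive has been registered under the hypothetical name "my_model".
    import requests

    with open("input.json", "rb") as f:  # placeholder payload; the format depends on your handler
        payload = f.read()

    response = requests.post("http://localhost:8080/predictions/my_model", data=payload, timeout=60)
    print(response.json())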

Choosing an Inferencing Backend

| | ONNX Runtime With DirectML | TensorRT and TensorRT-LLM | Llama.cpp | PyTorch-CUDA |
|---|---|---|---|---|
| Performance | Faster | Fastest | Fast | Good |
| OS Support | Windows | Windows and Linux (TensorRT-LLM is Linux only) | Windows, Linux, and Mac | Windows and Linux |
| Hardware Support | Any GPU or CPU | NVIDIA RTX GPUs | Any GPU or CPU | Any GPU or CPU |
| Model Checkpoint Format | ONNX | TRT | GGUF or GGML | PyT |
| Installation Process | Pre-installed on Windows | Installation of Python packages required | Installation of Python packages required | Installation of Python packages required |
| LLM Support | ✔️ | ✔️ | ✔️ | ✔️ |
| CNN Support | ✔️ | ✔️ | - | ✔️ |
| Device-Specific Optimizations | Microsoft Olive | TensorRT Model Optimizer | Llama.cpp | - |
| Python | ✔️ | ✔️ | ✔️ | ✔️ |
| C/C++ | ✔️ | ✔️ | ✔️ | ✔️ |
| C#/.NET | ✔️ | - | ✔️ | - |
| JavaScript | ✔️ | - | ✔️ | - |

Latest NVIDIA News


Stay up to date on how to power your AI apps with NVIDIA RTX PCs.

Learn More