Accelerate AI With NVIDIA RTX PCs

NVIDIA RTX™ PCs accelerate your AI features for maximum performance and lowest latency. NVIDIA offers broad support across all major AI inference backends to meet every developer’s needs.


Overview of AI Inference Backends

Developers need to weigh several factors before choosing a deployment ecosystem and path for their application. Inference backends map model execution onto hardware, and the top options are optimized for NVIDIA RTX GPUs. Each backend pairs this with specific model optimization tools and deployment mechanisms: achieving peak AI performance requires techniques such as quantization and pruning, while higher-level interfaces streamline application packaging, installation, and integration.

Diagram of inference backend model optimization tools

Who Is Windows ML For?

For developers who want to deploy performant, cross-vendor apps across Windows PCs.

Inferencing Backends

Windows ML Runtime, built on ONNX Runtime, allows developers to run ONNX models locally across the full range of PC hardware—including CPUs, NPUs, and GPUs.

Windows ML automatically picks which execution provider to use based on the hardware available on the user’s PC, then downloads all the files necessary for that hardware.

Windows ML is powered by NVIDIA TensorRT™ for RTX on NVIDIA GPUs for maximum performance.

For All AI Models—Get Started With DirectML AI Inferencing
For LLMs—Get Started With ONNX Runtime Gen AI Inferencing
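
Since Windows ML is built on ONNX Runtime, a minimal ONNX Runtime sketch in Python illustrates the same load-and-run flow. This is not the Windows ML API itself; the model path, input shape, and explicit provider list below are placeholder assumptions (Windows ML selects execution providers for you).

# Minimal ONNX Runtime sketch (Windows ML is built on ONNX Runtime).
# "model.onnx" and the input shape are placeholders for your own model.
import numpy as np
import onnxruntime as ort

# Prefer a GPU execution provider when available, falling back to CPU.
# Windows ML handles this selection automatically.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_meta = session.get_inputs()[0]
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed shape

outputs = session.run(None, {input_meta.name: dummy_input})
print(outputs[0].shape)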

Model Optimization

The Olive optimization toolkit offers hardware-aware quantization across CPUs, NPUs, and NVIDIA RTX GPUs—with easy integration into the Windows ML inferencing backend. You can also use TensorRT-Model Optimizer to perform quantization for ONNX models.

Get Started With Olive
Get Started With TensorRT-Model Optimizer
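
Olive and TensorRT-Model Optimizer are the recommended hardware-aware tools linked above. Purely to illustrate what ONNX quantization looks like mechanically, the sketch below uses ONNX Runtime's built-in dynamic quantizer (not Olive); the file names are placeholders.

# Post-training dynamic quantization of an ONNX model with ONNX Runtime's
# built-in quantizer. Olive and TensorRT-Model Optimizer provide hardware-aware
# flows on top of this idea.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",   # placeholder input model
    model_output="model_int8.onnx",  # placeholder output path
    weight_type=QuantType.QInt8,     # quantize weights to signed 8-bit integers
)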

Deployment Mechanisms

Packaging and deploying Windows ML-based apps on PCs is simple. Your application and the Windows ML runtime are decoupled, allowing for over-the-air (OTA) updates. Just add a reference to Windows ML within your project, and Windows ML manages everything else—downloading and installing the runtime, execution providers, and all dependencies, and handling versioning.

Get Started With an End-To-End Sample

Introduction to ONNX Runtime

Watch Video (8:12)

ONNXRuntime-GenAI Installation and Inference Walkthrough

Watch Video (6:00)

Who Is Ollama For?

For large language model (LLM) developers who want wide reach with cross-vendor and cross-OS support.

Inferencing Backends

Ollama enables LLM-only inference across a variety of devices and platforms with unified APIs. It requires minimal setup, delivers good performance, and ships as a lightweight package. Ollama is powered by Llama.cpp and the GGUF model format.

Get Started With Ollama

Model Optimization

Ollama leverages model optimization formats such as GGUF, which is used both within and outside the Llama.cpp tooling. The format enables strong model performance and lightweight deployment, using quantization to reduce a model’s size and compute requirements so it can run across a variety of platforms.

Get Started With Llama.cpp Model Quantization

Deployment Mechanisms

With Ollama, you can deploy in an out-of-process format, with a server running on localhost. Apps communicate with this server using a REST API.

Get Started With Ollama
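
As a concrete example of the localhost REST pattern described above, the Python sketch below sends a single prompt to Ollama's default endpoint. The model name is an assumption and must already be pulled (for example with "ollama pull").

# Call a locally running Ollama server over its REST API.
# Assumes Ollama is running on its default port and the model has been pulled.
import json
import urllib.request

payload = {
    "model": "llama3.1",  # placeholder; use any model you've pulled
    "prompt": "Explain GGUF quantization in one sentence.",
    "stream": False,      # return a single JSON response instead of a stream
}

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])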

Who Is NVIDIA TensorRT for RTX For?

For developers looking for the latest features, maximum performance, and full behavior control on NVIDIA RTX GPUs.

Inferencing Backends

NVIDIA TensorRT™ for RTX offers the best performance for AI on RTX PCs, is lightweight for easy packaging into applications, and can generate optimized engines in just seconds on device.

Get Started With TensorRT for RTX

Optimize Your Models

TensorRT for RTX uses a just-in-time (JIT) engine builder to compile any ONNX model with optimizations that take full advantage of the user’s specific GPU configuration. This happens transparently to the user and takes less than 30 seconds on first setup.

Get Started With TensorRT for RTX
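
To make the ONNX-to-engine flow concrete, here is a sketch using the standard TensorRT Python API. TensorRT for RTX ships its own, similar interface, so treat the module and call names below as illustrative of the compilation step rather than the TensorRT for RTX API itself; the model file name is a placeholder and a TensorRT 10.x Python install is assumed.

# ONNX -> optimized engine compilation, sketched with the standard TensorRT
# Python API to illustrate the step TensorRT for RTX performs just-in-time.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit-batch network (TensorRT 10.x)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder model file
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)  # hardware-specific plan

with open("model.engine", "wb") as f:
    f.write(engine_bytes)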

Deployment Mechanisms

With TensorRT for RTX, deploying AI apps is simple. Developers can include both the model and the lightweight TensorRT runtime (under 200 MB) inside their applications. When a user installs the app, or on first run, TensorRT for RTX quickly compiles the model for their specific hardware in under 30 seconds, ensuring peak performance.

Learn More About TensorRT for RTX

Who Is PyTorch For?

For developers looking to experiment with and evaluate AI while maintaining cohesion with model training pipelines.

Inferencing Backends

PyTorch is a popular open-source machine learning library that offers cross-platform and cross-device inferencing options.

Get Started With PyTorch

Model Optimization

PyTorch offers several leading algorithms for model quantization, ranging from quantization-aware training (QAT) to post-training quantization (PTQ), as well as sparsity for in-framework model optimization.

Get Started With torchao
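
As a minimal post-training quantization (PTQ) illustration, the sketch below uses PyTorch's built-in torch.ao.quantization module on a toy model; torchao, linked above, provides the newer QAT, PTQ, and sparsity APIs.

# Post-training dynamic quantization with PyTorch's built-in tooling.
# The toy model stands in for a real trained network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Quantize Linear layer weights to int8; activations are quantized dynamically.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)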

Deployment Mechanisms

To serve models in production applications, PyTorch developers often deploy using an out-of-process format. This requires building Python packages, generating model files, and standing up a localhost server. The process can be streamlined with frameworks such as TorchServe and Hugging Face Accelerate; a minimal sketch of the out-of-process pattern follows the links below.

Get Started With torchserve
Get Started With HuggingFace Accelerate
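
The sketch below makes the out-of-process pattern concrete with a bare localhost HTTP endpoint built from Python's standard library wrapping a toy PyTorch model. TorchServe and Hugging Face Accelerate, linked above, are the production-grade routes; the model, port, and request format here are assumptions.

# Minimal out-of-process serving sketch: a localhost HTTP endpoint wrapping a
# PyTorch model. TorchServe or Accelerate replace this in real deployments.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy model standing in for a real one
model.eval()

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["inputs"]  # e.g. {"inputs": [1, 2, 3, 4]}
        with torch.no_grad():
            outputs = model(torch.tensor(features, dtype=torch.float32))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"outputs": outputs.tolist()}).encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), InferenceHandler).serve_forever()  # assumed port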

Who Is NVIDIA NIM For?

For generative AI experimentation and workflow building.

Inferencing Backends

NVIDIA NIM™ microservices are optimized, pre-packaged models for generative AI. They leverage the power of NVIDIA TensorRT™ to provide performance-optimized generative AI models on the PC.

Get Started With NVIDIA NIM

Model Optimization

NVIDIA NIM comes pre-optimized for RTX AI PCs. It can include quantized model checkpoints and engines that are optimized for memory resource utilization on NVIDIA RTX GPUs.

Get Started With NVIDIA NIM

Deployment Mechanisms

With NVIDIA NIM, you can deploy anywhere, from PC to cloud. On an RTX AI PC, you get a container with a REST API running on localhost. You can easily leverage the REST API to build custom generative AI workflows and agentic applications.

Get Started With NVIDIA NIM
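
For example, an LLM NIM running locally typically exposes an OpenAI-compatible chat endpoint. The port, path, and model name below are assumptions that depend on the specific NIM container you run; check that container's documentation.

# Query a NIM microservice running on localhost via its REST API.
# Endpoint and model identifier are assumptions for illustration.
import json
import urllib.request

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model identifier
    "messages": [{"role": "user", "content": "Summarize what a NIM is."}],
    "max_tokens": 128,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed default NIM endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    reply = json.loads(response.read())
    print(reply["choices"][0]["message"]["content"])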

Choosing an Inferencing Backend

Windows ML
For: Application developers building AI features for Windows PCs
Performance: Faster
OS Support: Windows
Hardware Support: Any GPU or CPU
Model Checkpoint Format: ONNX
Installation Process: Pre-installed on Windows
Model Optimizations: Microsoft Olive

TensorRT for RTX
For: Application developers who want maximum control and flexibility of AI behavior on NVIDIA RTX GPUs
Performance: Fastest
OS Support: Windows and Linux
Hardware Support: NVIDIA RTX GPUs
Model Checkpoint Format: ONNX
Installation Process: Install SDK and Python bindings
Model Optimizations: TensorRT-Model Optimizer

Ollama
For: LLM developers who want wide reach with cross-vendor and cross-OS support
Performance: Fast
OS Support: Windows, Linux, and Mac
Hardware Support: Any GPU or CPU
Model Checkpoint Format: GGUF or GGML
Installation Process: Installation of Python packages required
Model Optimizations: Llama.cpp

NVIDIA NIM
For: Developers experimenting with generative AI and building agentic workflows
Performance: Fast
OS Support: Windows Subsystem for Linux (WSL) and Linux
Hardware Support: Any NVIDIA GPU
Model Checkpoint Format: Various
Installation Process: Run via Podman containers
Model Optimizations: Various

PyTorch-CUDA
For: Developers experimenting with and evaluating AI while maintaining cohesion with model training pipelines
Performance: Good
OS Support: Windows and Linux
Hardware Support: Any GPU or CPU
Model Checkpoint Format: PyTorch
Installation Process: Installation of Python packages required
Model Optimizations: -

LLM support (coming soon in one case), CNN support, and API language bindings (Python, C/C++, C#/.NET, and JavaScript) also vary by backend; Ollama, for example, is LLM-only.

Latest NVIDIA News


Stay up to date on how to power your AI apps with NVIDIA RTX PCs.

Learn More