Accelerate AI With NVIDIA RTX PCs

NVIDIA RTX™ PCs accelerate your AI features for maximum performance and lowest latency. NVIDIA offers broad support across all major AI inference backends to meet every developer’s needs.


Overview of AI Inference Backends

Developers need to consider several factors before choosing a deployment ecosystem and path for their application. An inference backend maps model execution to hardware, and the leading options are optimized for NVIDIA RTX GPUs. Each backend offers its own model optimization tools and deployment mechanisms for efficient application integration. Achieving peak AI performance requires model optimization techniques like quantization and pruning, while higher-level interfaces streamline application packaging, installation, and integration.

Who is it for?

For developers who want to deploy performant, cross-vendor apps across Windows OS.

Inferencing Backends

ONNX Runtime, in conjunction with the DirectML backend, is a cross-platform machine-learning model accelerator; on Windows, it gives applications access to hardware-specific optimizations across GPUs from any vendor.

For AI Models—Get Started With DirectML AI Inferencing
For Generative AI—Get Started With ONNX Runtime GenAI Inferencing
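
As a minimal sketch of what DirectML inferencing looks like from Python, the snippet below loads an ONNX model with the onnxruntime-directml package and runs it on the DirectML execution provider. The model path, input name, and input shape are placeholders for your own model.

    # Minimal ONNX Runtime + DirectML sketch (requires the onnxruntime-directml package).
    # "model.onnx" and the dummy input shape are placeholders for your own model.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(
        "model.onnx",
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # DirectML first, CPU fallback
    )

    input_name = session.get_inputs()[0].name
    dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

    outputs = session.run(None, {input_name: dummy_input})
    print(outputs[0].shape)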

Model Optimization

The Olive optimization toolkit offers quantization across CPUs, NPUs, and NVIDIA RTX GPUs, with easy integration into the ONNX Runtime with DirectML inferencing backend. You can also use TensorRT Model Optimizer to perform quantization for ONNX models.

Get Started With Olive
Get Started with TensorRT Model Optimizer
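
Olive is typically driven by a workflow configuration, and TensorRT Model Optimizer has its own ONNX quantization flow. As a simpler, hedged illustration of the underlying idea, the sketch below applies ONNX Runtime's built-in post-training dynamic quantization to shrink an ONNX model's weights to INT8; the file names are placeholders.

    # Post-training dynamic quantization with ONNX Runtime's built-in quantizer.
    # This illustrates the concept only; Olive and TensorRT Model Optimizer
    # provide more advanced, hardware-aware quantization recipes.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        "model_fp32.onnx",            # placeholder: original FP32 model
        "model_int8.onnx",            # placeholder: quantized output
        weight_type=QuantType.QInt8,  # quantize weights to 8-bit integers
    )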

Deployment Mechanisms

Packaging and deploying ONNX Runtime apps on PCs is simple. DirectML comes pre-installed in Windows. All you need to do is ship your model and, for LLMs, the ONNX Runtime GenAI SDK.

Get Started With an End-to-End Sample

Introduction to ONNX Runtime

Watch Video (8:12)

ONNXRuntime-GenAI Installation and Inference Walkthrough

Watch Video (6:00)

Who is it for?

For LLM developers who want wide reach with cross-vendor and cross-OS support.

Inferencing Backends

Llama.cpp enables LLM-only inference across a variety of devices and platforms with unified APIs. It requires minimal setup, delivers good performance, and ships as a lightweight package. Llama.cpp is developed and maintained by a large open-source community and supports a wide range of LLMs.

Get Started With Llama.cpp
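
One way to call Llama.cpp from an application is through the llama-cpp-python bindings. The sketch below assumes a GGUF model file on disk, offloads all layers to the GPU, and generates a short completion; the model path is a placeholder.

    # Minimal llama-cpp-python sketch. "model.Q4_K_M.gguf" is a placeholder path
    # to any GGUF model; n_gpu_layers=-1 offloads all layers to the GPU when the
    # bindings are built with GPU (e.g., CUDA) support.
    from llama_cpp import Llama

    llm = Llama(model_path="model.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=2048)

    result = llm("Q: What is an inference backend? A:", max_tokens=64, stop=["Q:"])
    print(result["choices"][0]["text"])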

Model Optimization

Llama.cpp natively uses an optimized model format, GGUF, which enables strong model performance and lightweight deployment. It relies on quantization techniques to reduce model size and computational requirements so models can run across a variety of platforms.

Get Started With Llama.cpp Model Quantization
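
As a rough sketch of the usual workflow (script and binary names have shifted between Llama.cpp releases, so treat these commands as representative), you first convert a Hugging Face checkpoint to GGUF and then quantize it with one of the built-in presets.

    # Convert a Hugging Face model directory to a full-precision GGUF file
    # (script name in recent llama.cpp releases; older releases use convert-hf-to-gguf.py).
    python convert_hf_to_gguf.py ./my-model-dir --outfile my-model-f16.gguf

    # Quantize to the 4-bit Q4_K_M preset (the binary was formerly named "quantize").
    ./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M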

Deployment Mechanisms

With Llama.cpp, you can deploy in an out-of-process format, with a server running on localhost that apps communicate with over a REST API. Popular tools for this include Cortex, Ollama, and LMStudio. For in-process execution, Llama.cpp is linked into the app as a static (.lib) or dynamic (.dll) library.

Get Started With Ollama
Get Started With LMStudio

Get Started With Cortex
Get Started With In-process Execution
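
For example, an app can talk to a local Ollama server (which wraps Llama.cpp) over its REST API on localhost. A minimal sketch, assuming Ollama is running on its default port and a model named "llama3" has already been pulled:

    # Out-of-process sketch: call a local Ollama server's REST API.
    # "llama3" is an example model name; substitute whatever you have installed.
    import requests

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Explain GGUF in one sentence.", "stream": False},
        timeout=120,
    )
    print(response.json()["response"])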

Who is it for?

For developers looking for the latest features and maximum performance on NVIDIA RTX GPUs.

Inferencing Backends

NVIDIA® TensorRT™ offers maximum-performance deep learning inference on NVIDIA RTX GPUs, with GPU-specific TensorRT engines that extract the last ounce of performance from the GPU.

Get Started With TensorRT
Get Started With TensorRT-LLM
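
As a minimal sketch of what TensorRT inferencing looks like from Python, the snippet below deserializes a prebuilt engine and inspects its input and output tensors. The engine path is a placeholder, and buffer allocation plus the actual execute call are omitted for brevity.

    # Load a prebuilt TensorRT engine and inspect its I/O tensors.
    # "model.engine" is a placeholder for an engine built for this specific GPU.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    context = engine.create_execution_context()

    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        print(name, engine.get_tensor_shape(name), engine.get_tensor_dtype(name))

    # From here, allocate device buffers (e.g., with cuda-python) and run
    # inference through the execution context.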

Optimize Your Models

To optimize models within the TensorRT ecosystem, developers can use TensorRT-Model Optimizer. This unified library offers state-of-the-art model optimization techniques, such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT to optimize inference speed on NVIDIA GPUs.

Get Started With TensorRT Model Optimizer
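
A minimal post-training quantization sketch with TensorRT Model Optimizer's PyTorch API, following the pattern in its documentation. The toy model and calibration data are placeholders, and preset config names can vary between releases.

    # Post-training INT8 quantization sketch with TensorRT Model Optimizer (modelopt).
    import torch
    import modelopt.torch.quantization as mtq

    # Toy stand-in model and calibration data; replace with your own network and loader.
    model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
    calib_loader = [torch.randn(8, 16) for _ in range(8)]

    def forward_loop(m):
        # Run calibration batches so the inserted quantizers can collect activation statistics.
        for batch in calib_loader:
            m(batch)

    # INT8_DEFAULT_CFG is one of modelopt's documented preset configs; names may differ by release.
    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
    # The quantized model can then be exported to ONNX and built into a TensorRT engine.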

Deployment Mechanisms

Deploying TensorRT models requires three things: TensorRT, a TensorRT-optimized model, and a TensorRT engine.
TensorRT engines can be generated ahead of time or built within your app, using timing caches to shorten build time.

Get Started With NVIDIA TensorRT Deployment
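
A hedged sketch of in-app engine generation with the TensorRT Python API: the snippet below builds an engine from an ONNX model and reuses a timing cache between builds to shorten engine-generation time. File paths are placeholders, and trtexec offers an equivalent command-line flow.

    # Build a TensorRT engine from an ONNX model, reusing a timing cache between runs.
    # "model.onnx", "model.engine", and "timing.cache" are placeholder paths.
    import os
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # TensorRT 10 default; pass the EXPLICIT_BATCH flag on 8.x
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)

    # Load an existing timing cache if present so tactic timing is not repeated on rebuilds.
    cache_data = open("timing.cache", "rb").read() if os.path.exists("timing.cache") else b""
    timing_cache = config.create_timing_cache(cache_data)
    config.set_timing_cache(timing_cache, ignore_mismatch=False)

    engine_bytes = builder.build_serialized_network(network, config)

    with open("model.engine", "wb") as f:
        f.write(engine_bytes)
    with open("timing.cache", "wb") as f:
        f.write(config.get_timing_cache().serialize())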

Who is it for?

For developers looking to experiment with and evaluate AI while maintaining cohesion with model training pipelines.

Inferencing Backends

PyTorch is a popular open-source machine learning library that offers cross-platform and cross-device inferencing options.

Get Started With PyTorch
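
A minimal sketch of PyTorch inference on an RTX GPU, using a toy model as a stand-in for any trained network:

    # Minimal PyTorch CUDA inference sketch; the tiny Sequential model is a placeholder.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
    model.to(device).eval()

    x = torch.randn(1, 128, device=device)
    with torch.inference_mode():  # disables autograd bookkeeping for faster inference
        logits = model(x)

    print(logits.shape)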

Model Optimization

PyTorch offers several leading algorithms for model quantization, ranging from quantization-aware training (QAT) to post-training quantization (PTQ), as well as sparsity for in-framework model optimization.

Get Started With torchao
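
A minimal post-training, weight-only quantization sketch using torchao's documented quantize_ API. The toy model is a placeholder, and API names can shift between torchao releases.

    # Weight-only INT8 post-training quantization with torchao on a CUDA device.
    # The tiny model is a placeholder; API names may vary across torchao releases.
    import torch
    from torchao.quantization import quantize_, int8_weight_only

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
    ).to(device="cuda", dtype=torch.bfloat16).eval()

    # Replace Linear weights with INT8 weight-only quantized versions in place.
    quantize_(model, int8_weight_only())

    with torch.inference_mode():
        out = model(torch.randn(1, 128, device="cuda", dtype=torch.bfloat16))
    print(out.shape)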

Deployment Mechanisms

To serve models in production applications, PyTorch developers often deploy in an out-of-process format. This typically involves building Python packages, generating model files, and standing up a localhost server, and it can be streamlined with frameworks such as TorchServe and HuggingFace Accelerate.

Get Started With torchserve
Get Started With HuggingFace Accelerate
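
Once a model archive is registered with a local TorchServe instance, the application side reduces to an HTTP call. A minimal sketch, assuming TorchServe's default inference port and a model registered under the hypothetical name "my_model":

    # Out-of-process inference sketch against a local TorchServe instance.
    # Assumes TorchServe is running on its default inference port (8080) and a model
    # archive has been registered under the hypothetical name "my_model".
    import requests

    with open("input.json", "rb") as f:  # placeholder payload; the format depends on your handler
        payload = f.read()

    response = requests.post("http://localhost:8080/predictions/my_model", data=payload, timeout=60)
    print(response.json())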

Choosing an Inferencing Backend

| | ONNX Runtime With DirectML | TensorRT and TensorRT-LLM | Llama.cpp | PyTorch-CUDA |
|---|---|---|---|---|
| Performance | Faster | Fastest | Fast | Good |
| OS Support | Windows | Windows and Linux (TensorRT-LLM is Linux only) | Windows, Linux, and Mac | Windows and Linux |
| Hardware Support | Any GPU or CPU | NVIDIA RTX GPUs | Any GPU or CPU | Any GPU or CPU |
| Model Checkpoint Format | ONNX | TRT | GGUF or GGML | PyT |
| Installation Process | Pre-installed on Windows | Installation of Python packages required | Installation of Python packages required | Installation of Python packages required |
| LLM Support | ✔️ | ✔️ | ✔️ | ✔️ |
| CNN Support | ✔️ | ✔️ | - | ✔️ |
| Device-Specific Optimizations | Microsoft Olive | TensorRT Model Optimizer | Llama.cpp | - |
| Python | ✔️ | ✔️ | ✔️ | ✔️ |
| C/C++ | ✔️ | ✔️ | ✔️ | ✔️ |
| C#/.NET | ✔️ | - | ✔️ | - |
| JavaScript | ✔️ | - | ✔️ | - |

Latest NVIDIA News


Stay up to date on how to power your AI apps with NVIDIA RTX PCs.

Learn More