Accelerate AI With NVIDIA RTX PCs
NVIDIA RTX™ PCs accelerate your AI features for maximum performance and minimal latency. NVIDIA offers broad support across all major AI inference backends to meet every developer’s needs.
Overview of AI Inference Backends
Developers need to consider several factors before choosing a deployment ecosystem and path for their application. Inference backends map model execution to hardware, and the leading options are optimized for NVIDIA RTX GPUs. Each backend offers specific model optimization tools and deployment mechanisms for efficient application integration. Achieving peak AI performance requires model optimization techniques like quantization and pruning, while higher-level interfaces streamline application packaging, installation, and integration.

Who Is Windows ML For?
For developers who want to deploy performant, cross-vendor apps on Windows.
Inferencing Backends
The Windows ML runtime, built on ONNX Runtime, lets developers run ONNX models locally across the full range of PC hardware, including CPUs, NPUs, and GPUs.
Windows ML automatically selects which execution provider to use based on the hardware available on the user’s PC, then downloads the necessary files for that hardware.
Windows ML is powered by NVIDIA TensorRT™ for RTX on NVIDIA GPUs for maximum performance.
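To make the flow concrete, here is a minimal sketch using the ONNX Runtime Python API, which Windows ML builds on. The model path, input name, and input shape are placeholders; in a Windows ML deployment, execution-provider selection and dependency download happen for you.

```python
# Minimal ONNX Runtime inference sketch (Windows ML is built on ONNX Runtime).
# "model.onnx", the input name, and the input shape are placeholders.
import numpy as np
import onnxruntime as ort

# Let ONNX Runtime choose from the execution providers available on this PC
# (Windows ML performs this selection and dependency download automatically).
session = ort.InferenceSession("model.onnx", providers=ort.get_available_providers())

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example image-shaped tensor

outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```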
Model Optimization
The Olive optimization toolkit offers hardware-aware quantization across CPUs, NPUs, and NVIDIA RTX GPUs—with easy integration into the Windows ML inferencing backend. You can also use TensorRT-Model Optimizer to perform quantization for ONNX models.
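Both tools are typically driven by their own CLIs and configuration files. As a simple illustration of the kind of post-training quantization they automate, here is a sketch using ONNX Runtime’s built-in dynamic quantization; the filenames are placeholders, and this stands in for, rather than reproduces, the Olive or TensorRT-Model Optimizer workflows.

```python
# Post-training dynamic quantization of an ONNX model with ONNX Runtime's tooling.
# Illustrates the kind of INT8 weight quantization that Olive and
# TensorRT-Model Optimizer orchestrate; filenames are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",   # original FP32 model
    model_output="model_int8.onnx",  # quantized output
    weight_type=QuantType.QInt8,     # quantize weights to INT8
)
```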
Get Started With Olive
Get Started With TensorRT-Model Optimizer
Deployment Mechanisms
Packaging and deploying Windows ML-based apps on PCs is simple. Your application and Windows ML are decoupled, allowing for over-the-air (OTA) updates. Just add a reference to Windows ML within your project, and Windows ML will manage downloading and installing everything else, including versioning, execution providers, the runtime, and all dependencies.
Get Started With an End-To-End Sample
Introduction to ONNX Runtime
Watch Video (8:12)
ONNXRuntime-GenAI Installation and Inference Walkthrough
Watch Video (6:00)
Who Is Ollama For?
For large language model (LLM) developers who want wide reach with cross-vendor and cross-OS support.
Inferencing Backends
Ollama enables LLM-only inference across a variety of devices and platforms with unified APIs. It requires minimal setup, delivers good performance, and ships as a lightweight package. Ollama is powered by Llama.cpp and the GGUF model format.
Get Started With Ollama
Model Optimization
Ollama leverages model optimization formats such as GGUF, both within and outside the Llama.cpp tooling. The format enables strong model performance and lightweight deployment, using quantization techniques to reduce a model’s size and computational requirements so it can run across a variety of platforms.
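As a rough sketch, the example below shells out to Llama.cpp’s quantization tool to convert an FP16 GGUF checkpoint to 4-bit. The binary name (llama-quantize in recent Llama.cpp builds), the file paths, and the Q4_K_M preset are assumptions to adjust for your own setup.

```python
# Quantize an FP16 GGUF model to 4-bit with Llama.cpp's quantization tool.
# Binary name, file paths, and the Q4_K_M preset are assumptions; adjust as needed.
import subprocess

subprocess.run(
    [
        "llama-quantize",       # ships with recent Llama.cpp builds
        "model-f16.gguf",       # input: full-precision GGUF checkpoint
        "model-q4_k_m.gguf",    # output: quantized GGUF checkpoint
        "Q4_K_M",               # 4-bit quantization preset
    ],
    check=True,
)
```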
Get Started With Llama.cpp Model Quantization
Deployment Mechanisms
With Ollama, you can deploy in an out-of-process format, with a server running on localhost. Apps communicate with this server using a REST API.
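For example, here is a minimal sketch of calling the local Ollama server over its REST API. It assumes Ollama is running on its default port (11434) and that the model name is a placeholder for a model you have already pulled.

```python
# Send a prompt to a locally running Ollama server over its REST API.
# Assumes Ollama is running on its default port and the model has been pulled
# (e.g. with `ollama pull llama3.1`); the model name is a placeholder.
import json
import urllib.request

payload = {
    "model": "llama3.1",
    "prompt": "Explain what an inference backend is in one sentence.",
    "stream": False,  # return a single JSON response instead of a stream
}

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```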
Get Started With Ollama
Who Is NVIDIA TensorRT for RTX For?
For developers looking for the latest features, maximum performance, and full behavior control on NVIDIA RTX GPUs.
Inferencing Backends
NVIDIA TensorRT™ for RTX offers the best performance for AI on RTX PCs, is lightweight for easy packaging into applications, and can generate optimized engines in just seconds on device.
Model Optimization
TensorRT for RTX uses a just-in-time (JIT) engine builder to compile any ONNX model with optimizations that take full advantage of the user’s specific GPU configuration. This happens transparently to the user, taking less than 30 seconds on first setup.
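As a rough illustration of what an engine build looks like, here is a sketch using the standard TensorRT Python builder API. TensorRT for RTX ships its own SDK and bindings for the on-device JIT flow, so treat the module name and exact calls here as assumptions and follow its documentation for the RTX-specific path; the model and engine filenames are placeholders.

```python
# Sketch of building a TensorRT engine from an ONNX model with the standard
# TensorRT Python API. TensorRT for RTX performs a similar compilation
# just-in-time on the user's GPU; consult its SDK docs for the exact API.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()       # explicit batch is the default in recent TensorRT
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:      # placeholder model path
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:    # serialized engine for later reuse
    f.write(engine_bytes)
```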
Get Started With TensorRT for RTX
Deployment Mechanisms
With TensorRT for RTX, deploying AI apps is straightforward. Developers can include both the model and the lightweight TensorRT runtime (under 200MB) inside their applications. When a user installs the app, or on first run, TensorRT for RTX quickly compiles the model for their specific hardware in under 30 seconds, ensuring peak performance.
Learn More About TensorRT for RTX
Who Is PyTorch For?
For developers looking to experiment with and evaluate AI while maintaining cohesion with model training pipelines.
Inferencing Backends
PyTorch is a popular open-source machine learning library that offers cross-platform and cross-device inferencing options.
Model Optimization
PyTorch offers several leading algorithms for model quantization, ranging from quantization-aware training (QAT) to post-training quantization (PTQ), as well as sparsity for in-framework model optimization.
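For instance, here is a minimal post-training dynamic quantization sketch using PyTorch’s built-in torch.ao.quantization utilities; the two-layer network is a placeholder for your own model.

```python
# Post-training dynamic quantization of a small PyTorch model.
# The network is a placeholder; dynamic quantization converts Linear layers
# to INT8 weights with dynamically quantized activations.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

example_input = torch.randn(1, 128)
print(quantized_model(example_input).shape)  # torch.Size([1, 10])
```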
Deployment Mechanisms
To serve PyTorch models in production applications, developers often deploy in an out-of-process format. This requires building Python packages, generating model files, and standing up a localhost server. Frameworks such as TorchServe and Hugging Face Accelerate streamline this process.
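Once a TorchServe instance is running on localhost, the application talks to it over REST. Here is a minimal sketch; the port is TorchServe’s default inference port, and the model name and input file are placeholders.

```python
# Query a locally running TorchServe instance over its inference REST API.
# Assumes a model archive named "my_model" has already been registered and
# TorchServe is listening on its default inference port (8080).
import urllib.request

with open("input.jpg", "rb") as f:   # placeholder input payload
    payload = f.read()

request = urllib.request.Request(
    "http://localhost:8080/predictions/my_model",
    data=payload,
    method="POST",
)

with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
```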
Get Started With torchserve
Get Started With HuggingFace Accelerate
Who Is NVIDIA NIM For?
For generative AI experimentation and workflow building.
Inferencing Backends
NVIDIA NIM™ microservices are optimized, pre-packaged models for generative AI. They leverage the power of NVIDIA TensorRT™ to provide performance-optimized generative AI models on the PC.
Get Started With NVIDIA NIM
Model Optimization
NVIDIA NIM comes pre-optimized for RTX AI PCs. It can include quantized model checkpoints and engines that are optimized for memory resource utilization on NVIDIA RTX GPUs.
Get Started With NVIDIA NIM
Deployment Mechanisms
With NVIDIA NIM, you can deploy anywhere, from PC to cloud. On an RTX AI PC, you get a container with a REST API running on localhost. You can easily leverage the REST API to build custom generative AI workflows and agentic applications.
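For example, here is a sketch of calling a locally running LLM NIM through its OpenAI-compatible REST endpoint. The port (8000 is a common default for NIM containers) and the model identifier are assumptions that depend on which microservice you deploy.

```python
# Call a locally running NVIDIA NIM LLM microservice over its
# OpenAI-compatible REST API. The port and model name are placeholders
# that depend on the specific NIM container you deploy.
import json
import urllib.request

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Summarize what a NIM microservice is."}
    ],
    "max_tokens": 128,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
    print(result["choices"][0]["message"]["content"])
```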
Get Started With NVIDIA NIM
Choosing an Inferencing Backend
|  | Windows ML | TensorRT for RTX | Ollama | NVIDIA NIM | PyTorch-CUDA |
| --- | --- | --- | --- | --- | --- |
| For | Application developers building AI features for Windows PC | Application developers who want maximum control and flexibility of AI behavior on NVIDIA RTX GPUs | LLM developers who want wide reach with cross-vendor and cross-OS support | Developers experimenting with generative AI and building agentic workflows | Developers experimenting with and evaluating AI while maintaining cohesion with model training pipelines |
| Performance | Faster | Fastest | Fast | Fast | Good |
| OS Support | Windows | Windows and Linux | Windows, Linux, and Mac | Windows Subsystem for Linux (WSL) and Linux | Windows and Linux |
| Hardware Support | Any GPU or CPU | NVIDIA RTX GPUs | Any GPU or CPU | Any NVIDIA GPU | Any GPU or CPU |
| Model Checkpoint Format | ONNX | ONNX | GGUF or GGML | Various | PyTorch |
| Installation Process | Pre-installed on Windows | Install SDK and Python bindings | Installation of Python packages required | Run via Podman containers | Installation of Python packages required |
| LLM Support | Coming Soon |  |  |  |  |
| CNN Support | - |  |  |  |  |
| Model Optimizations | Microsoft Olive | TensorRT-Model Optimizer | Llama.cpp | Various | - |
| Python |  |  |  |  |  |
| C/C++ | - |  |  |  |  |
| C#/.NET | - | - | - |  |  |
| JavaScript | - | - |  |  |  |
Latest NVIDIA News
Stay up to date on how to power your AI apps with NVIDIA RTX PCs.