Get Started with Generative AI Development for Windows PCs with NVIDIA RTX

Generative AI and large language models (LLMs) are changing human-computer interaction as we know it. Many use cases would benefit from running LLMs locally on Windows PCs, including gaming, creativity, productivity, and developer experiences. This post discusses several NVIDIA end-to-end developer tools for creating and deploying both text-based and visual LLM applications on NVIDIA RTX AI-ready PCs.

Developer tools for building text-based generative AI projects

NVIDIA TensorRT-LLM is an open-source large language model (LLM) inference library. It provides an easy-to-use Python API to define LLMs and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. NVIDIA TensorRT-LLM also contains components to create Python and C++ runtimes to run inference with the generated TensorRT engines.

To get started with TensorRT-LLM, visit the NVIDIA/TensorRT-LLM GitHub repo. Check out the TensorRT-LLM for Windows developer environment setup details.

In desktop applications, model quantization is crucial for compatibility with PC GPUs, which often have limited VRAM. TensorRT-LLM facilitates this process through its support for model quantization, enabling models to occupy a smaller memory footprint with the help of the TensorRT-LLM Quantization Toolkit.

To start exploring post-training quantization using TensorRT-LLM Quantization Toolkit, see the TensorRT-LLM Quantization Toolkit Installation Guide on GitHub.

Model compatibility and pre-optimized models

TensorRT-LLM provides the capability to define models through its Python API and is pre-equipped to support a diverse range of LLMs. Quantized model weights are available, specifically optimized for NVIDIA RTX PCs on NVIDIA GPU Cloud (NGC), enabling rapid deployment of these models.

Model Name	Model Location
Llama 2 7B – Int4-AWQ	Download
Llama 2 13B – Int4-AWQ	Download
Code Llama 13B – Int4-AWQ	Download
Mistral 7B – Int4-AWQ	Download

Table 1. Pre-optimized text-based LLMs that run on Windows PC for NVIDIA RTX with the NVIDIA TensorRT-LLM backend

You can also build TensorRT engines for a wide variety of models supported by TensorRT-LLM. Visit TensorRT-LLM/examples on GitHub to see all supported models.

Developer resources and reference applications

Check out these reference projects for more information:

TRT-LLM RAG on Windows: This repository demonstrates a retrieval-augmented generation (RAG) pipeline, using llama_index on Windows with Llama 2 13B – int4, TensorRT-LLM, and FAISS.
OpenAI API Spec Web Server: Drop-in replacement REST API compatible with OpenAI API spec using TensorRT-LLM as the inference backend.

Minimum system requirements

Supported GPU architectures for TensorRT-LLM include NVIDIA Ampere and above, with a minimum of 8GB RAM. It is suggested to use Windows 11 and above, for an optimal experience.

Developer tools for building visual generative AI projects

NVIDIA TensorRT SDK is a high-performance deep learning inference optimizer. It provides layer fusion, precision calibration, kernel auto-tuning, and other capabilities that significantly boost the efficiency and speed of deep learning models. This makes it indispensable for real-time applications and resource-intensive models like Stable Diffusion, substantially accelerating performance. Get started with NVIDIA TensorRT.

For broader guidance on how to integrate TensorRT into your applications, see Getting Started with NVIDIA AI for Your Applications. Learn how to profile your pipeline to pinpoint where optimization is critical and where minor changes can have a significant impact. Accelerate your AI pipeline by choosing a machine learning framework, and discover SDKs for video, graphic design, photography, and audio.

Developer demos and reference applications

Check out these resources for more information:

How to Optimize Models like Stable Diffusion with TensorRT: This demo notebook showcases the acceleration of Stable Diffusion inference pipeline using TensorRT through Hugging Face.
Example TRT Pipeline for Stable Diffusion: An example of how TensorRT can be used to accelerate the text-to-image Stable Diffusion inference pipeline.
TensorRT Extension for Stable Diffusion Web UI: A working example of TensorRT accelerating the most popular Stable Diffusion web UI.

Summary

Use the resources in this post to easily add generative AI capabilities to applications powered by the existing installed base of 100 million NVIDIA RTX PCs.

Share what you develop with the NVIDIA developer community by entering the NVIDIA Generative AI on NVIDIA RTX Developer Contest for a chance to win a GeForce RTX 4090 GPU, a full in-person NVIDIA GTC conference pass, and more.

Get Started with Generative AI Development for Windows PCs with NVIDIA RTX

Developer tools for building text-based generative AI projects

Model compatibility and pre-optimized models

Developer resources and reference applications

Minimum system requirements

Developer tools for building visual generative AI projects

Developer demos and reference applications

Summary

Related resources

Tags

About the Authors

Get Started with Generative AI Development for Windows PCs with NVIDIA RTX

Developer tools for building text-based generative AI projects

Model compatibility and pre-optimized models

Developer resources and reference applications

Minimum system requirements

Developer tools for building visual generative AI projects

Developer demos and reference applications

Summary

Related resources

Tags

About the Authors

Comments

Related posts

Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton

Supercharging LLM Applications on Windows PCs with NVIDIA RTX Systems

Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available

NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs

Related posts

Mistral Large and Mixtral 8x22B LLMs Now Powered by NVIDIA NIM and NVIDIA API

Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Tune and Deploy LoRA LLMs with NVIDIA TensorRT-LLM

NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records

NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma