AI Models
Explore and deploy top AI models built by the community, accelerated by NVIDIA’s AI inference platform, and run on NVIDIA-accelerated infrastructure.
DeepSeek
DeepSeek is a family of open-source models, several of which use a mixture-of-experts (MoE) architecture to deliver advanced reasoning capabilities. DeepSeek models can be optimized for data center deployments with TensorRT-LLM. You can try the models for yourself with NIM or customize them with the open-source NeMo framework.
Explore
Explore sample applications to learn about different use cases for DeepSeek models.
Integrate
Get started with the right tools and frameworks for your development environment.
Optimize
Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using DeepSeek in TensorRT-LLM.
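For orientation, here is a minimal sketch of TensorRT-LLM's high-level LLM API. The distilled DeepSeek-R1 checkpoint is an illustrative stand-in (the full MoE model needs a multi-GPU deployment recipe), and exact arguments can vary by release:

```python
# pip install tensorrt_llm  (a minimal sketch of the high-level LLM API)
from tensorrt_llm import LLM, SamplingParams

# A distilled DeepSeek-R1 checkpoint keeps this runnable on a single GPU.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

prompts = ["Why does attention scale quadratically with sequence length?"]
for output in llm.generate(prompts, SamplingParams(max_tokens=256, temperature=0.6)):
    print(output.outputs[0].text)
```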
Quantize DeepSeek R1 to FP4 With TensorRT Model Optimizer
TensorRT Model Optimizer now has an experimental feature to deploy to vLLM. Check out the workflow.
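Below is a minimal post-training-quantization sketch using Model Optimizer's PyTorch API. The distilled checkpoint, the calibration prompts, and the NVFP4_DEFAULT_CFG config name are assumptions that may vary by release; the published FP4 recipe for the full DeepSeek R1 is more involved:

```python
# pip install nvidia-modelopt  (a minimal PTQ sketch, not the full DeepSeek R1 recipe)
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small distilled checkpoint keeps this runnable on one GPU; quantizing the
# full 671B MoE model follows a multi-GPU recipe in the Model Optimizer docs.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration pass: run a handful of representative prompts through the model.
    for prompt in ["What is 17 * 23?", "Summarize the CUDA programming model."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# NVFP4_DEFAULT_CFG is the FP4 config name in recent releases; verify it
# against your installed version before relying on it.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```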
Get started with the models for your development environment.
Get Production-Ready DeepSeek Models With NVIDIA NIM
Rapid prototyping is just an API call away.
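For example, a hosted DeepSeek-R1 endpoint on the NVIDIA API Catalog can be called through the OpenAI-compatible client; the API key placeholder below is generated at build.nvidia.com:

```python
# pip install openai  (the API Catalog exposes an OpenAI-compatible endpoint)
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<NVIDIA_API_KEY>",  # generate a key at build.nvidia.com
)

completion = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    temperature=0.6,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```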
NVIDIA DeepSeek R1 FP4
NVIDIA DeepSeek R1 FP4 is a quantized version of DeepSeek R1, an autoregressive language model that uses an optimized transformer architecture. The model is quantized with TensorRT Model Optimizer.
DeepSeek on Ollama
Ollama lets you deploy DeepSeek quickly to all your GPUs.
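A minimal sketch with the ollama Python client, assuming the deepseek-r1 model tag has already been pulled locally:

```python
# pip install ollama  (assumes `ollama pull deepseek-r1` has been run first)
import ollama

response = ollama.chat(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Explain mixture-of-experts routing briefly."}],
)
print(response["message"]["content"])
```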
Gemma
Gemma is Google DeepMind’s family of lightweight, open models. Gemma models span a variety of sizes and specialized domains to meet each developer's unique needs. NVIDIA has worked with Google to enable these models to run optimally across NVIDIA platforms, ensuring you get maximum performance on your hardware, from data center GPUs based on the NVIDIA Blackwell and NVIDIA Hopper architectures to Windows RTX and Jetson devices. Enterprise customers can deploy optimized containers using NVIDIA NIM microservices for production-grade support and customize using the end-to-end NeMo framework. With the latest release of Gemma 3n, these models are now natively multilingual and multimodal for your text, image, video, and audio data.
Explore
Explore sample applications to learn about different use cases for Gemma models.
Integrate
Use Gemma on your devices and make it your own.
Read the Blog: Run Google DeepMind’s Gemma 3n on NVIDIA Jetson and RTX
Optimize
Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using Gemma in TensorRT-LLM.
Read the Blog: NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma
Get started with the models for your development environment.
Get Started With Gemma Models With NVIDIA NIM
Gemma 3 is now featured on the NVIDIA API Catalog, enabling rapid prototyping with just an API call.
Gemma 3 Models on Ollama
Ollama lets you start experimenting in seconds with the most capable Gemma model that runs on a single NVIDIA H100 Tensor Core GPU.
Gemma-2b-it ONNX INT4
The Gemma-2b-it ONNX INT4 model is quantized with TensorRT Model Optimizer. Easily fine-tune and adapt the model to your unique requirements with Hugging Face’s Transformers library or your preferred development environment.
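As a starting point, here is a hedged sketch of loading the base instruction-tuned Gemma checkpoint with Transformers. The INT4 ONNX artifact itself is deployed through ONNX runtimes; fine-tuning and adaptation typically begin from the base weights assumed below:

```python
# pip install transformers accelerate  (loads the base instruction-tuned checkpoint)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Gemma's chat template wraps the prompt in the expected turn markers.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about GPUs."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```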
gpt-oss
NVIDIA and OpenAI began pushing the boundaries of AI with the launch of NVIDIA DGX™ back in 2016. That collaborative AI innovation continues with the launch of OpenAI's gpt-oss-20b and gpt-oss-120b. NVIDIA has optimized both new open-weight models for accelerated inference performance on the NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second (TPS) on an NVIDIA GB200 NVL72 system.
Explore
Explore open models and samples to learn about different use cases for NVIDIA-optimized gpt-oss models.
NVIDIA Launchable: Optimizing Inference With NVIDIA TensorRT-LLM
Integrate
Get started with the right tools and frameworks for your development environment, leveraging open gpt-oss models.
Optimize
NVIDIA has optimized both new open-weight models for accelerated inference performance on the NVIDIA Blackwell architecture.
Get started with the models for your development environment.
Explore gpt-oss models on Hugging Face
NVIDIA worked across several top open-source frameworks, including Hugging Face Transformers, Ollama, and vLLM, in addition to NVIDIA TensorRT-LLM, contributing optimized kernels and model enhancements.
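A minimal Transformers sketch, following the pattern on the public gpt-oss model card; argument names and the chat-style pipeline output may shift across Transformers releases:

```python
# pip install -U transformers accelerate  (gpt-oss requires a recent release)
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",  # places layers across available GPUs
)

messages = [{"role": "user", "content": "Explain CUDA graphs in two sentences."}]
result = pipe(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # last turn is the model's reply
```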
Explore gpt-oss on Ollama
Developers can experience these models through their favorite apps and SDKs using Ollama, Llama.cpp, or Microsoft AI Foundry Local.
Llama
Llama is Meta’s collection of open foundation models, most recently made multimodal with the 2025 release of Llama 4. NVIDIA worked with Meta to advance inference of these models with NVIDIA TensorRT™-LLM (TRT-LLM), getting maximum performance from data center GPUs based on the NVIDIA Blackwell and NVIDIA Hopper™ architectures. Optimized versions of several Llama models are available as NVIDIA NIM™ microservices for an easy-to-deploy experience. You can also customize Llama with your own data using the end-to-end NVIDIA NeMo™ framework.
Explore
Explore sample applications to learn about different use cases for Llama models.
Integrate
Get started with the right tools and frameworks for your AI model development environment.
Optimize
Optimize inference workloads for large language models (LLMs) with TensorRT-LLM. Learn how to set up and get started using Llama in TRT-LLM.
Get started with the models for your development environment.
Get Production-Ready Llama Models With NVIDIA NIM
The NVIDIA API Catalog enables rapid prototyping with just an API call.
Llama 4 on Ollama
Ollama enables you to deploy Llama 4 quickly to all your GPUs.
Quantized Llama 3.1 8B on Hugging Face
NVIDIA Llama 3.1 8B Instruct is optimized by quantization to FP8 using the open-source TensorRT Model Optimizer library and is compatible with data center GPUs and consumer devices.
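A hedged vLLM sketch for serving this checkpoint; it assumes vLLM's ModelOpt quantization backend accepts the FP8 weights, which may depend on your vLLM version:

```python
# pip install vllm  (assumes the ModelOpt quantization backend supports this checkpoint)
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")

outputs = llm.generate(
    ["List three benefits of FP8 quantization."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```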
NVIDIA Nemotron
The NVIDIA Nemotron™ family of open models, including Llama Nemotron, excels at reasoning along with a diverse set of agentic tasks. The models are optimized for different use cases: Nano offers cost-efficiency, Super balances accuracy and compute, and Ultra delivers maximum accuracy. With an open license, these models ensure commercial viability and data control.
Explore
Explore models, datasets, and sample applications to learn about different use cases for Nemotron models.
Integrate
Get started with the right tools and frameworks for your development environment, leveraging open Nemotron models and datasets for agentic AI.
Optimize
Optimize Nemotron with NVIDIA NeMo and build AI agents with NVIDIA NIM and NVIDIA Blueprints with customizable reference workflows.
Get started with the models for your development environment.
Nemotron Nano
Provides superior accuracy for PC and edge devices.
The newly announced Nemotron Nano 2 supports a configurable thinking budget, enabling enterprises to control token generation to reduce cost and deploy optimized agents on edge devices.
Llama Nemotron Super
Offers the highest accuracy and throughput on a single NVIDIA H100 Tensor Core GPU.
With FP4 precision, Llama Nemotron Super 1.5 is optimized for the NVIDIA Blackwell architecture using the NVFP4 format, delivering up to 6x higher throughput on NVIDIA B200 compared with FP8 on NVIDIA H100. A reasoning-control API sketch follows this list.
Llama Nemotron Ultra
Delivers the leading agentic AI accuracy for complex systems, optimized for multi-GPU data centers.
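As an illustration of the family's reasoning controls, the sketch below calls the Llama Nemotron Super endpoint on the NVIDIA API Catalog and enables reasoning through the system-prompt toggle documented on the model cards. The model id and the toggle string are assumptions to verify against the current catalog:

```python
# pip install openai  (reasoning is toggled via a documented system prompt)
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="<NVIDIA_API_KEY>")

completion = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=[
        {"role": "system", "content": "detailed thinking on"},  # or "detailed thinking off"
        {"role": "user", "content": "Plan the steps for a retrieval-augmented agent."},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```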
Phi
Microsoft Phi is a family of small language models (SLMs) that provide efficient performance for commercial and research tasks. These models are trained on high-quality data and excel at mathematical reasoning, code generation, advanced reasoning, summarization, long-document QA, and information retrieval. Due to their small size, Phi models can be deployed in single-GPU environments, such as Windows RTX PCs and Jetson devices. With the launch of the Phi-4 series, Phi has expanded to include advanced reasoning and multimodality.
Explore
Explore sample applications to learn about different use cases for Phi models.
Integrate
Get started with the right tools and frameworks for your development environment.
Optimize
Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using Phi in TRT-LLM.
Get started with the models for your development environment.
Get Production-Ready Phi Models With NVIDIA NIM
The NVIDIA API Catalog enables rapid prototyping with just an API call.
Phi on Ollama
Ollama lets you deploy Phi quickly to all your GPUs.
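A minimal sketch against Ollama's local REST API; the phi3.5 model tag is an assumption to check against the Ollama library:

```python
# Plain REST call to a local Ollama server (default port 11434);
# assumes `ollama pull phi3.5` has been run first.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "phi3.5",
        "messages": [{"role": "user", "content": "Solve for x: 2x + 6 = 20."}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```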
Phi-3.5-mini-Instruct INT4 ONNX
The Phi-3.5-mini-Instruct INT4 ONNX model is the quantized version of the Microsoft Phi-3.5-mini-Instruct model, which has 3.8 billion parameters.
Qwen
Alibaba recently released Tongyi Qwen3, a family of open-source hybrid-reasoning large language models (LLMs). The Qwen3 family consists of two MoE models, 235B-A22B (235B total parameters, 22B active) and 30B-A3B, and six dense models: 0.6B, 1.7B, 4B, 8B, 14B, and 32B. With ultra-fast token generation, developers can efficiently integrate and deploy Qwen3 models into production applications on NVIDIA GPUs using frameworks such as NVIDIA TensorRT-LLM, Ollama, SGLang, and vLLM.
Explore
Explore sample applications to learn about different use cases for Qwen models.
Integrate
Get started with the right tools and frameworks for your development environment.
Optimize
Optimize inference workloads for LLMs with TensorRT-LLM. Learn how to set up and get started using Qwen in TRT-LLM.
Get started with the models for your development environment.
Qwen Models on NVIDIA API Catalog
Try out these powerful hybrid-reasoning models, which achieve significantly enhanced performance on downstream tasks, especially hard problems.
NVIDIA NeMo canary-qwen-2.5b
NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks.
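A heavily hedged loading sketch: Canary-Qwen-2.5B is a speech-language model, so the generic ASRModel entry point below is an assumption, and the model card's documented loading class may differ; the audio path is a placeholder:

```python
# pip install -U "nemo_toolkit[asr]"  (loading class is an assumption; check the model card)
import nemo.collections.asr as nemo_asr

# Assumption: the checkpoint resolves through NeMo's generic from_pretrained entry point.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-qwen-2.5b")

# "speech.wav" is a placeholder 16 kHz mono English recording.
print(model.transcribe(["speech.wav"])[0])
```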
Qwen on Ollama
Ollama enables you to deploy a variety of Qwen models quickly to all your NVIDIA GPUs. Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models.
More Resources
Ethical AI
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using a model in accordance with our terms of service, developers should work with their supporting model team to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI concerns here.
Try top community models today.