NVIDIA AI Inference Software

Fast and scalable AI in every application.


What is AI Inference?

NVIDIA’s inference software delivers fast and scalable AI in every application

There's an increasing demand for sophisticated AI-enabled services like image and speech recognition, natural language processing, visual search, image and text generation, and personalized recommendations. Inference is running a trained AI model to make predictions. It is the production phase of AI, and successful inference must be easy to deploy, run, and scale while meeting the application's performance requirements. NVIDIA’s inference software delivers the performance, efficiency, and responsiveness critical to powering the next generation of AI products and services—in the cloud, in the data center, at the network’s edge, and in embedded devices.

How does NVIDIA AI Inference Work?

A workflow diagram showing how NVIDIA AI Inference works

NVIDIA Triton Inference Server can be used to deploy, run, and scale trained models from all major frameworks (TensorFlow, PyTorch, XGBoost, and others) in the cloud, in on-prem data centers, at the edge, or on embedded devices. NVIDIA TensorRT is an optimizing compiler and runtime that applies techniques such as quantization, layer fusion, and kernel tuning to a trained deep learning model, delivering order-of-magnitude performance improvements. NVIDIA AI inference supports models of all sizes and scales, across use cases such as speech AI, natural language processing (NLP), computer vision, generative AI, recommenders, and more.
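As a concrete sketch of the deployment step: Triton loads models from a model repository with a fixed directory layout, where each model has its own directory containing a `config.pbtxt` and numbered version subdirectories (e.g., `1/model.onnx`). The model name, backend, and tensor shapes below are illustrative assumptions, not taken from this page:

```
# model_repository/resnet50/config.pbtxt (illustrative sketch)
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "output", data_type: TYPE_FP32, dims: [ 1000 ] }
]
```

Starting the server then amounts to pointing it at the repository, e.g. `tritonserver --model-repository=/path/to/model_repository`.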

Access this whitepaper to explore the evolving inference landscape, architectural considerations for the optimal inference accelerator, and NVIDIA’s AI platform.

Explore the NVIDIA Inference Solution

Use Multiple Frameworks

Deploy models from all major AI frameworks, including TensorFlow, PyTorch, ONNX, XGBoost, Python, JAX, and even custom backends.


Power High Throughput, Low Latency

Deliver high-throughput and low-latency inference across computer vision, speech AI, NLP, recommender systems, and more.


Deploy Anywhere

Deploy, run, and scale optimized AI models consistently in the cloud, on-prem, at the edge, and on embedded devices.

Take a Closer Look at NVIDIA AI Inference Software


Triton Inference Server

NVIDIA Triton™ Inference Server is open-source model-serving software that delivers fast and scalable AI in every application. Triton Inference Server lets teams deploy trained AI models and pipelines from any framework (TensorFlow, PyTorch, XGBoost, ONNX, Python, and more) on any GPU- or CPU-based infrastructure. It runs multiple models concurrently on a single GPU to maximize utilization and integrates with Kubernetes for orchestration, metrics, and auto-scaling.
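Concurrent execution is configured per model in its `config.pbtxt`. A hedged sketch (field values are illustrative) that runs two instances of a model on GPU 0 and batches queued requests dynamically:

```
# Fragment appended to a model's config.pbtxt (illustrative values)
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Running multiple instances lets the scheduler overlap requests on one GPU, while dynamic batching trades a small queueing delay for larger, more efficient batches.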

Learn More
TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference that unlocks the power of NVIDIA Tensor Core GPUs. It delivers up to 36X higher throughput than CPU-only platforms while minimizing latency. Using TensorRT, you can start from any deep learning framework and rapidly optimize, validate, and run trained neural networks in production.
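Of the optimizations mentioned above, quantization is the easiest to illustrate. The toy sketch below shows the core arithmetic of symmetric INT8 quantization in plain Python; it is not TensorRT's implementation, which calibrates per-tensor scales from representative data and runs fused INT8 kernels on the GPU:

```python
# Toy sketch of symmetric INT8 quantization (illustrative only; TensorRT
# calibrates scales from real data rather than a simple max-abs rule).

def quantize_int8(values):
    """Map floats to int8 codes using a symmetric max-abs scale."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.5, 0.9]
codes, scale = quantize_int8(weights)
# codes == [2, -127, 50, 90]; each weight is now 1 byte instead of 4
approx = dequantize(codes, scale)
```

Storing 8-bit codes plus one scale per tensor cuts memory traffic roughly 4x versus FP32, which is a large part of where quantized inference speedups come from.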

Learn More


Free Hands-On AI Labs on NVIDIA LaunchPad

Experience NVIDIA Triton and NVIDIA TensorRT through any of the following free hands-on labs on hosted infrastructure:

  • Deploy Fraud Detection XGBoost Model With NVIDIA Triton
  • Train and Deploy an AI Support Chatbot
  • Build AI-Based Cybersecurity Solutions
  • Tuning and Deploying a Language Model on NVIDIA H100
  • And many more…
Learn More
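As an example of what the first lab involves: gradient-boosted tree models such as XGBoost are served through Triton's FIL (Forest Inference Library) backend. A hedged configuration sketch, where the model name, feature count, and shapes are illustrative assumptions:

```
# model_repository/fraud_xgb/config.pbtxt (illustrative sketch)
name: "fraud_xgb"
backend: "fil"
max_batch_size: 4096
input [
  { name: "input__0", data_type: TYPE_FP32, dims: [ 16 ] }
]
output [
  { name: "output__0", data_type: TYPE_FP32, dims: [ 1 ] }
]
parameters [
  { key: "model_type", value: { string_value: "xgboost_json" } }
]
```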

NVIDIA AI Enterprise

NVIDIA AI Enterprise is an end-to-end, secure, cloud-native suite of AI software—which includes NVIDIA Triton Inference Server and TensorRT—that streamlines AI development and deployment.

  • An integrated and validated platform for enterprise AI
  • Speeds time to production with AI workflows and pre-trained models, improving efficiency and reducing costs
  • Is optimized and certified to deploy everywhere—cloud, data center, edge
  • Comes with enterprise-grade support, security, and API stability

Learn More

Stay current on the latest NVIDIA Triton Inference Server and NVIDIA TensorRT product updates, content, news, and more.

Sign Up