NVIDIA AI Inference Software
Fast and scalable AI in every application.
What is AI Inference?

There's increasing demand for sophisticated AI-enabled services like image and speech recognition, natural language processing, visual search, image and text generation, and personalized recommendations. Inference is the process of running a trained AI model to make predictions on new data; it is the production phase of AI. To be successful, inference must be easy to deploy, run, and scale, and it must meet the application's performance requirements. NVIDIA's inference software delivers the performance, efficiency, and responsiveness critical to powering the next generation of AI products and services in the cloud, in the data center, at the network's edge, and in embedded devices.
How Does NVIDIA AI Inference Work?
NVIDIA Triton Inference Server can be used to deploy, run, and scale trained models from all major frameworks (TensorFlow, PyTorch, XGBoost, and others) in the cloud, in on-premises data centers, at the edge, or on embedded devices. NVIDIA TensorRT is an optimizing compiler and runtime that applies techniques such as quantization, layer fusion, and kernel tuning to a trained deep learning model to deliver order-of-magnitude performance improvements. NVIDIA AI inference supports models of all sizes and scales, for use cases such as speech AI, natural language processing (NLP), computer vision, generative AI, recommenders, and more.
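To make the deployment model concrete, here is a minimal sketch of a Triton model repository for an ONNX model, written as a short Python script. The repository path, model name, and tensor names ("resnet50_onnx", "input", "output") are hypothetical placeholders, not values taken from this page.

from pathlib import Path

# Hypothetical example: a Triton model repository holding one ONNX model.
# Layout: model_repository/resnet50_onnx/config.pbtxt and .../1/model.onnx
repo = Path("model_repository") / "resnet50_onnx"
(repo / "1").mkdir(parents=True, exist_ok=True)  # "1" is the model version directory

# Minimal model configuration in Triton's config.pbtxt (protobuf text) format.
(repo / "config.pbtxt").write_text(
    'name: "resnet50_onnx"\n'
    'platform: "onnxruntime_onnx"\n'
    'max_batch_size: 8\n'
    'input [ { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] } ]\n'
    'output [ { name: "output", data_type: TYPE_FP32, dims: [ 1000 ] } ]\n'
)

# After copying the exported ONNX file to model_repository/resnet50_onnx/1/model.onnx,
# the server is started against the repository, for example:
#   tritonserver --model-repository=model_repository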
Access this whitepaper to explore the evolving inference landscape, architectural considerations for the optimal inference accelerator, and NVIDIA’s AI platform.
Explore the NVIDIA Inference Solution
Deploy models from all major AI frameworks, including TensorFlow, PyTorch, ONNX, XGBoost, Python, and JAX, as well as custom backends.
Power High Throughput, Low Latency
Deliver high-throughput and low-latency inference across computer vision, speech AI, NLP, recommender systems, and more.
Deploy Anywhere
Deploy, run, and scale optimized AI models consistently in the cloud, on premises, at the edge, and on embedded devices.
Take a Closer Look at NVIDIA AI Inference Software

Triton Inference Server
NVIDIA Triton™ Inference Server is open-source model-serving software that delivers fast and scalable AI in every application. Triton Inference Server lets teams deploy trained AI models and pipelines from any framework (TensorFlow, PyTorch, XGBoost, ONNX, Python, and more) on any GPU- or CPU-based infrastructure. It runs multiple models concurrently on a single GPU to maximize utilization and integrates with Kubernetes for orchestration, metrics, and autoscaling.
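As an illustration of what serving looks like from the client side, the sketch below sends a single inference request to a running Triton server over HTTP using the tritonclient Python package. It assumes a server listening on localhost:8000 and reuses the hypothetical model and tensor names from the repository sketch above.

import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a Triton server assumed to be running on localhost:8000.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request: one FP32 batch of shape [1, 3, 224, 224].
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)
requested_output = httpclient.InferRequestedOutput("output")

# Run inference on the hypothetical "resnet50_onnx" model and read back the result.
response = client.infer(
    model_name="resnet50_onnx",
    inputs=[infer_input],
    outputs=[requested_output],
)
scores = response.as_numpy("output")
print(scores.shape)  # (1, 1000) for the hypothetical classifier above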
TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference that unlocks the power of NVIDIA Tensor Core GPUs. It delivers up to 36X higher throughput than CPU-only platforms while minimizing latency. Using TensorRT, you can start from any deep learning framework and rapidly optimize, validate, and run trained neural networks in production.
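As a rough sketch of that workflow, the snippet below uses the TensorRT Python API to parse a trained model exported to ONNX, enable FP16 optimization, and serialize an engine for deployment. It assumes a TensorRT 8.x/9.x installation and a local model.onnx file; the file names are placeholders.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the trained model that was exported to ONNX.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse model.onnx")

# Builder configuration: allow FP16 kernels where the hardware supports them.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Build the optimized engine and save the serialized plan for deployment
# (for example, in a Triton model repository using the TensorRT backend).
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(bytes(serialized_engine))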

Free Hands-On AI Labs on NVIDIA LaunchPad
Experience NVIDIA Triton and NVIDIA TensorRT through any of the following free hands-on labs on hosted infrastructure:
- Deploy Fraud Detection XGBoost Model With NVIDIA Triton
- Train and Deploy an AI Support Chatbot
- Build AI-Based Cybersecurity Solutions
- Tuning and Deploying a Language Model on NVIDIA H100
- And many more…

NVIDIA AI Enterprise
NVIDIA AI Enterprise is an end-to-end, secure, cloud-native suite of AI software—which includes NVIDIA Triton Inference Server and TensorRT—that streamlines AI development and deployment.
- An integrated and validated platform for enterprise AI
- Speeds time to production with AI workflows and pretrained models while improving efficiency and reducing costs
- Is optimized and certified to deploy everywhere—cloud, data center, edge
- Comes with enterprise-grade support, security, and API stability
Stay current on the latest NVIDIA Triton Inference Server and NVIDIA TensorRT product updates, content, news, and more.