What is AI Inference?
AI inference is the process of using a trained model to make predictions on new, previously unseen data. During training, an AI model learns the patterns and relationships that enable it to generalize to new data. During inference, the model applies that learned knowledge to produce accurate predictions or generate outputs such as images, text, or video. This capability allows businesses to make data-driven decisions, optimize processes, and deliver personalized experiences for internal and external customers.
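In code, the distinction is simply that the model's learned parameters are loaded and applied, never updated. A minimal sketch in PyTorch (the model file name and input shape are illustrative placeholders, not from any specific product):

```python
import torch

# Load a model whose weights were already learned during training
# ("model.pt" is a hypothetical TorchScript file).
model = torch.jit.load("model.pt")
model.eval()  # disable training-only behavior such as dropout

with torch.no_grad():  # inference needs no gradient computation
    new_data = torch.randn(1, 3, 224, 224)  # stand-in for unseen input
    prediction = model(new_data)
```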
Next-Generation AI Inference With NVIDIA AI Software
There's an increasing demand for sophisticated AI-enabled services like image and speech recognition, natural language processing, image and text generation, and personalized recommendations. All of them depend on inference: running trained AI models to make predictions in production. NVIDIA's inference software delivers the performance and efficiency critical to powering the next generation of AI products and services.
How Does NVIDIA AI Inference Work?
NVIDIA AI Enterprise, an enterprise-grade AI software platform built for production inference, consists of key NVIDIA inference technologies and tools. NVIDIA AI inference supports models of all sizes and scales for different use cases such as speech AI, natural language processing (NLP), computer vision, generative AI, recommenders, and more.
NVIDIA TensorRT is an optimization compiler and runtime that uses multiple techniques like quantization, fusion, and kernel tuning to optimize trained deep learning models. NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance on the latest LLMs on NVIDIA GPUs. NVIDIA Triton Inference Server™ can be used to deploy, run, and scale trained models from all major frameworks on the cloud, in on-prem data centers, at the edge, or on embedded devices.
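As a concrete illustration, the typical TensorRT workflow parses a trained model (here, an ONNX file) and compiles it into an optimized engine. The sketch below assumes a TensorRT 8.x/9.x Python installation and a hypothetical model.onnx; it outlines the pattern rather than reproducing NVIDIA's reference code:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the trained model into TensorRT's network representation.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

# Request reduced precision; layer fusion and kernel tuning are
# applied automatically during the build.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```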
Discover the modern landscape of AI inference, production use cases from companies, and real-world challenges and solutions.
Explore the NVIDIA Inference Solution
Use Multiple Frameworks
Deploy models from all major AI frameworks, including TensorFlow, PyTorch, ONNX, XGBoost, and JAX, as well as pure Python or custom backends.
Power High Throughput, Low Latency
Deliver high-throughput, low-latency inference across computer vision, speech AI, NLP, recommender systems, and more.
Deploy Anywhere
Deploy, run, and scale optimized AI models consistently in the cloud, in on-prem data centers, at the edge, and on embedded devices, as sketched in the client example below.
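From the client's perspective, all of these deployments look the same. A minimal sketch of a Triton HTTP inference request using the tritonclient package; the model name and tensor names ("resnet50", "input__0", "output__0") are hypothetical placeholders:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server (local, cloud, or edge; only the URL changes).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor and attach data.
inp = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
out = httpclient.InferRequestedOutput("output__0")

# Run inference and read back the result as a NumPy array.
result = client.infer(model_name="resnet50", inputs=[inp], outputs=[out])
predictions = result.as_numpy("output__0")
print(predictions.shape)
```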
NVIDIA AI Inference Software
NVIDIA AI Enterprise is an end-to-end AI software platform consisting of NVIDIA TensorRT, NVIDIA TensorRT-LLM, NVIDIA Triton Inference Server, and other tools to simplify building, sharing, and deploying AI applications. With enterprise-grade support, stability, manageability, and security, enterprises can accelerate time to value while eliminating unplanned downtime.
NVIDIA Triton Inference Server
NVIDIA Triton Inference Server is open-source inference serving software that standardizes the deployment and execution of AI models from all major frameworks in production, on any GPU- or CPU-based infrastructure.
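Because Triton exposes standard HTTP/gRPC endpoints, the same health and model-management calls work regardless of which framework produced a model. A short sketch with the tritonclient Python package (the server address and model name are placeholders):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Standardized health endpoints, independent of model framework.
assert client.is_server_live()
assert client.is_server_ready()

# List every model in the repository, whatever framework produced it.
for model in client.get_model_repository_index():
    print(model["name"], model.get("state"))

# Check readiness of a specific (hypothetical) model.
print(client.is_model_ready("resnet50"))
```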
NVIDIA TensorRT
NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for inference applications. TensorRT engines can be deployed, run, and scaled with Triton.
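The runtime side of the SDK loads a previously built engine and executes it. A sketch of the loading step (device buffer allocation and the execute call are elided, since they depend on the CUDA bindings and TensorRT version in use):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize an engine produced by the build step sketched earlier.
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
# From here, allocate device buffers (e.g., with cuda-python) and call
# context.execute_v2(bindings) to run inference; details vary by version.
```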
NVIDIA TensorRT-LLM
TensorRT-LLM is an open-source library for defining, optimizing, and executing large language models (LLMs) for inference in production. It maintains the core functionality of FasterTransformer, paired with TensorRT's deep learning compiler, in an open-source Python API to quickly support new models and customizations.
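Recent TensorRT-LLM releases also ship a high-level Python LLM API that wraps engine building and text generation. A minimal sketch, assuming such a release and a Hugging Face model ID chosen purely for illustration:

```python
from tensorrt_llm import LLM, SamplingParams

# Building the optimized TensorRT engine happens under the hood on load.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model ID

params = SamplingParams(max_tokens=64, temperature=0.8)
for output in llm.generate(["What is AI inference?"], params):
    print(output.outputs[0].text)
```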
Stay current on the latest NVIDIA AI inference software product updates, content, news, and more.