NVIDIA AI Inference Software

Fast and scalable AI in every application.

There's an increasing demand for sophisticated AI-enabled services like image and speech recognition, natural language processing, visual search, and personalized recommendations. At the same time, datasets are growing, models are getting more complex, and latency requirements are tightening to meet user expectations across increasingly diverse infrastructure. NVIDIA’s inference software delivers the performance, efficiency, and responsiveness critical to powering the next generation of AI products and services—in the cloud, in the data center, at the network’s edge, and in embedded devices.

How NVIDIA AI Inference Works

[Workflow diagram: how NVIDIA AI inference works]

NVIDIA AI Inference Software

Triton Inference Server simplifies model deployment and increases inference performance

Triton Inference Server

NVIDIA Triton™ Inference Server is an open-source model-serving software that delivers fast and scalable AI in every application. Triton Inference Server lets teams deploy trained AI models from any framework (TensorFlow, PyTorch, XGBoost, ONNX, Python, and more) on any GPU- or CPU-based infrastructure. It runs multiple models concurrently on a single GPU to maximize utilization and integrates with Kubernetes for orchestration, metrics, and auto-scaling.
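As a sketch of how a model is described to Triton, here is a minimal model configuration (`config.pbtxt`) placed in a model repository. The model name, tensor names, and shapes below are hypothetical, not required values; the `instance_group` setting illustrates running two copies of the model concurrently on one GPU:

```protobuf
# config.pbtxt — illustrative Triton model configuration for a
# hypothetical ONNX classifier named "resnet50".
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "input"            # assumed input tensor name
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"           # assumed output tensor name
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Two execution instances on the GPU, so requests for this model
# can be served concurrently to raise utilization.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

Triton loads each model directory it finds in the repository, and Kubernetes-based deployments can scale serving instances using the metrics Triton exports.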

TensorRT accelerates every inference platform

TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference that unlocks the power of NVIDIA Tensor Core GPUs. It delivers up to 36X higher throughput than CPU-only platforms while minimizing latency. With TensorRT, you can start from any deep learning framework and rapidly optimize, validate, and deploy trained neural networks in production.
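One common path is exporting a trained model to ONNX and building an optimized engine with TensorRT's bundled `trtexec` tool. A minimal sketch, assuming a hypothetical `model.onnx` exported from your framework (the file names and precision flag are illustrative):

```shell
# Build a serialized TensorRT engine from an ONNX model.
# --fp16 requests reduced-precision kernels where the hardware supports them.
trtexec --onnx=model.onnx \
        --saveEngine=model.plan \
        --fp16
```

The resulting engine file can then be loaded for inference—for example, by placing it in a Triton model repository, since Triton supports TensorRT engines as a backend.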


Stay current on the latest NVIDIA Triton™ Inference Server and NVIDIA TensorRT™ product updates, content, news, and more.