Production Deep Learning Inference with TensorRT Inference Server

In the video below, watch how TensorRT Inference Server can improve deep learning inference performance and production data center utilization.

TensorRT Inference Server:

Simplifies deploying AI inference
Maximizes GPU utilization with concurrent execution of AI models
Increases inference throughput and scales to peak loads

Whether it’s performing object detection in images or video, recommending restaurants, or translating the spoken word, inference is the mechanism that allows applications to derive valuable information from trained AI models. However, many inference solutions are one-off designs that lack the performance and flexibility to be seamlessly deployed in modern production data center environments.

NVIDIA TensorRT Inference Server lets you simplify the deployment of inference applications in data centers.

Delivered as a ready-to-deploy container from NGC and as an open source project, TensorRT Inference Server is a microservice that enables applications to use AI models in data center production. It supports top AI frameworks as well as custom backends, and it maximizes utilization by running multiple models concurrently on each GPU and across multiple GPUs, using dynamic request batching.
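
To give a rough feel for the dynamic request batching idea mentioned above, here is a minimal Python sketch. It is a conceptual illustration only, not TensorRT Inference Server code or its API: the function name `form_batches`, the `max_batch_size` parameter, and the request labels are all hypothetical. The assumption is that individual requests waiting in a queue get grouped into batches no larger than a configured maximum, so that a model can process several requests in a single pass on the GPU.

```python
from collections import deque
from typing import Deque, List


def form_batches(pending: Deque[str], max_batch_size: int) -> List[List[str]]:
    """Drain a queue of pending requests into batches (hypothetical helper).

    Each batch holds at most max_batch_size requests, in arrival order.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")

    batches: List[List[str]] = []
    while pending:
        # Take as many waiting requests as fit in one batch.
        take = min(max_batch_size, len(pending))
        batches.append([pending.popleft() for _ in range(take)])
    return batches


# Seven illustrative requests arrive before the model is free to run.
queue = deque(f"req-{i}" for i in range(7))
batches = form_batches(queue, max_batch_size=3)
print([len(batch) for batch in batches])
```

A real server would also have to decide how long to wait for more requests before running a partially filled batch; this sketch leaves out that trade-off between latency and throughput.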

TensorRT Inference Server also seamlessly supports Kubernetes with health and latency metrics, and integrates with Kubeflow for simplified deployment.