How AI Inference Works
AI inference follows training, also known as induction. Training is the process of creating a model by fitting an algorithm, such as a neural network, to labeled data. The model learns to predict expected outcomes by generalizing patterns in the labeled training data. The model is then tested and validated on unseen data to ensure its quality. Once the model passes testing, it can be used in production for inference.

Inference is the process of providing unlabeled data to a trained model, which returns a prediction or label for that input. Inference powers many types of applications, such as large language models (LLMs), forecasting, and predictive analytics. At its core, all neural network inference is inputting numbers and outputting numbers; the processing of data before and after the model runs is what differentiates the types of inference. For example, in an LLM, a prompt has to be turned into numbers for the input, and the output numbers have to be turned back into words.
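The sketch below illustrates that numbers-in, numbers-out view of LLM inference, with tokenization as the pre-processing step and decoding as the post-processing step. It assumes the Hugging Face transformers library and the publicly available gpt2 checkpoint, which are illustrative choices rather than anything prescribed by this page; any causal language model would behave the same way.

```python
# Minimal sketch of LLM inference: pre-processing turns words into numbers (token IDs),
# the model maps numbers to numbers, and post-processing turns the output numbers back
# into words. Assumes the "transformers" library and the public "gpt2" checkpoint.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "AI inference is"
inputs = tokenizer(prompt, return_tensors="pt")            # pre-processing: words -> numbers
print(inputs["input_ids"])                                 # token IDs the model actually sees

output_ids = model.generate(**inputs, max_new_tokens=20)   # inference: numbers in, numbers out
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # post-processing: numbers -> words
```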
Explore AI Inference Software, Tools, and Technologies
NVIDIA NIM
NVIDIA NIM™ provides easy-to-use microservices for secure, reliable deployment of high-performance AI inferencing across clouds, data centers, and workstations.
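As a minimal sketch of how an application might call a NIM microservice: NIM exposes an OpenAI-compatible API, so a standard OpenAI client can send requests to it. The local endpoint URL and the meta/llama3-8b-instruct model name below are assumptions for illustration; they depend on which NIM container you deploy and where it runs.

```python
# Hedged sketch: query a locally running NIM microservice through its
# OpenAI-compatible endpoint. The URL and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is AI inference?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```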
NVIDIA Triton Inference Server
Use NVIDIA Triton Inference Server™ to consolidate custom AI model-serving infrastructure, boost AI inferencing and prediction performance, and simplify the creation of custom AI pipelines with pre- and post-processing steps and business logic.
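To make the model-serving idea concrete, here is a hedged sketch of a client sending an inference request to a running Triton Inference Server using its Python HTTP client. The server address, the model name "my_model", and the tensor names "INPUT0"/"OUTPUT0" with their shapes are illustrative assumptions; they must match whatever model configuration the server actually hosts.

```python
# Hedged sketch: send one inference request to a Triton Inference Server
# assumed to be reachable at localhost:8000 and serving a model named
# "my_model" with an FP32 input "INPUT0" and an output "OUTPUT0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 4).astype(np.float32)             # example input tensor
infer_input = httpclient.InferInput("INPUT0", data.shape, "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))                           # model prediction as a NumPy array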
NVIDIA TensorRT
NVIDIA TensorRT™ includes an inference runtime and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes TensorRT, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.
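The following is a hedged sketch of the TensorRT build step: parsing an ONNX model and producing an optimized, serialized engine for low-latency inference. The file name "model.onnx" is an assumption, and the exact API details vary between TensorRT versions; this follows the TensorRT 8.x-style Python API.

```python
# Hedged sketch: build a TensorRT engine from an ONNX file and enable FP16.
# "model.onnx" is an assumed input; API details vary by TensorRT version.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # allow reduced precision for lower latency

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)               # serialized engine, loadable by the TensorRT runtime
```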