
NVIDIA NIM 1.4 Ready to Deploy with 2.4x Faster Inference

The demand for ready-to-deploy, high-performance inference is growing as generative AI reshapes industries. NVIDIA NIM provides production-ready microservice containers for AI model inference, with enterprise-grade generative AI performance that improves with every release. With the upcoming NIM version 1.4, scheduled for release in early December, request performance improves by up to 2.4x out of the box with the same single-command deployment experience.
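
To make that concrete, here is a minimal sketch of calling a deployed NIM microservice from Python. It assumes a Llama 3.1 8B Instruct NIM container is already running locally and exposing its OpenAI-compatible API on port 8000; the host, port, and model name are placeholders that will vary with your deployment.

```python
import requests

# Assumes a NIM container is already serving its OpenAI-compatible API at this
# address; adjust the host, port, and model name for your own deployment.
NIM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta/llama-3.1-8b-instruct"  # illustrative model name

payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "Summarize what NVIDIA NIM provides in one sentence."}
    ],
    "max_tokens": 128,
}

# Send a single chat completion request and print the generated text.
response = requests.post(NIM_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```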

At the core of NIM are multiple LLM inference engines, including NVIDIA TensorRT-LLM, which enables it to achieve speed-of-light inference performance. With each release, NIM incorporates the latest advancements in kernel optimizations, memory management, and scheduling from these engines to improve performance. 

Figure 1. NVIDIA NIM 1.4 throughput (tokens per second per user) compared to NIM 1.2, showing up to 2.4x faster token generation. Llama 3.1 70B on 2x H100 SXM with 8K input tokens and 256 output tokens; Llama 3.1 8B on 1x H100 SXM with 30K input tokens and 256 output tokens

NIM 1.4 adds significant improvements in kernel efficiency, runtime heuristics, and memory allocation, translating into up to 2.4x faster inference compared to NIM 1.2. These advancements are crucial for businesses that rely on fast responses and high throughput from generative AI applications.
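
To relate these gains to your own workload, the sketch below shows one illustrative way to measure time to first token, end-to-end latency, and approximate per-user token throughput against a running NIM endpoint. The base URL and model name are assumptions, chunk counting only approximates token counts, and this is not NVIDIA's benchmarking methodology.

```python
import time
from openai import OpenAI

# Point the OpenAI-compatible client at a locally running NIM endpoint.
# The base URL and model name are placeholders for this sketch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

start = time.perf_counter()
first_token_time = None
generated_tokens = 0

# Stream a single completion and time the tokens as they arrive.
stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Write a short product description."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        # Counting chunks approximates tokens; exact counts need a tokenizer.
        generated_tokens += 1

end = time.perf_counter()
if first_token_time is not None and generated_tokens > 1:
    print(f"Time to first token: {first_token_time - start:.2f} s")
    print(f"End-to-end latency:  {end - start:.2f} s")
    print(f"Approx. tokens/sec:  {generated_tokens / (end - first_token_time):.1f}")
```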

NIM also benefits from continuous updates to full-stack accelerated computing, which enhances performance and efficiency at every level of the computing stack. This includes support for the latest NVIDIA TensorRT and NVIDIA CUDA versions, further boosting inference performance. NIM users benefit from these continuous improvements without manually updating software.

Figure 2. Request latency for Llama 3.1 8B NIM 1.4 versus Llama 3.1 8B NIM 1.2 across requests-per-second values, showing up to 2x lower latency for NIM 1.4. Running on 1x H100 SXM with 30K input tokens and 256 output tokens

NIM brings together a full suite of preconfigured software to deliver high-performance AI inference with minimal setup, enabling developers to get started quickly.

A continuous innovation loop means that every improvement in TensorRT-LLM, CUDA, and other core accelerated computing technologies immediately benefits NIM users. Updates are seamlessly integrated and delivered through new NIM microservice container releases, eliminating the need for manual configuration and reducing the engineering overhead typically associated with maintaining high-performance inference solutions.

Get started today

NVIDIA NIM is the fastest path to high-performance generative AI without the complexity of traditional model deployment and management. With enterprise-grade reliability and support, plus continuous performance enhancements, NIM makes high-performance AI inference accessible to enterprises. Learn more and get started today.
