
NVIDIA NIM 1.4 Ready to Deploy with 2.4x Faster Inference

The demand for ready-to-deploy, high-performance inference is growing as generative AI reshapes industries. NVIDIA NIM provides production-ready microservice containers for AI model inference, with enterprise-grade generative AI performance that improves with every release. With the upcoming NIM version 1.4, scheduled for release in early December, request performance improves by up to 2.4x out of the box with the same single-command deployment experience.
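
To make that concrete, here is a minimal sketch of calling a deployed NIM microservice from Python. It assumes a Llama 3.1 8B Instruct NIM container is already running locally and exposing its OpenAI-compatible API on port 8000; the host, port, and model name are placeholders that will vary with your deployment.

```python
import requests

# Assumes a NIM container is already serving its OpenAI-compatible API at this
# address; adjust the host, port, and model name for your own deployment.
NIM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta/llama-3.1-8b-instruct"  # illustrative model name

payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "Summarize what NVIDIA NIM provides in one sentence."}
    ],
    "max_tokens": 128,
}

# Send a single chat completion request and print the generated text.
response = requests.post(NIM_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```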

At the core of NIM are multiple LLM inference engines, including NVIDIA TensorRT-LLM, which enables it to achieve speed-of-light inference performance. With each release, NIM incorporates the latest advancements in kernel optimizations, memory management, and scheduling from these engines to improve performance. 

Figure 1. NVIDIA NIM 1.4 throughput (tokens per second per user) compared to NIM 1.2, showing up to 2.4x faster token generation. Llama 3.1 70B on 2x H100 SXM with 8K input tokens and 256 output tokens; Llama 3.1 8B on 1x H100 SXM with 30K input tokens and 256 output tokens

NIM 1.4 adds significant improvements in kernel efficiency, runtime heuristics, and memory allocation, translating into up to 2.4x faster inference compared to NIM 1.2. These advancements are crucial for businesses that rely on fast responses and high throughput from generative AI applications.
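
To relate these gains to your own workload, the sketch below shows one illustrative way to measure time to first token, end-to-end latency, and approximate per-user token throughput against a running NIM endpoint. The base URL and model name are assumptions, chunk counting only approximates token counts, and this is not NVIDIA's benchmarking methodology.

```python
import time
from openai import OpenAI

# Point the OpenAI-compatible client at a locally running NIM endpoint.
# The base URL and model name are placeholders for this sketch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

start = time.perf_counter()
first_token_time = None
generated_tokens = 0

# Stream a single completion and time the tokens as they arrive.
stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Write a short product description."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        # Counting chunks approximates tokens; exact counts need a tokenizer.
        generated_tokens += 1

end = time.perf_counter()
if first_token_time is not None and generated_tokens > 1:
    print(f"Time to first token: {first_token_time - start:.2f} s")
    print(f"End-to-end latency:  {end - start:.2f} s")
    print(f"Approx. tokens/sec:  {generated_tokens / (end - first_token_time):.1f}")
```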

NIM also benefits from continuous updates to full-stack accelerated computing, which enhances performance and efficiency at every level of the computing stack. This includes support for the latest NVIDIA TensorRT and NVIDIA CUDA versions, further boosting inference performance. NIM users benefit from these continuous improvements without manually updating software.

Figure 2. Request latency for Llama 3.1 8B NIM 1.4 versus Llama 3.1 8B NIM 1.2 across requests-per-second values, showing up to 2x lower latency for NIM 1.4. Running on 1x H100 SXM with 30K input tokens and 256 output tokens

NIM brings together a full suite of preconfigured software to deliver high-performance AI inference with minimal setup, enabling developers to get started quickly.

A continuous innovation loop means that every improvement in TensorRT-LLM, CUDA, and other core accelerated computing technologies immediately benefits NIM users. Updates are seamlessly integrated and delivered through new NIM microservice container releases, eliminating the need for manual configuration and reducing the engineering overhead typically associated with maintaining high-performance inference solutions.

Get started today

NVIDIA NIM is the fastest path to high-performance generative AI without the complexity of traditional model deployment and management. With enterprise-grade reliability and support, plus continuous performance enhancements, NIM makes high-performance AI inference accessible to enterprises. Learn more and get started today.
