AI Inference
Sep 10, 2024
Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer
As large language models (LLMs) are becoming even bigger, it is increasingly important to provide easy-to-use and efficient deployment paths because the cost of...
10 MIN READ
Sep 05, 2024
Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch
As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that...
5 MIN READ
Aug 28, 2024
Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs
The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a...
7 MIN READ
Aug 28, 2024
NVIDIA Triton Inference Server Achieves Outstanding Performance in MLPerf Inference 4.1 Benchmarks
Six years ago, we embarked on a journey to develop an AI inference serving solution specifically designed for high-throughput and time-sensitive production use...
8 MIN READ
Aug 21, 2024
Google Cloud Run Adds Support for NVIDIA L4 GPUs, NVIDIA NIM, and Serverless AI Inference Deployments at Scale
Deploying AI-enabled applications and services presents enterprises with significant challenges: Performance is critical as it directly shapes user...
6 MIN READ
Aug 21, 2024
Practical Strategies for Optimizing LLM Inference Sizing and Performance
As the use of large language models (LLMs) grows across many applications, such as chatbots and content creation, it's important to understand the process of...
2 MIN READ
Aug 20, 2024
Deploy the First On-Device Small Language Model for Improved Game Character Roleplay
At Gamescom 2024, NVIDIA announced our first on-device small language model (SLM) for improving the conversation abilities of game characters. We also announced...
4 MIN READ
Aug 15, 2024
NVIDIA TensorRT Model Optimizer v0.15 Boosts Inference Performance and Expands Model Support
NVIDIA has announced the latest v0.15 release of NVIDIA TensorRT Model Optimizer, a state-of-the-art quantization toolkit of model optimization techniques...
5 MIN READ
Aug 14, 2024
Optimizing Inference Efficiency for LLMs at Scale with NVIDIA NIM Microservices
As large language models (LLMs) continue to evolve at an unprecedented pace, enterprises are looking to build generative AI-powered applications that maximize...
8 MIN READ
Aug 12, 2024
NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference
Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements...
8 MIN READ
Aug 07, 2024
Optimizing llama.cpp AI Inference with CUDA Graphs
The open-source llama.cpp code base was originally released in 2023 as a lightweight but efficient framework for performing inference on Meta Llama models....
8 MIN READ
Aug 06, 2024
A Deep Dive into the Latest AI Models Optimized with NVIDIA NIM
Delivered as optimized containers, NVIDIA NIM microservices are designed to accelerate AI application development for businesses of all sizes, paving the way...
9 MIN READ
Jul 30, 2024
Enhancing RAG Pipelines with Re-Ranking
In the rapidly evolving landscape of AI-driven applications, re-ranking has emerged as a pivotal technique to enhance the precision and relevance of enterprise...
8 MIN READ
Jul 15, 2024
Power Your AI Projects with New NVIDIA NIMs for Mistral and Mixtral Models
Large language models (LLMs) are growing in adoption across enterprise organizations, with many building them into their AI applications. Foundation models are...
5 MIN READ
Jul 02, 2024
Achieving High Mixtral 8x7B Performance with NVIDIA H100 Tensor Core GPUs and NVIDIA TensorRT-LLM
As large language models (LLMs) continue to grow in size and complexity, the performance requirements for serving them quickly and cost-effectively continue to...
9 MIN READ
Jun 12, 2024
Demystifying AI Inference Deployments for Trillion Parameter Large Language Models
AI is transforming every industry, addressing grand human scientific challenges such as precision drug discovery and the development of autonomous vehicles, as...
14 MIN READ