AI Inference / Inference Microservices

Feb 28, 2025
Spotlight: NAVER Place Optimizes SLM-Based Vertical Services with NVIDIA TensorRT-LLM
NAVER is a popular South Korean search engine company that offers NAVER Place, a geo-based service that provides detailed information about millions of...
13 MIN READ

Feb 24, 2025
NVIDIA AI Enterprise Adds Support for NVIDIA H200 NVL
NVIDIA AI Enterprise is the cloud-native software platform for the development and deployment of production-grade AI solutions. The latest release of the NVIDIA...
4 MIN READ

Feb 14, 2025
Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding
Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents,...
7 MIN READ

Feb 12, 2025
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling
As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is...
6 MIN READ

Feb 10, 2025
Just Released: Tripy, a Python Programming Model for TensorRT
Experience high-performance inference, intuitive APIs, easy debugging with eager mode, clear error messages, and more.
1 MIN READ

Feb 05, 2025
OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability
Matrix multiplication and attention mechanisms are the computational backbone of modern AI workloads. While libraries like NVIDIA cuDNN provide highly optimized...
5 MIN READ

Jan 24, 2025
Optimize AI Inference Performance with NVIDIA Full-Stack Solutions
The explosion of AI-driven applications has placed unprecedented demands on both developers, who must balance delivering cutting-edge performance with managing...
9 MIN READ

Dec 19, 2024
Enhance Your Training Data with New NVIDIA NeMo Curator Classifier Models
Classifier models specialize in categorizing data into predefined groups or classes, playing a crucial role in optimizing data processing pipelines for...
11 MIN READ

Dec 18, 2024
NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference
Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM)...
6 MIN READ

Dec 12, 2024
Advancing Solar Irradiance Prediction with NVIDIA Earth-2
As global electricity demand continues to rise, traditional sources of energy are increasingly unsustainable. Energy providers are facing pressure to reduce...
9 MIN READ

Dec 12, 2024
Integration of NVIDIA BlueField DPUs with WEKA Client Boosts AI Workload Efficiency
WEKA, a pioneer in scalable software-defined data platforms, and NVIDIA are collaborating to unite WEKA's state-of-the-art data platform solutions with powerful...
5 MIN READ

Dec 11, 2024
NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching
NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes...
4 MIN READ

Nov 21, 2024
NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200
Generative AI models are advancing rapidly. Every generation of models comes with a larger number of parameters and longer context windows. The Llama 2 series...
5 MIN READ

Nov 19, 2024
Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B-parameter and 90B-parameter variants. These models are...
6 MIN READ

Nov 15, 2024
NVIDIA NIM 1.4 Ready to Deploy with 2.4x Faster Inference
The demand for ready-to-deploy high-performance inference is growing as generative AI reshapes industries. NVIDIA NIM provides production-ready microservice...
3 MIN READ

Nov 15, 2024
Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill
In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment...
4 MIN READ