TensorRT

Apr 24, 2025
Benchmarking Agentic LLM and VLM Reasoning for Gaming with NVIDIA NIM
Researchers from the University College London (UCL) Deciding, Acting, and Reasoning with Knowledge (DARK) Lab leverage NVIDIA NIM microservices in their new...
7 MIN READ

Apr 21, 2025
Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT
State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, requiring significant...
8 MIN READ

Apr 02, 2025
NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0
The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of increasing model sizes, real-time latency...
10 MIN READ

Mar 18, 2025
Seamlessly Scale AI Across Cloud Environments with NVIDIA DGX Cloud Serverless Inference
NVIDIA DGX Cloud Serverless Inference is an auto-scaling AI inference solution that enables application deployment with speed and reliability. Powered by NVIDIA...
9 MIN READ

Mar 18, 2025
NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance
NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over...
14 MIN READ

Mar 10, 2025
Streamline LLM Deployment for Autonomous Vehicle Applications with NVIDIA DriveOS LLM SDK
Large language models (LLMs) have shown remarkable generalization capabilities in natural language processing (NLP). They are used in a wide range of...
7 MIN READ

Feb 28, 2025
Spotlight: NAVER Place Optimizes SLM-Based Vertical Services with NVIDIA TensorRT-LLM
NAVER is a popular South Korean search engine company that offers Naver Place, a geo-based service that provides detailed information about millions of...
13 MIN READ

Feb 10, 2025
Just Released: Tripy, a Python Programming Model For TensorRT
Experience high-performance inference, usability, intuitive APIs, easy debugging with eager mode, clear error messages, and more.
1 MIN READ
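
The teaser above highlights Tripy's eager mode; as a quick illustration, here is a minimal sketch of what eager execution looks like in a Tripy-style workflow. The module name `tripy` and the `tp.Tensor` constructor follow the project's published quick-start examples, but treat the exact API as an assumption and check the Tripy documentation before relying on it.

```python
# Minimal eager-mode sketch in the style of Tripy's quick start.
# ASSUMPTION: the module name `tripy` and `tp.Tensor` match the released API.
import tripy as tp

a = tp.Tensor([1.0, 2.0, 3.0])
b = tp.Tensor([4.0, 5.0, 6.0])

# Eager mode executes each operation immediately, so intermediates can be
# printed and inspected while debugging, before the same code is compiled
# down to a TensorRT executable for deployment.
print(a + b)
```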

Jan 30, 2025
New AI SDKs and Tools Released for NVIDIA Blackwell GeForce RTX 50 Series GPUs
NVIDIA recently announced a new generation of PC GPUs—the GeForce RTX 50 Series—alongside new AI-powered SDKs and tools for developers. Powered by the...
6 MIN READ

Jan 24, 2025
Optimize AI Inference Performance with NVIDIA Full-Stack Solutions
The explosion of AI-driven applications has placed unprecedented demands on both developers, who must balance delivering cutting-edge performance with managing...
9 MIN READ

Dec 18, 2024
NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference
Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM)...
6 MIN READ
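
For readers unfamiliar with speculative decoding, the sketch below shows the generic draft-then-verify loop the technique is built on: a cheap draft model proposes several tokens, and the target model validates them, so accepted tokens cost only a single target-model pass. This is a conceptual toy, not Apple's ReDrafter (which drafts with an RNN head and beam search) and not TensorRT-LLM's API; `draft_model` and `target_model` are hypothetical greedy next-token callables.

```python
# Toy draft-then-verify loop behind speculative decoding (conceptual only).
def speculative_decode(target_model, draft_model, prompt, n_draft=4, max_tokens=64):
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1. The cheap draft model proposes a short run of tokens.
        draft, ctx = [], list(tokens)
        for _ in range(n_draft):
            ctx.append(draft_model(ctx))
            draft.append(ctx[-1])
        # 2. The target model checks each drafted position; a real system
        #    scores all positions in a single batched forward pass.
        base = list(tokens)
        for i, t in enumerate(draft):
            expected = target_model(base + draft[:i])
            if expected != t:
                tokens.append(expected)  # keep the target's token at the first mismatch
                break
            tokens.append(t)             # drafted token accepted "for free"
    return tokens[:max_tokens]

# Trivial stand-in "models" that count upward, so every draft is accepted.
nxt = lambda ctx: ctx[-1] + 1
print(speculative_decode(nxt, nxt, prompt=[0], max_tokens=10))  # [0, 1, ..., 9]
```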

Dec 11, 2024
NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching
NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes...
4 MIN READ
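
The core idea behind in-flight (also called continuous) batching is scheduling at token granularity: finished requests leave the batch and queued requests join it on every generation step, rather than the batch draining as a unit. The toy scheduler below illustrates only that scheduling idea, not TensorRT-LLM's runtime.

```python
# Toy in-flight batching scheduler (conceptual; not the TensorRT-LLM runtime).
from collections import deque

def serve(requests, batch_size=4):
    queue = deque(requests)            # entries: (request_id, tokens_to_generate)
    active, done = [], []
    while queue or active:
        # Admit waiting requests into free batch slots at every step.
        while queue and len(active) < batch_size:
            active.append(list(queue.popleft()))
        # One "forward pass": each active request emits one token.
        for req in active:
            req[1] -= 1
        # Retire finished requests immediately, freeing their slots.
        done += [rid for rid, left in active if left == 0]
        active = [r for r in active if r[1] > 0]
    return done

print(serve([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
```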

Nov 21, 2024
NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200
Generative AI models are advancing rapidly. Every generation of models comes with a larger number of parameters and longer context windows. The Llama 2 series...
5 MIN READ
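
Multiblock attention targets the long-sequence decode case, where a single query token must attend over a very long KV cache: the cache is split into blocks processed in parallel, and the partial results are merged with a numerically stable rescaling. The NumPy sketch below reproduces that merge for a single query and checks it against ordinary full-softmax attention; it mirrors the idea only, not the HGX H200 kernel.

```python
# Block-split attention for one query, merged via a log-sum-exp correction.
import numpy as np

def blocked_attention(q, K, V, block=256):
    outs, maxes, sums = [], [], []
    for s in range(0, K.shape[0], block):
        logits = K[s:s+block] @ q           # scores for this KV block
        m = logits.max()
        w = np.exp(logits - m)              # stable partial softmax weights
        outs.append(w @ V[s:s+block])       # unnormalized partial output
        maxes.append(m)
        sums.append(w.sum())
    # Merge: rescale every block's partials by exp(m_block - m_global).
    m_global = max(maxes)
    scales = [np.exp(m - m_global) for m in maxes]
    num = sum(o * c for o, c in zip(outs, scales))
    den = sum(s * c for s, c in zip(sums, scales))
    return num / den

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
z = K @ q
ref = (np.exp(z - z.max()) / np.exp(z - z.max()).sum()) @ V
assert np.allclose(blocked_attention(q, K, V), ref)  # matches full attention
```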

Nov 19, 2024
Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are...
6 MIN READ

Nov 15, 2024
NVIDIA NIM 1.4 Ready to Deploy with 2.4x Faster Inference
The demand for ready-to-deploy high-performance inference is growing as generative AI reshapes industries. NVIDIA NIM provides production-ready microservice...
3 MIN READ

Nov 08, 2024
5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse
In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up...
5 MIN READ
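
The technique in this post rests on a simple idea: KV cache blocks computed for a shared prefix (such as a long system prompt) can be looked up instead of recomputed, so prefill only pays for the unseen suffix and TTFT drops. The sketch below models that lookup with hashed token blocks; it is a conceptual illustration of paged-prefix reuse, not TensorRT-LLM's implementation.

```python
# Toy prefix-based KV cache reuse (conceptual; KV blocks are simulated).
BLOCK = 4        # tokens per KV cache block
cache = {}       # prefix-chain key -> simulated KV block

def prefill(prompt_tokens):
    reused, key = 0, ()
    for s in range(0, len(prompt_tokens), BLOCK):
        blk = tuple(prompt_tokens[s:s+BLOCK])
        if len(blk) < BLOCK:
            break                       # partial tail blocks are recomputed
        key = key + blk                 # a block's key depends on its whole prefix
        if key in cache:
            reused += BLOCK             # cache hit: skip computing this block
        else:
            cache[key] = f"kv{key}"     # cache miss: compute and store the block
    computed = len(prompt_tokens) - reused
    return reused, computed

system = list(range(12))                # shared system prompt
print(prefill(system + [100, 101]))     # cold start: (0, 14)
print(prefill(system + [200, 201]))     # warm start: reuses 12 tokens -> (12, 2)
```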