Posts by Laikh Tewari
Agentic AI / Generative AI
Dec 16, 2025
Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT-LLM
For machine learning engineers deploying LLMs at scale, the equation is familiar and unforgiving: as context length increases, attention computation costs...
6 MIN READ
Developer Tools & Techniques
Jul 07, 2025
Think Smart and Ask an Encyclopedia-Sized Question: Multi-Million Token Real-Time Inference for 32X More Users
Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents...
8 MIN READ
Agentic AI / Generative AI
Jan 16, 2025
Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM
Language models generate text by predicting the next token, given all previous tokens, including the input text tokens. Key and value elements of the...
7 MIN READ
Agentic AI / Generative AI
Dec 02, 2024
TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x
NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that...
9 MIN READ