Advanced Technical

Jul 09, 2026

A Practical Guide to GPU-Initiated Communication for Molecular Dynamics at Scale

Molecular dynamics (MD) simulations are among the most demanding workloads in computational science. Using them, researchers can observe atomic behavior in...

21 MIN READ

Jun 25, 2026

Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support

Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the...

11 MIN READ

Jun 24, 2026

Accelerating BEV Pooling on NVIDIA GPUs for Physical AI Applications

An increasingly common design pattern for autonomous vehicles (AVs), robotics, and spatial AI systems is bird's-eye-view (BEV) perception. BEV models project...

15 MIN READ

Jun 09, 2026

Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT

This post is the third of a three-part series. See also Model Quantization: Concepts, Methods, and Why It Matters and Model Quantization: Post-Training...

10 MIN READ

May 21, 2026

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling

As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as...

10 MIN READ

Apr 22, 2026

Simplify Sparse Deep Learning with Universal Sparse Tensor in nvmath-python

In a previous post, we introduced the Universal Sparse Tensor (UST), enabling developers to decouple a tensor’s sparsity from its memory layout for greater...

11 MIN READ

Apr 07, 2026

Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling

The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18 tightly...

11 MIN READ

Apr 02, 2026

Achieving Single-Digit Microsecond Latency Inference for Capital Markets

In algorithmic trading, reducing response times to market events is crucial. To keep pace with high-speed electronic markets, latency-sensitive firms often use...

13 MIN READ

Mar 25, 2026

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

9 MIN READ

Mar 12, 2026

Build Accelerated, Differentiable Computational Physics Code for AI with NVIDIA Warp

Computer-aided engineering (CAE) is shifting from human-driven workflows toward AI-driven ones, including physics foundation models that generalize across...

18 MIN READ

Mar 05, 2026

Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile

In this post, we dive into one of the most critical workloads in modern AI: Flash Attention, where you’ll learn: How to implement Flash Attention using NVIDIA...

20 MIN READ

Feb 27, 2026

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM

Organizations deploying LLMs are challenged by inference workloads with different resource requirements. A small embedding model might use only a few gigabytes...

11 MIN READ

Feb 02, 2026

Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is challenging. EP communication is essentially all-to-all,...

11 MIN READ

Jan 30, 2026

Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor

Sparse tensors are vectors, matrices, and higher-dimensional generalizations with many zeros. They are crucial in various fields such as scientific computing,...

15 MIN READ

A global image showing weather patterns.

Jan 26, 2026

How to Unlock Local Detail in Coarse Climate Projections with NVIDIA Earth-2

Global climate models are good at the big picture—but local climate extremes, like hurricanes and typhoons, often disappear in the details. Those patterns are...

12 MIN READ

Jan 13, 2026

Learn How NVIDIA cuOpt Accelerates Mixed Integer Optimization using Primal Heuristics

NVIDIA cuOpt is a GPU-accelerated optimization engine designed to deliver fast, high-quality solutions for large, complex decision-making problems. Mixed...

7 MIN READ