Posts by Matthew Nicely
Agentic AI / Generative AI
Jun 15, 2026
Boosting MoE Training Throughput with Advanced Fusion Kernels
Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable...
9 MIN READ
Networking / Communications
Jul 22, 2025
Understanding NCCL Tuning to Accelerate GPU-to-GPU Communication
The NVIDIA Collective Communications Library (NCCL) is essential for fast GPU-to-GPU communication in AI workloads, using various optimizations and tuning to...
14 MIN READ
Networking / Communications
Jul 14, 2025
NCCL Deep Dive: Cross Data Center Communication and Network Topology Awareness
As the scale of AI training increases, a single data center (DC) is not sufficient to deliver the required computational power. Most recent approaches to...
9 MIN READ
Data Center / Cloud
Feb 03, 2025
Just Released: CUTLASS 3.8
Provides support for the NVIDIA Blackwell SM100 architecture. CUTLASS is a collection of CUDA C++ templates and abstractions for implementing high-performance...
1 MIN READ
Data Center / Cloud
Jan 31, 2025
Just Released: NVIDIA cuDNN 9.7
Bringing support for NVIDIA Blackwell architecture across data center and GeForce products, NVIDIA cuDNN 9.7 delivers speedups of up to 84% for FP8 Flash...
1 MIN READ
Agentic AI / Generative AI
May 24, 2024
Accelerating Transformers with NVIDIA cuDNN 9
The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library for accelerating deep learning primitives with state-of-the-art performance....
12 MIN READ