CUDA
Oct 24, 2025
Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS
NVIDIA CUDA-X math libraries provide the fundamental numerical building blocks that enable developers to deploy accelerated applications across multiple...
11 MIN READ
Oct 14, 2025
Understanding Memory Management on Hardware-Coherent Platforms
If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an...
6 MIN READ
Sep 29, 2025
Unlock GPU Performance: Global Memory Access in CUDA
Memory management is one of the most important performance considerations when writing a GPU kernel. This post walks you through the important...
15 MIN READ
Sep 16, 2025
Autodesk Research Brings Warp Speed to Computational Fluid Dynamics on NVIDIA GH200
Computer-aided engineering (CAE) forms the backbone for modern product development across industries, from designing safer aircraft to optimizing renewable...
8 MIN READ
Sep 11, 2025
Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6
The constantly increasing compute throughput of NVIDIA GPUs presents a new opportunity for optimizing vision AI workloads: keeping the hardware fed with data....
13 MIN READ
Sep 10, 2025
Developers Can Now Get NVIDIA CUDA Directly from Their Favorite Third-Party Platforms
Building and deploying applications can be challenging for developers, requiring them to navigate the complex relationship between hardware and software...
3 MIN READ
Sep 03, 2025
Accelerate Autonomous Vehicle Development with the NVIDIA DRIVE AGX Thor Developer Kit
Autonomous vehicle (AV) technology is rapidly evolving, fueled by ever-larger and more complex AI models deployed at the edge. Modern vehicles now require not...
8 MIN READ
Sep 02, 2025
Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2
Selecting the best possible General Matrix Multiplication (GEMM) kernel for a specific problem and hardware is a significant challenge. The performance of a...
8 MIN READ
Sep 02, 2025
What’s New in CUDA Toolkit 13.0 for Jetson Thor: Unified Arm Ecosystem and More
The world of embedded and edge computing is about to get faster, more efficient, and more versatile with the upcoming CUDA 13.0 release for Jetson Thor SoC...
12 MIN READ
Aug 27, 2025
How to Improve CUDA Kernel Performance with Shared Memory Register Spilling
When a CUDA kernel requires more hardware registers than are available, the compiler is forced to move the excess variables into local memory, a process known...
9 MIN READ
Aug 13, 2025
Streamline CUDA-Accelerated Python Install and Packaging Workflows with Wheel Variants
If you’ve ever installed an NVIDIA GPU-accelerated Python package, you’ve likely encountered a familiar dance: navigating to pytorch.org, jax.dev,...
15 MIN READ
Aug 06, 2025
What’s New and Important in CUDA Toolkit 13.0
The newest update to the CUDA Toolkit, version 13.0, features advancements to accelerate computing on the latest NVIDIA CPUs and GPUs. As a major release, it...
19 MIN READ
Aug 04, 2025
CUDA Pro Tip: Increase Performance with Vectorized Memory Access
Many CUDA kernels are bandwidth-bound, and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth-bound kernels. This makes it...
6 MIN READ
Aug 04, 2025
Navigating GPU Architecture Support: A Guide for NVIDIA CUDA Developers
If you’ve used the NVIDIA CUDA Compiler (NVCC) for your NVIDIA GPU application recently, you may have encountered a warning message like the following: nvcc...
6 MIN READ
Jul 18, 2025
Optimizing for Low-Latency Communication in Inference Workloads with JAX and XLA
Running inference with large language models (LLMs) in production requires meeting stringent latency constraints. A critical stage in the process is LLM decode,...
6 MIN READ
Jul 16, 2025
CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design
GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions, and...
12 MIN READ