CUDA C++
 
    
        
          Aug 27, 2025
        
      
      How to Improve CUDA Kernel Performance with Shared Memory Register Spilling
          When a CUDA kernel requires more hardware registers than are available, the compiler is forced to move the excess variables into local memory, a process known...
        
      
        9 MIN READ
      
      
     
    
        
          Aug 07, 2025
        
      
      Efficient Transforms in cuDF Using JIT Compilation
          RAPIDS cuDF offers a broad set of ETL algorithms for processing data with GPUs. For pandas users, cuDF accelerated algorithms are available with the zero code...
        
      
        9 MIN READ
      
      
     
    
        
          May 09, 2025
        
      
      CUDA C++ Compiler Updates Impacting ELF Visibility and Linkage
          In the next CUDA major release, CUDA 13.0, NVIDIA is introducing two significant changes to the NVIDIA CUDA Compiler Driver (NVCC) that will impact ELF...
        
      
        11 MIN READ
      
      
     
    
        
          Nov 28, 2024
        
      
      Supercharging Deduplication in pandas Using RAPIDS cuDF
          A common operation in data analytics is to drop duplicate rows. Deduplication is critical in Extract, Transform, Load (ETL) workflows, where you might want to...
        
      
        12 MIN READ
      
      
     
    
        
          Aug 08, 2024
        
      
      Improving GPU Performance by Reducing Instruction Cache Misses
          GPUs are specially designed to crunch through massive amounts of data at high speed. They have a large amount of compute resources, called streaming...
        
      
        12 MIN READ
      
      
     
    
        
          Nov 13, 2023
        
      
      Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip
          The new hardware developments in NVIDIA Grace Hopper Superchip systems enable some dramatic changes to the way developers approach GPU programming. Most...
        
      
        17 MIN READ
      
      
     
    
        
          Aug 22, 2023
        
      
      Simplifying GPU Application Development with Heterogeneous Memory Management
          Heterogeneous Memory Management (HMM) is a CUDA memory management feature that extends the simplicity and productivity of the CUDA Unified Memory programming...
        
      
        16 MIN READ
      
      
     
    
        
          Apr 20, 2023
        
      
      Debugging a Mixed Python and C Language Stack
          Debugging is difficult. Debugging across multiple languages is especially challenging, and debugging across devices often requires a team with varying skill...
        
      
        18 MIN READ
      
      
     
    
        
          Jun 23, 2022
        
      
      Just Released: CUTLASS v2.9
          The latest version of CUTLASS offers users BLAS3 operators accelerated by tensor cores, Python integrations, GEMM compatibility extensions, and more.
        
      
        1 MIN READ
      
      
     
    
        
          Mar 23, 2022
        
      
      Boosting Application Performance with GPU Memory Prefetching
          NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also...
        
      
        10 MIN READ
      
      
     
    
        
          Feb 10, 2022
        
      
      Implementing High-Precision Decimal Arithmetic with CUDA int128
          “Truth is much too complicated to allow anything but approximations.” (John von Neumann) The history of computing has demonstrated that there is no limit...
        
      
        19 MIN READ
      
      
     
    
        
          Dec 08, 2020
        
      
      Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager
          When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high...
        
      
        24 MIN READ
      
      
     
    
        
          Jun 19, 2017
        
      
      Unified Memory for CUDA Beginners
          My previous introductory post, "An Even Easier Introduction to CUDA C++", introduced the basics of CUDA programming by showing how to write a simple program...
        
      
        16 MIN READ
      
      
     
    
        
          Feb 23, 2016
        
      
      High-Performance Geometric Multi-Grid with GPU Acceleration
          Linear solvers are probably the most common tool in scientific computing applications. There are two basic classes of methods that can be used to solve an...
        
      
        16 MIN READ
      
      
     
    
        
          Oct 19, 2015
        
      
      Cutting Edge Parallel Algorithms Research with CUDA
          Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” of the 51 accepted at Euro-Par...
        
      
        14 MIN READ
      
      
     
    
        
          Oct 12, 2015
        
      
      Accelerating Materials Discovery with CUDA
          In this post, we discuss how CUDA has facilitated materials research in the Department of Chemical and Biomolecular Engineering at UC Berkeley and Lawrence...
        
      
        15 MIN READ