
Focus on Your Algorithm—NVIDIA CUDA Tile Handles the Hardware


CUDA 13.1 launches NVIDIA CUDA Tile, the platform's largest advancement since CUDA was invented in 2006. CUDA Tile introduces a virtual instruction set for tile-based parallel programming, letting you write algorithms at a higher level while abstracting away the details of specialized hardware, such as tensor cores.

Why tile programming for GPUs?

CUDA exposes a single-instruction, multiple-thread (SIMT) hardware and programming model for developers. This requires (and enables) you to exercise fine-grained control over how your code executes, with maximum flexibility and specificity. However, it can also require considerable effort to write code that performs well, especially across multiple GPU architectures.
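For example, a SIMT kernel computes a global index per thread and operates element by element, and the launch specifies the block/thread geometry explicitly. The sketch below uses Numba's CUDA bindings purely for illustration (Numba isn't part of CUDA Tile; it's assumed here as a convenient way to show the SIMT style in Python):

```python
# A minimal SIMT-style sketch using Numba's CUDA bindings (illustrative only;
# Numba is assumed to be installed, along with a CUDA-capable GPU).
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # this thread's global element index
    if i < out.size:          # guard threads past the end of the array
        out[i] = a[i] + b[i]  # each thread computes exactly one element

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)  # launch geometry is explicit
```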

There are many libraries to help developers extract performance, such as NVIDIA CUDA-X and NVIDIA CUTLASS. CUDA Tile introduces a new way to program GPUs at a higher level than SIMT.

With the evolution of computational workloads, especially in AI, tensors have become a fundamental data type. NVIDIA has developed specialized hardware to operate on tensors, such as NVIDIA Tensor Cores (TC) and NVIDIA Tensor Memory Accelerators (TMA), which are now integral to every new GPU architecture. 

As hardware grows more complex, more software is needed to harness its capabilities. CUDA Tile abstracts away tensor cores and their programming models, so code that uses CUDA Tile is compatible with current and future tensor core architectures.

Tile-based programming enables you to express your algorithm by specifying chunks of data, or tiles, and then defining the computations performed on those tiles. You don’t need to specify how your algorithm is executed at the element-by-element level: the compiler and runtime handle that for you.
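To make the contrast concrete, here is tile-style decomposition sketched in plain NumPy (illustrative only; this is not cuTile syntax). The algorithm is expressed over whole tiles, and no per-element index appears in the compute step:

```python
# Tile-style thinking sketched in plain NumPy; not actual cuTile syntax.
import numpy as np

TILE = 64  # tile edge length; assumed to divide the matrix size evenly

def scaled_add_tiled(a, b, alpha):
    """Compute alpha*a + b one tile at a time, with no per-element code."""
    out = np.empty_like(a)
    n = a.shape[0]
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            ta = a[i:i+TILE, j:j+TILE]                 # load one tile
            tb = b[i:i+TILE, j:j+TILE]
            out[i:i+TILE, j:j+TILE] = alpha * ta + tb  # whole-tile compute
    return out

a = np.random.rand(256, 256)
b = np.random.rand(256, 256)
assert np.allclose(scaled_add_tiled(a, b, 2.0), 2.0 * a + b)
```

In a real tile kernel, the per-tile arithmetic is what you write; mapping tiles onto blocks, threads, and tensor core instructions is the compiler's and runtime's job.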

Figure 1 shows the conceptual differences between the tile model we’re introducing with CUDA Tile and the CUDA SIMT model.

Figure 1. In the tile model (left), the application partitions the data into blocks and the compiler maps that data onto threads; in the SIMT model (right), the application maps the data to both blocks and threads

This programming paradigm is common in languages such as Python, where libraries like NumPy enable you to define data types like matrices and then express and execute bulk operations with simple code. Under the covers, the right things happen, and the details remain completely transparent to you.
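For instance, a NumPy matrix multiply is a single expression; the library dispatches it to an optimized routine without the caller writing any per-element logic:

```python
# A NumPy bulk operation: one line of user code, optimized execution underneath.
import numpy as np

a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
c = a @ b  # dispatched to an optimized BLAS kernel under the covers
```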

CUDA Tile IR: The foundation of tile programming

The foundation of CUDA Tile is CUDA Tile IR (intermediate representation). CUDA Tile IR introduces a virtual instruction set that enables native programming of the hardware as tile operations. Developers can write higher-level code that is efficiently executed across multiple generations of GPUs with minimal changes.

While NVIDIA Parallel Thread Execution (PTX) ensures portability for SIMT programs, CUDA Tile IR extends the CUDA platform with native support for tile-based programs. Developers focus on partitioning their data-parallel programs into tiles and tile blocks, letting CUDA Tile IR handle the mapping onto hardware resources such as threads, the memory hierarchy, and tensor cores.

By raising the level of abstraction, CUDA Tile IR enables users to build higher-level compilers, frameworks, and domain-specific languages (DSLs) that target NVIDIA hardware. CUDA Tile IR is to tile programming what PTX is to SIMT programming.

One thing to point out is that it’s not an either/or situation. Tile programming is another approach to writing GPU code, but you don’t have to choose between SIMT and tile programming; they coexist. When you need SIMT, you write your kernels as you always have. When you want to operate using tensor cores, you write tile kernels.

Figure 2 shows a high-level diagram of how CUDA Tile fits into a representative software stack, with the tile path separate from, but complementary to, the existing SIMT path.

Figure 2. The Tile path of compilation fits into the full software stack, adjacent to the SIMT path; the SIMT path goes through NVVM/LLVM and PTX, whereas the tile path goes through Tile IR

How developers can use CUDA Tile to write GPU applications

CUDA Tile IR sits one layer beneath where the vast majority of programmers will interface with tile programming. Unless you’re writing a compiler or library, you probably won’t need to concern yourself with the details of CUDA Tile IR itself.

  1. NVIDIA cuTile Python: Most developers will interface with CUDA tile programming through software like NVIDIA cuTile Python, an NVIDIA Python implementation that uses CUDA Tile IR as the back end. We have a blog post that explains how to use cuTile Python, with links to sample code and documentation.
  2. CUDA Tile IR: For developers looking to build their own DSL compiler or library, CUDA Tile IR is where you’ll interface with CUDA Tile. The CUDA Tile IR documentation and specification include information on the CUDA Tile IR programming abstractions, syntax, and semantics. If you’re writing a tool/compiler/library that currently targets PTX, then you can adapt your software to also target CUDA Tile IR.

How to get the CUDA Tile software

CUDA Tile was launched with CUDA 13.1. All the information, including links to documentation, GitHub repos, and sample code, is on our CUDA Tile page.
