NVIDIA CUDA Tile is a GPU programming model that provides portable, first-class access to NVIDIA Tensor Cores, unlocking peak GPU performance. One of the great things about CUDA Tile is that you can build your own DSL on top of it.
This post shares the work NVIDIA is doing to integrate CUDA Tile as a backend for OpenAI Triton, an open source Python DSL designed for writing deep learning kernels for GPUs. Triton supports tiled computation, a technique that divides data and computational work into small blocks, and includes an MLIR-based compiler that generates PTX. This enables researchers without CUDA experience to write efficient GPU code.
What are CUDA Tile and CUDA Tile IR?
CUDA Tile extends the CUDA programming model to enable first-class support for tile programming. Introduced in CUDA 13.1, CUDA Tile represents a paradigm shift in GPU programming. Rather than requiring developers to think in terms of individual threads through the SIMT model, the tile-based model allows computation to be expressed at a higher level of abstraction.
You simply specify operations on data blocks (tiles), while the compiler and runtime system automatically handle thread scheduling, hardware mapping, and resource allocation. This design simultaneously reduces programming complexity and enables more aggressive compiler optimizations.
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure. CUDA Tile development is driven by the CUDA Tile IR specification, which defines the formal semantics, operations, and type system for tile-based computations on NVIDIA GPUs.
What is Triton-to-TileIR?
The Triton-to-TileIR backend is a bridge that enables Triton to target CUDA Tile IR instead of PTX. It extends the Triton compiler ecosystem, enabling developers to compile and execute GPU kernels written in OpenAI Triton through the newly introduced CUDA Tile IR backend. This bridges a high-level programming language (Triton) with NVIDIA’s next-generation GPU programming model, offering a seamless path to leverage modern hardware capabilities without requiring code rewrites.
As GPU programming continues to evolve beyond traditional SIMT models toward tile-based abstractions, this integration enables developers to benefit from Triton’s accessible Python syntax while gaining access to TileIR-native support for Tensor Cores and architectural portability.
Triton-to-TileIR democratizes access to these new capabilities. Notably, Triton itself is fundamentally a tile-based programming language—developers express computations in terms of blocks (tiles) of data rather than individual threads, which is conceptually aligned with the CUDA Tile IR.
This provides a direct backend compilation path: instead of compiling Triton’s tile-level abstractions down to thread-level SIMT code, Triton-to-TileIR preserves the tile-level semantics and compiles directly to CUDA Tile IR, which natively understands tile-granularity computation.
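To make this concrete, here is a minimal sketch of the canonical Triton vector-add kernel (essentially the 01-vector-add tutorial referenced later in this post). The kernel is written entirely in terms of a tile of BLOCK_SIZE elements, with no per-thread indexing anywhere, which is exactly the granularity CUDA Tile IR natively understands:
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one tile of BLOCK_SIZE elements;
    # the backend, not the programmer, maps threads onto the tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)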
The existing Triton user community can leverage the advantages of CUDA Tile IR without learning new languages or rewriting existing code. A simple environment variable switches the compilation pipeline from the PTX backend to the CUDA Tile IR backend, unlocking improved performance and future-proof architecture compatibility.
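For example, a launch script can opt into the Tile IR backend programmatically. This sketch assumes the ENABLE_TILE variable shown in the verification steps later in this post, and that it is set before Triton compiles any kernels:
import os

# Assumption: ENABLE_TILE=1 routes compilation to the CUDA Tile IR backend
# (see the verification steps below); set it before any kernel compiles.
os.environ["ENABLE_TILE"] = "1"

import triton  # imported after setting the variable, so the backend sees it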
Triton users will be able to select which backend (PTX backend or CUDA Tile IR backend) to use on a per-kernel basis in their applications.
Development road map of Triton-to-TileIR
As an incubator project within the triton-lang organization, Triton-to-TileIR is in active development. The repository serves as a collaboration space for implementing and refining the CUDA Tile IR backend before potential integration into the main Triton compiler.
Several technical workstreams likely comprise the development road map, including:
- Core conversion infrastructure: Implementing MLIR dialect conversion patterns to map Triton operations to CUDA Tile IR equivalents
- Testing and validation: Developing comprehensive test suites to verify semantic correctness of the transformation, including edge cases in control flow, memory access patterns, and numerical precision
- Performance benchmarking: Establishing performance baselines comparing TileIR-compiled kernels against PTX-compiled equivalents across diverse operations (matrix multiplication, convolutions, element-wise operations, reductions, and so on)
- Open source project integration: Coordinating with the open source community to enable better support for the CUDA Tile IR backend in open source projects, such as Helion
How to use Triton-to-TileIR
Triton-to-TileIR currently supports only source-based compilation. Prebuilt binaries are not available, so you must build the project from source in your local environment.
Prerequisites:
- CUDA version: CUDA 13.1 or higher
- GPU architecture: NVIDIA Blackwell GPUs (for example, GeForce RTX 5080); previous GPU architectures will be enabled in upcoming CUDA releases
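If PyTorch is installed, a quick sanity check of both prerequisites from Python might look like the following sketch (the capability values are illustrative; consult the NVIDIA documentation for your exact GPU):
import torch

# CUDA toolkit version the framework was built against; the Tile IR
# backend requires CUDA 13.1 or newer.
print(torch.version.cuda)

# Compute capability of the active GPU. Blackwell-class GPUs report a
# major version of 10 or 12 (for example, GeForce RTX 5080 is (12, 0)).
print(torch.cuda.get_device_capability())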
Build from source
When prerequisites are satisfied, clone and build the project from source:
# Clone the repository
git clone https://github.com/triton-lang/Triton-to-tile-IR.git
cd Triton-to-tile-IR
# Build and install
# Follow the project's README for the exact build steps
pip install -e .
Note that detailed build steps may vary. Consult the Triton-to-TileIR README and build documentation for architecture-specific configurations, dependency management, and troubleshooting guidance.
Verify Tile IR compilation
After building, verify the installation by running the vector addition tutorial and confirming that the Tile IR backend is being used:
# Navigate to the tutorial directory
cd python/tutorials
# Run the vector addition example with Tile IR enabled
export ENABLE_TILE=1
python 01-vector-add.py
When the Tile IR backend is active, Triton caches compiled kernels with .tileIR file extensions instead of the standard .cubin files used by the SIMT backend. Check for these cached files:
# Find the Triton cache directory (typically ~/.triton/cache) and list
# any Tile IR artifacts produced by the compilation
find ~/.triton/cache -name "*.tileIR"
Limitations of Triton-to-TileIR
While Triton-to-TileIR opens promising new possibilities, the project remains in a relatively early development stage with several known constraints, including unsupported operations and temporary performance issues.
Unsupported operations
Not all Triton operations are implemented in the Tile IR backend yet. See the project documentation for details on operations and features that are not yet supported or only partially supported.
Compatibility of the Triton CUDA Tile IR backend will continue to improve as new CUDA versions are released.
Suboptimal tensor-of-pointer performance
The “tensor-of-pointer” pattern in Triton, where tensors of pointers describe memory access patterns, performs suboptimally on the Tile IR backend with CUDA 13.1. This is a temporary limitation. For impacted workloads, you can:
- Temporarily fall back to the SIMT backend for certain critical operations
- Await forthcoming optimization passes in future project releases
- Refine code to adopt the TMA load/store API
Regarding the last point: many of the tensors loaded in kernels have contiguous tiles and well-defined shapes and strides, so materializing a tensor of pointers inside the kernel is unnecessary. Instead, this layout information can be passed to the TMA load/store API, allowing the Tile IR backend to achieve better performance.
For example, a typical tensor-of-pointers pattern may look like the following:
# Before: tensor-of-pointer style
offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
offs_k = tl.arange(0, BLOCK_K)
a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
a = tl.load(a_ptrs)
b = tl.load(b_ptrs)
Here, every element in a_ptrs is an explicit pointer computed in the kernel, even though the tile itself is contiguous and its layout can be fully described by (shape, strides, block_shape).
With TMA, the same operation can be rewritten as:
# After: TMA descriptor style
desc_a = tl.make_tensor_descriptor(
    a_ptr,                            # base pointer
    shape=(M, K),
    strides=(stride_am, stride_ak),
    block_shape=(BLOCK_M, BLOCK_K),   # tile size
)
desc_b = tl.make_tensor_descriptor(
    b_ptr,
    shape=(K, N),
    strides=(stride_bk, stride_bn),
    block_shape=(BLOCK_K, BLOCK_N),
)
offs_m = pid_m * BLOCK_M
offs_n = pid_n * BLOCK_N
a_tile = desc_a.load([offs_m, 0])     # [BLOCK_M, BLOCK_K]
b_tile = desc_b.load([0, offs_n])     # [BLOCK_K, BLOCK_N]
# desc_c (a descriptor for the output C, built the same way) stores the result
desc_c.store([offs_m, offs_n], acc)   # TMA-backed store
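One caveat worth noting: in recent Triton releases, creating tensor descriptors inside a kernel with tl.make_tensor_descriptor requires registering a host-side allocator so the runtime can stage the descriptors. Whether and how this applies may vary by Triton version, so treat the following as an illustrative sketch and check your version’s documentation:
import torch
import triton

# Assumption: recent Triton releases need a host-side allocator to stage
# TMA descriptors created in-kernel; verify against your Triton version.
def alloc_fn(size: int, alignment: int, stream):
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(alloc_fn)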
Learn more about Triton-to-TileIR
The Triton-to-TileIR project represents a significant step in the evolution of GPU programming, bridging the gap between developer productivity and hardware efficiency. By enabling Triton’s accessible, tile-oriented programming model to target the CUDA Tile IR virtual instruction set, the integration promises to deliver performance, portability, and future-readiness for machine learning practitioners and GPU developers.
For developers already using Triton, the TileIR backend will offer a pathway to leverage next-generation GPU architectures with minimal code changes. For the broader GPU programming ecosystem, this collaboration demonstrates how strategic partnerships between language designers and hardware vendors can create compounding benefits—making advanced hardware capabilities accessible without sacrificing the high-level abstractions that enable rapid innovation.
As the project matures and moves from incubation to production readiness, it will be fascinating to observe how the integration influences both Triton’s adoption and the broader trajectory of tile-based GPU programming. The ultimate success metric will be simple: can researchers with limited GPU expertise write Triton code that executes with near-optimal performance on NVIDIA GPUs?
To learn more, check out the triton-lang/Triton-to-tile-IR GitHub repo and Performance Tuning Tips for CUDA Tile IR Backend.