cuTile.jl Brings NVIDIA CUDA Tile-Based Programming to Julia

NVIDIA CUDA Tile is one of the most significant additions to NVIDIA CUDA programming and unlocks automatic access to tensor cores and other specialized hardware. Earlier this year, NVIDIA released cuTile for Python, giving Python developers a natural way to write high-performance GPU kernels. 

Now, the same programming model is available in Julia through cuTile.jl. In this blog post, we’ll explore how cuTile.jl simplifies the development of high-performance CUDA kernels, demonstrate its idiomatic Julia syntax, and discuss its performance parity with the existing cuTile Python implementation.

What is tile-based GPU programming?

Traditional GPU programming with CUDA requires developers to think about threads, warps, and memory hierarchies. While powerful, this approach requires the programmer to map algorithms onto hardware efficiently. With CUDA Tile, developers describe operations on tiles of data, and the compiler handles the mapping to hardware.

Consider vector addition. In the traditional GPU programming model, using CUDA.jl, the programmer must manage individual threads explicitly:

using CUDA

function vadd(a, b, c, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

threads = 512
blocks = cld(vector_size, threads)
@cuda threads=threads blocks=blocks vadd(a, b, c, vector_size)

With CUDA Tile through cuTile.jl, the same operations are now expressed at the tile level, hiding details like index calculations or out-of-bounds checks:

import cuTile as ct

function vadd(a, b, c, tile_size)
    pid = ct.bid(1)
    tile_a = ct.load(a, pid, (tile_size,))
    tile_b = ct.load(b, pid, (tile_size,))
    ct.store(c, pid, tile_a + tile_b)
    return
end

tile_size = 1024
grid = cld(vector_size, tile_size)
ct.launch(vadd, grid, a, b, c, ct.Constant(tile_size))
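The grid size in both launch snippets is a ceiling division: every element must be covered, so a vector whose length is not a multiple of the tile size gets one extra, partially filled tile. A quick check of Julia's built-in `cld`:

```julia
tile_size = 1024

# cld is ceiling division: 4096 elements fit exactly in 4 tiles,
# but 4097 elements need a fifth, partially filled tile.
for n in (4096, 4097)
    println((n, cld(n, tile_size)))  # (4096, 4) then (4097, 5)
end
```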

Compare this with the Python equivalent:

@ct.kernel
def vadd(a, b, c, tile_size: ct.Constant[int]):
    pid = ct.bid(0)
    tile_a = ct.load(a, index=(pid,), shape=(tile_size,))
    tile_b = ct.load(b, index=(pid,), shape=(tile_size,))
    ct.store(c, index=(pid,), tile=tile_a + tile_b)

from math import ceil

tile_size = 1024
grid = ceil(vector_size / tile_size)
ct.launch(stream, grid, vadd, (a, b, c, tile_size))

The two are strikingly similar, and this is deliberate. cuTile.jl keeps the abstraction level of kernels identical to those written in cuTile Python, making it easy to port code over or learn from the cuTile Python documentation. At the same time, it uses Julia idioms wherever possible to make the package intuitive for Julia programmers, including 1-based indexing and broadcast expressions for element-wise operations.

Idiomatic Julia kernels

Where this really shines is in kernels that go beyond simple loads and stores. The following row-normalization kernel implements the core of layer normalization, without the weights and bias:

function normalize_rows(X, Y, tile_n)
    bid = ct.bid(1)
    tile = ct.load(X, (bid, 1), (1, tile_n))
    mean = sum(tile; dims=2) / size(X, 2)
    centered = tile .- mean
    var = sum(centered .^ 2.0f0; dims=2) / size(X, 2)
    ct.store(Y, (bid, 1), centered ./ sqrt.(var .+ 1f-5))
    return
end

In this example, sum, size, and sqrt are standard Julia functions augmented to work on tiles. The dots (.^, .-, ./) are standard Julia broadcasting syntax, showing the operation is applied element-wise. The kernel reads like regular Julia array code. The closer cuTile.jl kernels are to ordinary Julia, the easier it is to share and reuse code between the CPU and GPU.
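To illustrate that point, here is the same computation on the CPU with plain Julia arrays (a hypothetical reference implementation written for this post, not part of cuTile.jl). Only the tile loads and stores change; the arithmetic is identical:

```julia
# CPU reference for the row-normalization kernel above.
# Instead of loading one row tile per block, it normalizes
# every row of a plain Matrix at once.
function normalize_rows_cpu(X::AbstractMatrix{Float32})
    n = size(X, 2)
    mean = sum(X; dims=2) ./ n          # per-row mean (column vector)
    centered = X .- mean                # broadcast subtraction
    var = sum(centered .^ 2; dims=2) ./ n
    return centered ./ sqrt.(var .+ 1f-5)
end

X = rand(Float32, 4, 8)
Y = normalize_rows_cpu(X)
all(abs.(sum(Y; dims=2)) .< 1f-3)  # each row now has ≈ zero mean
```

Code like this makes a convenient oracle for testing the GPU kernel, which applies the same arithmetic one row tile at a time.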

Performance of cuTile.jl

cuTile.jl targets the same NVIDIA Tile IR backend as cuTile Python, so both packages produce the same kind of GPU machine code. On an NVIDIA GeForce RTX 5080 (compute capability 12.0, NVIDIA Blackwell architecture), compute-intensive kernels achieve performance parity with the Python implementation:

Kernel                    cuTile.jl      cuTile Python   cuTile.jl relative to cuTile Python
Vector addition           838 GB/s       843 GB/s        99%
Matrix transpose          797 GB/s       812 GB/s        98%
Matrix multiplication     50.9 TFLOPS    50.5 TFLOPS     100%
Batch matrix multiply     43.0 TFLOPS    47.5 TFLOPS     91%

Table 1. Performance comparison of common GPU kernels when using Julia or Python as the front end

Some kernels with more complex control flow, such as layer normalization or FFT, don’t reach full performance parity, as the cuTile.jl compiler is still maturing. These are tracked as known issues and are actively being worked on.

How cuTile.jl works

cuTile.jl uses a custom Julia compiler that intercepts standard library calls such as +, sum, reshape, and routes them to Tile IR operations. The resulting IR is then lowered to Tile IR bytecode, the same binary format that cuTile Python produces. From there, the NVIDIA tileiras compiler handles the final compilation to GPU machine code.

The generated Tile IR can be inspected for any kernel:

julia> ct.@device_code_tiled ct.launch(vadd, grid, a, b, c, ct.Constant(16))
cuda_tile.module @kernels {
  entry @vadd(%arg0: tile<ptr<f32>>, %arg1: tile<i32>, ...) {
    ...
    return
  }
}

This transparency is valuable for debugging and for understanding how high-level Julia code maps to tile operations.

Current status of cuTile.jl

cuTile.jl is an experimental, open-source package under active development at JuliaGPU/cuTile.jl. It supports a broad set of tile operations such as memory access, arithmetic, reductions, scans, matrix multiply, shape manipulation, and atomics. It also includes working examples for vector addition, matrix multiplication, transpose, batch matrix multiply, layer normalization, and FFT.

That said, this is early-stage software, and:

  • Not all cuTile features are implemented.
  • Some Julia language features (notably iterator-based `for` loops) aren't supported in kernels, or generate inefficient code.
  • The integration with CUDA.jl needs to improve to facilitate coexistence with SIMT kernels.
  • APIs may change without notice.

The project builds on Julia’s existing GPU ecosystem, integrating with CUDA.jl for array management and kernel launching. Users who are already writing GPU code in Julia with CUDA.jl will find the transition to tile-based programming straightforward.

Getting started

Just like cuTile Python, cuTile.jl requires an NVIDIA Blackwell GPU and an NVIDIA driver for CUDA 13 or higher. The package also requires Julia 1.11 or higher.

Launch Julia and press `]` at the REPL prompt to enter the integrated package manager, then install cuTile.jl:

pkg> add cuTile

Optionally, run the test suite:

pkg> test cuTile

The GitHub repository contains a full list of supported operations and detailed documentation on how cuTile.jl differs from both cuTile Python and standard Julia.
