NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and matrix multiply-accumulate—rather than manually coordinating threads, warps, and shared memory.
cuTile.jl brings the same tile-based approach to Julia, a dynamic programming language, so users can write custom GPU kernels without dropping down to NVIDIA CUDA C++. Custom kernels are often essential in Julia’s scientific computing ecosystem, which spans differential equations, probabilistic programming, and physics simulations.
cuTile Python has a growing library of optimized kernels for GPU acceleration. The ability to translate those kernels to cuTile.jl provides the Julia ecosystem with immediate access to battle-tested implementations, instead of rewriting each one from scratch.
This post covers cross-domain-specific language (DSL) GPU kernel translation, drawing on the experience of porting cuTile Python kernels to cuTile.jl (Julia). It shows how to:
- Translate GPU kernels between cuTile Python and cuTile.jl: Walk through a complete matrix multiplication example side-by-side.
- Avoid semantic traps that break naive translations: Indexing, broadcasting, memory layout, and loop forms all diverge between the two DSLs—and silent mismatches produce wrong results, not compiler errors.
- Build a repeatable, skill-driven AI workflow: The translation knowledge is packaged into an LLM skill in TileGym that produces validated Julia kernels in a single pass, systematizing a one-off porting effort.
Cross-DSL GPU kernel translation
Both cuTile Python and cuTile.jl frontends share the same tiled abstraction, making the translation largely algorithmic. However, the cumulative surface-level differences between the two languages are non-trivial, as shown in Table 1.
| Category | Python (cuTile) | Julia (cuTile.jl) |
| --- | --- | --- |
| Indexing | 0-based (`ct.bid(0)`) | 1-based (`ct.bid(1)`) |
| Broadcasting | Implicit (`a + b`) | Explicit dot syntax (`a .+ b`) |
| Memory layout | Row-major | Column-major |
| Kernel definition | `@ct.kernel` decorator | Plain `function ... end` |
| Constants | `param: ct.Constant[int]` in signature | `param::Int` in signature, `ct.Constant(val)` at launch |
| Type conversion | `tile.astype(ct.float32)` | `convert(ct.Tile{Float32}, tile)` |
| Matrix multiply | `ct.mma(a, b, acc=acc)` | `muladd(a, b, acc)` |

*Table 1. Surface-level differences between cuTile Python and cuTile.jl*
None of these translations are conceptually difficult, but miss one `ct.bid(0)` that should be `ct.bid(1)` and you get silent data corruption. Use `*` instead of `.*` for element-wise multiplication, and Julia silently performs a matrix multiply instead. These are the kinds of bugs that waste hours.
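The memory-layout row in Table 1 is the trap that is hardest to see in source code. The following NumPy sketch (an illustration only, using plain `numpy` rather than either DSL) shows how the same logical matrix has a different byte order in row-major and column-major storage, which is why tile indices and accumulator shapes flip in the Julia port:

```python
import numpy as np

# The same logical 2x3 matrix, stored two ways.
A = np.arange(6, dtype=np.float32).reshape(2, 3)  # row-major (C order), like Python
A_f = np.asfortranarray(A)                        # column-major (Fortran order), like Julia

# Logically identical...
assert (A == A_f).all()

# ...but laid out differently in memory (order="K" walks memory order):
print(A.ravel(order="K"))    # rows are contiguous
print(A_f.ravel(order="K"))  # columns are contiguous
```

Running this prints `[0. 1. 2. 3. 4. 5.]` for the row-major array and `[0. 3. 1. 4. 2. 5.]` for the column-major one: identical elements, different traversal order, and no error from either runtime.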
A shared abstraction with a finite set of recurring pitfalls is well-suited for an AI-assisted workflow—if the model is taught what to watch out for.
Translating cuTile Python to cuTile.jl
The process is best understood through actual code. The following examples are from TileGym, where the team ported a set of cuTile Python kernels to cuTile.jl and packaged them as a self-contained Julia subproject.
Matrix multiplication example
The running example uses matmul, which is complex enough to show key translation challenges. Beyond basic syntax differences, the translation must handle loop structure, TF32 tensor core conversion, and the shift from row-major to column-major layout.
cuTile Python:
```python
@ct.kernel
def matmul_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int],
                  tk: ct.Constant[int]):
    bid_m = ct.bid(0)
    bid_n = ct.bid(1)
    num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
    acc = ct.full((tm, tn), 0, dtype=ct.float32)
    dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype
    for k in range(num_k):
        a = ct.load(A, index=(bid_m, k), shape=(tm, tk),
                    padding_mode=ct.PaddingMode.ZERO)
        b = ct.load(B, index=(k, bid_n), shape=(tk, tn),
                    padding_mode=ct.PaddingMode.ZERO)
        a = a.astype(dtype)
        b = b.astype(dtype)
        acc = ct.mma(a, b, acc)
    acc = ct.astype(acc, C.dtype)
    ct.store(C, index=(bid_m, bid_n), tile=acc)
```
cuTile.jl (Julia):
```julia
function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    bid_m = ct.bid(1)
    bid_n = ct.bid(2)
    num_k = ct.num_tiles(A, 2, (tm, tk))
    acc = zeros(Float32, tm, tn)
    U = T === Float32 ? ct.TFloat32 : T
    for k in Int32(1):num_k
        a = ct.load(A; index=(bid_m, k), shape=(tm, tk), padding_mode=ct.PaddingMode.Zero)
        b = ct.load(B; index=(k, bid_n), shape=(tk, tn), padding_mode=ct.PaddingMode.Zero)
        a = convert(ct.Tile{U}, a)
        b = convert(ct.Tile{U}, b)
        acc = muladd(a, b, acc)
    end
    acc = convert(ct.Tile{T}, acc)
    ct.store(C; index=(bid_m, bid_n), tile=acc)
    return
end
```
Beyond the basic syntax changes, note the following:
- The layout flips: The Python row-major `A(M,K)` becomes column-major `A_jl(K,M)` in Julia. The accumulator, load indices, and store indices all change accordingly. Get the accumulator shape wrong, say `(TM, TN)` instead of `(TN, TM)`, and you get wrong results with no compiler warning.
- `ct.mma` → `muladd`: cuTile.jl maps matrix multiply-accumulate to the Julia standard `muladd`, and `ct.PaddingMode.ZERO` becomes `ct.PaddingMode.Zero` (PascalCase).
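The TF32 conversion inside the matmul loop is performed by the tensor-core hardware path, but its numerical effect can be sketched in plain Python/NumPy. The `to_tf32` helper below is hypothetical (not part of either DSL) and approximates the conversion by truncation; real hardware may round. TF32 keeps float32's sign and 8-bit exponent but only the top 10 of its 23 mantissa bits:

```python
import numpy as np

def to_tf32(x):
    # Approximate TF32: keep the sign, the 8-bit exponent, and the top 10
    # mantissa bits of float32 by masking off the low 13 mantissa bits.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

# 2^-10 still fits in 10 mantissa bits; 2^-12 is truncated away.
assert to_tf32(np.float32(1 + 2**-10)) == np.float32(1 + 2**-10)
assert to_tf32(np.float32(1 + 2**-12)) == np.float32(1.0)
```

This is why both DSLs only convert when the input is float32: lower-precision types already fit the tensor-core input formats.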
Softmax example
Softmax pushes things further. Three strategies were implemented in Julia to handle different tensor sizes: tensor memory accelerator (TMA) single-tile, online, and chunked. On top of the matmul patterns, the softmax function brings in broadcast dot syntax (`ct.exp(ct.sub(a, b))` → `exp.(a .- b)`), renamed reductions (`ct.max` → `maximum`, `ct.sum` → `sum`, with the axis shifted by +1), and element-wise `ct.maximum(a, b)` → `max.(a, b)`.
But the real challenge isn’t syntax—it’s maintaining correct running max/sum statistics through the translation.
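To make the invariant concrete, here is a CPU reference of the online strategy, sketched in Python/NumPy (an illustration, not the cuTile kernel itself): stream through the input in chunks while maintaining a running max `m` and a running sum `s` of `exp(x - m)`, rescaling `s` whenever `m` grows. A translation that drops the rescaling step still type-checks but computes the wrong normalizer.

```python
import numpy as np

def online_softmax(x, chunk=4):
    # One streaming pass: keep a running max m and a running sum s of exp(x - m).
    m, s = -np.inf, 0.0
    for start in range(0, len(x), chunk):
        c = x[start:start + chunk]
        m_new = max(m, float(c.max()))
        # Invariant repair: when the max grows, rescale the old sum first.
        s = s * np.exp(m - m_new) + float(np.exp(c - m_new).sum())
        m = m_new
    return np.exp(x - m) / s

x = np.array([10.0, -10.0, 0.0, 3.0, 7.0])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x, chunk=2), ref)
```

The tile versions carry the same `m` and `s` per row across tile loads; the rescaling line is the part that must survive translation exactly.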
Workflow generation with agent skills
The primary outcome of this project wasn’t the translated kernels—it was the skill built to produce them.

A skill, in this context, is a directory of structured knowledge that lives in the repository and is picked up by an LLM agent. This particular skill lives at `.claude/skills/converting-cutile-to-julia/`:
```
.claude/skills/converting-cutile-to-julia/
├── SKILL.md                    # Entry point: workflow overview, top pitfalls
├── translations/
│   └── workflow.md             # Step-by-step conversion with checklists
├── references/
│   ├── api-mapping.md          # Bidirectional Python↔Julia API table
│   ├── critical-rules.md       # 17 rules (indexing, broadcasting, loops, ...)
│   ├── debugging.md            # Error diagnosis for MethodError, IRError, etc.
│   └── testing.md              # Test patterns, tolerances per dtype
├── scripts/
│   └── validate_cutile_jl.py   # Static checker for common anti-patterns
└── examples/
    ├── 01_add/                 # Python→Julia for vector addition
    ├── 02_matmul/              # Python→Julia for matrix multiply
    └── 03_softmax/             # Python→Julia for softmax (3 strategies)
```
The `critical-rules.md` file alone captures 17 pitfalls the team encountered. Table 2 details the most common pitfalls and the associated fixes.

| # | Pitfall | Fix |
| --- | --- | --- |
| 1 | `max(a, b)` on tiles raises `IRError` | Use `max.(a, b)` (broadcast dot) |
| 2 | `ct.load` with `order` puts index positions in the wrong place | `order` remaps BOTH `shape` AND `index` |

*Table 2. The most common conversion pitfalls and their fixes*
There’s also a static validator script that catches issues like a leftover `ct.bid(0)`, Python-style `for` loops, and Python type names before anything runs on the GPU. With all of this in place, the model doesn’t have to rediscover the conversion rules each time: it reads the skill, follows the checklist, and applies the rules.
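The idea behind such a validator can be sketched in a few lines. The patterns and messages below are hypothetical, not the contents of the actual `validate_cutile_jl.py`; the point is that a handful of regexes over the translated Julia source already catches the silent-corruption class of bugs:

```python
import re

# Hypothetical subset of the anti-pattern checks (illustration only).
CHECKS = [
    (r"ct\.bid\(0\)",      "0-based bid: Julia uses ct.bid(1)"),
    (r"PaddingMode\.ZERO", "Python casing: Julia uses PaddingMode.Zero"),
    (r"\brange\(",         "Python range(): use a Julia range like 1:n"),
    (r"\.astype\(",        "Python astype: use convert(ct.Tile{T}, tile)"),
]

def lint_julia_kernel(src: str):
    # Return (line number, message) for every check that fires.
    issues = []
    for lineno, line in enumerate(src.splitlines(), start=1):
        for pattern, msg in CHECKS:
            if re.search(pattern, line):
                issues.append((lineno, msg))
    return issues

bad = "bid_m = ct.bid(0)\na = a.astype(ct.float32)"
for lineno, msg in lint_julia_kernel(bad):
    print(f"line {lineno}: {msg}")
```

A checker like this is cheap enough to run on every conversion attempt, long before a GPU test ever executes.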
The AI agent skill in TileGym
The concrete deliverable is a Julia subproject under julia/ in TileGym, which is open source:
```
julia/
├── Project.toml        # Dependencies: CUDA.jl, cuTile.jl, NNlib.jl, Test
├── kernels/
│   ├── add.jl          # 1D element-wise with alpha scaling
│   ├── matmul.jl       # 2D tiled MMA with column-major layout
│   └── softmax.jl      # 3 strategies: TMA, online, chunked
└── test/
    ├── runtests.jl     # Test runner
    ├── test_add.jl
    ├── test_matmul.jl
    └── test_softmax.jl
```
These three kernels were deliberately selected. The add kernel is the simplest way to exercise the full translation surface. Matmul adds loop structure, tensor cores, and the layout flip. Softmax introduces multipass algorithms with invariants that have to survive translation. Each kernel has tests that compare against a CPU reference with per-dtype tolerances, including boundary cases where dimensions don’t align to tile sizes.
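The per-dtype tolerance pattern those tests use can be sketched as follows. The tolerance values here are hypothetical placeholders (the actual values live in the repo's test files); the shape of the check, looser bounds for lower-precision types, is the point:

```python
import numpy as np

# Hypothetical per-dtype tolerances: lower precision gets looser bounds.
TOLERANCES = {
    np.float64: dict(rtol=1e-12, atol=1e-12),
    np.float32: dict(rtol=1e-5,  atol=1e-6),
    np.float16: dict(rtol=1e-2,  atol=1e-3),
}

def check_against_reference(result, reference):
    # Look up the tolerance by the result's element type, then compare.
    tol = TOLERANCES[result.dtype.type]
    return np.allclose(result, reference, **tol)

ref = np.linspace(0, 1, 8, dtype=np.float64)
assert check_against_reference(ref.astype(np.float32), ref)
```

Keying the tolerance on the dtype keeps one test body per kernel while still catching real regressions at every precision.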
Results and lessons learned
With the skill in place, the workflow for each kernel looked like this:
- Pre-flight: Scan the source for patterns that require special handling (`for` loops, `ct.mma`, `order=`, and so on).
- Convert: Apply the API mapping and critical rules.
- Validate: Run the static checker.
- Test: Run Julia tests against reference implementations.
- Fix: If something fails, use the debugging guide, fix, and rerun.
For a representative general matrix multiply (GEMM) conversion, the process took about 4 minutes and ~78K tokens on a frontier LLM with no manual intervention. Subsequent kernels were faster because the examples and rules were already in the repo.
Table 3 lists the pitfalls that caused bugs during the ports, all of which are now handled automatically by the skill.
| Pitfall | Symptom | Root cause |
| --- | --- | --- |
| `ct.bid(0)` left unchanged | Wrong tile loaded, silent corruption | 0-based versus 1-based indexing |
| `a * b` for element-wise multiply | Matrix multiply instead of element-wise | Julia `*` is matmul; `.*` is needed |
| Accumulator shape `(TM, TN)` | Wrong results in matmul | Column-major needs `(TN, TM)` |
| `ct.PaddingMode.ZERO` | `UndefVarError` | Julia uses PascalCase: `.Zero` |

*Table 3. Pitfalls that caused bugs during the ports*
The takeaway isn’t that AI wrote the code. It’s the ability to capture what was learned into something the model can reuse next time. A prompt can say, “Be careful with indexing.” A skill can say, “Here are the 17 specific things that go wrong, here’s how to check for them, and here’s a script that catches them automatically.”
Now, future ports can start from a repo that already has working examples, a tested API mapping, a static validator, and a debugging guide. Each one takes less effort than the last.
A broader takeaway is that the challenge in using AI for systems work isn’t code generation—it’s producing correct code in domains where the compiler won’t catch semantic mistakes. Encoding domain rules in version control, alongside the code they describe, is one way to address this.
Get started using agent skills to translate Python kernels to Julia
Use the following code to try the Julia subproject and the conversion skill:
```shell
cd TileGym

# Explore the Julia kernels
ls julia/kernels/    # add.jl, matmul.jl, softmax.jl

# Explore the conversion skill
ls .claude/skills/converting-cutile-to-julia/

# Install Julia dependencies (requires Julia 1.12+, CUDA 13.1+ driver)
julia --project=julia/ -e 'using Pkg; Pkg.instantiate()'

# Run the Julia kernel tests
julia --project=julia/ julia/test/runtests.jl
```
Requirements:
- Julia 1.12+ and NVIDIA CUDA 13.1+ driver
- NVIDIA Ampere, NVIDIA Ada, or NVIDIA Blackwell GPU (compute capability 8.x, 10.x, 11.x, 12.x)
- An LLM agent with file system access (for example, Claude Code). To use the conversion skill for your own kernels, point your LLM agent at `.claude/skills/converting-cutile-to-julia/SKILL.md`, provide a cuTile Python kernel as input, and start translating.