NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations—loads, stores, and matrix multiply-accumulate—rather than manually coordinating threads, warps, and shared memory.
cuTile.jl brings the same tile-based approach to Julia, a dynamic programming language, so users can write custom GPU kernels without dropping down to NVIDIA CUDA C++. Custom kernels are often essential in Julia’s scientific computing ecosystem, which spans differential equations, probabilistic programming, and physics simulations.
cuTile Python has a growing library of optimized kernels for GPU acceleration. The ability to translate those kernels to cuTile.jl provides the Julia ecosystem with immediate access to battle-tested implementations, instead of rewriting each one from scratch.
This post covers cross-domain-specific language (DSL) GPU kernel translation, drawing on the experience of porting cuTile Python kernels to cuTile.jl (Julia). It shows how to:
- Translate GPU kernels between cuTile Python and cuTile.jl: Walk through a complete matrix multiplication example side-by-side.
- Avoid semantic traps that break naive translations: Indexing, broadcasting, memory layout, and loop forms all diverge between the two DSLs—and silent mismatches produce wrong results, not compiler errors.
- Build a repeatable, skill-driven AI workflow: The translation knowledge is packaged into an LLM skill in TileGym that produces validated Julia kernels in a single pass, systematizing a one-off porting effort.
Cross-DSL GPU kernel translation
Both cuTile Python and cuTile.jl frontends share the same tiled abstraction, making the translation largely algorithmic. However, the cumulative surface-level differences between the two languages are non-trivial, as shown in Table 1.
| Category | Python (cuTile) | Julia (cuTile.jl) |
| --- | --- | --- |
| Indexing | 0-based (`ct.bid(0)`) | 1-based (`ct.bid(1)`) |
| Broadcasting | Implicit (`a + b`) | Explicit dot syntax (`a .+ b`) |
| Memory layout | Row-major | Column-major |
| Kernel definition | `@ct.kernel` decorator | Plain `function ... end` |
| Constants | `param: ct.Constant[int]` in signature | `param::Int` in signature, `ct.Constant(val)` at launch |
| Type conversion | `tile.astype(ct.float32)` | `convert(ct.Tile{Float32}, tile)` |
| Matrix multiply | `ct.mma(a, b, acc=acc)` | `muladd(a, b, acc)` |

*Table 1. Surface-level differences between cuTile Python and cuTile.jl*
None of these translations are conceptually difficult, but miss one `ct.bid(0)` that should be `ct.bid(1)` and you get silent data corruption. Use `*` instead of `.*` for element-wise multiplication, and Julia silently performs a matrix multiply instead. These are the kinds of bugs that waste hours.
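The memory-layout row in Table 1 is the trap that is hardest to see in source code. The following NumPy sketch (an illustration only, using plain `numpy` rather than either DSL) shows how the same logical matrix has a different byte order in row-major and column-major storage, which is why tile indices and accumulator shapes flip in the Julia port:

```python
import numpy as np

# The same logical 2x3 matrix, stored two ways.
A = np.arange(6, dtype=np.float32).reshape(2, 3)  # row-major (C order), like Python
A_f = np.asfortranarray(A)                        # column-major (Fortran order), like Julia

# Logically identical...
assert (A == A_f).all()

# ...but laid out differently in memory (order="K" walks memory order):
print(A.ravel(order="K"))    # rows are contiguous
print(A_f.ravel(order="K"))  # columns are contiguous
```

Running this prints `[0. 1. 2. 3. 4. 5.]` for the row-major array and `[0. 3. 1. 4. 2. 5.]` for the column-major one: identical elements, different traversal order, and no error from either runtime.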
A shared abstraction with a finite set of recurring pitfalls is well-suited for an AI-assisted workflow—if the model is taught what to watch out for.
Translating cuTile Python to cuTile.jl
The process is best understood through actual code. The following examples are from TileGym, where the team ported a set of cuTile Python kernels to cuTile.jl and packaged them as a self-contained Julia subproject.
Matrix multiplication example
The running example uses matmul, which is complex enough to show key translation challenges. Beyond basic syntax differences, the translation must handle loop structure, TF32 tensor core conversion, and the shift from row-major to column-major layout.
cuTile Python:
```python
@ct.kernel
def matmul_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int],
                  tk: ct.Constant[int]):
    bid_m = ct.bid(0)
    bid_n = ct.bid(1)
    num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
    acc = ct.full((tm, tn), 0, dtype=ct.float32)
    dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype
    for k in range(num_k):
        a = ct.load(A, index=(bid_m, k), shape=(tm, tk),
                    padding_mode=ct.PaddingMode.ZERO)
        b = ct.load(B, index=(k, bid_n), shape=(tk, tn),
                    padding_mode=ct.PaddingMode.ZERO)
        a = a.astype(dtype)
        b = b.astype(dtype)
        acc = ct.mma(a, b, acc)
    acc = ct.astype(acc, C.dtype)
    ct.store(C, index=(bid_m, bid_n), tile=acc)
```
cuTile.jl (Julia):
```julia
function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    bid_m = ct.bid(1)
    bid_n = ct.bid(2)
    num_k = ct.num_tiles(A, 2, (tm, tk))
    acc = zeros(Float32, tm, tn)
    U = T === Float32 ? ct.TFloat32 : T
    for k in Int32(1):num_k
        a = ct.load(A; index=(bid_m, k), shape=(tm, tk), padding_mode=ct.PaddingMode.Zero)
        b = ct.load(B; index=(k, bid_n), shape=(tk, tn), padding_mode=ct.PaddingMode.Zero)
        a = convert(ct.Tile{U}, a)
        b = convert(ct.Tile{U}, b)
        acc = muladd(a, b, acc)
    end
    acc = convert(ct.Tile{T}, acc)
    ct.store(C; index=(bid_m, bid_n), tile=acc)
    return
end
```
Beyond the basic syntax changes, note the following:
- The layout flips: The Python row-major `A(M,K)` becomes column-major `A_jl(K,M)` in Julia. The accumulator, load indices, and store indices all change accordingly. Get the accumulator shape wrong, say `(TM, TN)` instead of `(TN, TM)`, and you get wrong results with no compiler warning.
- `ct.mma` → `muladd`: cuTile.jl maps matrix multiply-accumulate to the Julia standard `muladd`, and `ct.PaddingMode.ZERO` becomes `ct.PaddingMode.Zero` (PascalCase).
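The TF32 conversion inside the matmul loop is performed by the tensor-core hardware path, but its numerical effect can be sketched in plain Python/NumPy. The `to_tf32` helper below is hypothetical (not part of either DSL) and approximates the conversion by truncation; real hardware may round. TF32 keeps float32's sign and 8-bit exponent but only the top 10 of its 23 mantissa bits:

```python
import numpy as np

def to_tf32(x):
    # Approximate TF32: keep the sign, the 8-bit exponent, and the top 10
    # mantissa bits of float32 by masking off the low 13 mantissa bits.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

# 2^-10 still fits in 10 mantissa bits; 2^-12 is truncated away.
assert to_tf32(np.float32(1 + 2**-10)) == np.float32(1 + 2**-10)
assert to_tf32(np.float32(1 + 2**-12)) == np.float32(1.0)
```

This is why both DSLs only convert when the input is float32: lower-precision types already fit the tensor-core input formats.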
Softmax example
Softmax pushes things further. Three strategies were implemented in Julia to handle different tensor sizes: tensor memory accelerator (TMA) single-tile, online, and chunked. On top of the matmul patterns, the softmax function brings in broadcast dot syntax (`ct.exp(ct.sub(a, b))` → `exp.(a .- b)`), renamed reductions (`ct.max` → `maximum`, `ct.sum` → `sum`, with the axis shifted by +1), and element-wise `ct.maximum(a, b)` → `max.(a, b)`.
But the real challenge isn’t syntax—it’s maintaining correct running max/sum statistics through the translation.
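To make the invariant concrete, here is a CPU reference of the online strategy, sketched in Python/NumPy (an illustration, not the cuTile kernel itself): stream through the input in chunks while maintaining a running max `m` and a running sum `s` of `exp(x - m)`, rescaling `s` whenever `m` grows. A translation that drops the rescaling step still type-checks but computes the wrong normalizer.

```python
import numpy as np

def online_softmax(x, chunk=4):
    # One streaming pass: keep a running max m and a running sum s of exp(x - m).
    m, s = -np.inf, 0.0
    for start in range(0, len(x), chunk):
        c = x[start:start + chunk]
        m_new = max(m, float(c.max()))
        # Invariant repair: when the max grows, rescale the old sum first.
        s = s * np.exp(m - m_new) + float(np.exp(c - m_new).sum())
        m = m_new
    return np.exp(x - m) / s

x = np.array([10.0, -10.0, 0.0, 3.0, 7.0])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x, chunk=2), ref)
```

The tile versions carry the same `m` and `s` per row across tile loads; the rescaling line is the part that must survive translation exactly.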
Workflow generation with agent skills
The primary outcome of this project wasn’t the translated kernels—it was the skill built to produce them.

A skill, in this context, is a directory of structured knowledge that lives in the repository and is picked up by an LLM agent. This particular skill lives at `.claude/skills/converting-cutile-to-julia/`:
```
.claude/skills/converting-cutile-to-julia/
├── SKILL.md                    # Entry point: workflow overview, top pitfalls
├── translations/
│   └── workflow.md             # Step-by-step conversion with checklists
├── references/
│   ├── api-mapping.md          # Bidirectional Python↔Julia API table
│   ├── critical-rules.md       # 17 rules (indexing, broadcasting, loops, ...)
│   ├── debugging.md            # Error diagnosis for MethodError, IRError, etc.
│   └── testing.md              # Test patterns, tolerances per dtype
├── scripts/
│   └── validate_cutile_jl.py   # Static checker for common anti-patterns
└── examples/
    ├── 01_add/                 # Python→Julia for vector addition
    ├── 02_matmul/              # Python→Julia for matrix multiply
    └── 03_softmax/             # Python→Julia for softmax (3 strategies)
```
The `critical-rules.md` file alone captures 17 pitfalls the team encountered. Table 2 details the most common pitfalls and the associated fixes.

| # | Pitfall | Fix |
| --- | --- | --- |
| 1 | `max(a, b)` on tiles raises `IRError` | Use `max.(a, b)` (broadcast dot) |
| 2 | `ct.load` with `order` puts index positions in the wrong place | `order` remaps BOTH `shape` AND `index` |

*Table 2. The most common conversion pitfalls and their fixes*
There’s also a static validator script that catches issues like a leftover `ct.bid(0)`, Python-style `for` loops, and Python type names before anything runs on the GPU. With all of this in place, the model doesn’t have to rediscover the conversion rules each time: it reads the skill, follows the checklist, and applies the rules.
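The idea behind such a validator can be sketched in a few lines. The patterns and messages below are hypothetical, not the contents of the actual `validate_cutile_jl.py`; the point is that a handful of regexes over the translated Julia source already catches the silent-corruption class of bugs:

```python
import re

# Hypothetical subset of the anti-pattern checks (illustration only).
CHECKS = [
    (r"ct\.bid\(0\)",      "0-based bid: Julia uses ct.bid(1)"),
    (r"PaddingMode\.ZERO", "Python casing: Julia uses PaddingMode.Zero"),
    (r"\brange\(",         "Python range(): use a Julia range like 1:n"),
    (r"\.astype\(",        "Python astype: use convert(ct.Tile{T}, tile)"),
]

def lint_julia_kernel(src: str):
    # Return (line number, message) for every check that fires.
    issues = []
    for lineno, line in enumerate(src.splitlines(), start=1):
        for pattern, msg in CHECKS:
            if re.search(pattern, line):
                issues.append((lineno, msg))
    return issues

bad = "bid_m = ct.bid(0)\na = a.astype(ct.float32)"
for lineno, msg in lint_julia_kernel(bad):
    print(f"line {lineno}: {msg}")
```

A checker like this is cheap enough to run on every conversion attempt, long before a GPU test ever executes.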
The AI agent skill in TileGym
The concrete deliverable is a Julia subproject under julia/ in TileGym, which is open source:
```
julia/
├── Project.toml        # Dependencies: CUDA.jl, cuTile.jl, NNlib.jl, Test
├── kernels/
│   ├── add.jl          # 1D element-wise with alpha scaling
│   ├── matmul.jl       # 2D tiled MMA with column-major layout
│   └── softmax.jl      # 3 strategies: TMA, online, chunked
└── test/
    ├── runtests.jl     # Test runner
    ├── test_add.jl
    ├── test_matmul.jl
    └── test_softmax.jl
```
These three kernels were deliberately selected. The add kernel is the simplest way to exercise the full translation surface. Matmul adds loop structure, tensor cores, and the layout flip. Softmax introduces multipass algorithms with invariants that have to survive translation. Each kernel has tests that compare against a CPU reference with per-dtype tolerances, including boundary cases where dimensions don’t align to tile sizes.
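The per-dtype tolerance pattern those tests use can be sketched as follows. The tolerance values here are hypothetical placeholders (the actual values live in the repo's test files); the shape of the check, looser bounds for lower-precision types, is the point:

```python
import numpy as np

# Hypothetical per-dtype tolerances: lower precision gets looser bounds.
TOLERANCES = {
    np.float64: dict(rtol=1e-12, atol=1e-12),
    np.float32: dict(rtol=1e-5,  atol=1e-6),
    np.float16: dict(rtol=1e-2,  atol=1e-3),
}

def check_against_reference(result, reference):
    # Look up the tolerance by the result's element type, then compare.
    tol = TOLERANCES[result.dtype.type]
    return np.allclose(result, reference, **tol)

ref = np.linspace(0, 1, 8, dtype=np.float64)
assert check_against_reference(ref.astype(np.float32), ref)
```

Keying the tolerance on the dtype keeps one test body per kernel while still catching real regressions at every precision.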
Results and lessons learned
With the skill in place, the workflow for each kernel looked like this:
- Pre-flight: Scan the source for patterns that require special handling (`for` loops, `ct.mma`, `order=`, and so on).
- Convert: Apply the API mapping and critical rules.
- Validate: Run the static checker.
- Test: Run Julia tests against reference implementations.
- Fix: If something fails, use the debugging guide, fix, and rerun.
For a representative general matrix multiply (GEMM) conversion, the process took about 4 minutes and ~78K tokens on a frontier LLM with no manual intervention. Subsequent kernels were faster because the examples and rules were already in the repo.
Table 3 lists the pitfalls that caused bugs during the ports, all of which are now handled automatically by the skill.
| Pitfall | Symptom | Root cause |
| --- | --- | --- |
| `ct.bid(0)` left unchanged | Wrong tile loaded, silent corruption | 0-based versus 1-based indexing |
| `a * b` for element-wise multiply | Matrix multiply instead of element-wise | Julia `*` is matmul; `.*` is needed |
| Accumulator shape `(TM, TN)` | Wrong results in matmul | Column-major needs `(TN, TM)` |
| `ct.PaddingMode.ZERO` | `UndefVarError` | Julia uses PascalCase: `.Zero` |

*Table 3. Pitfalls that caused bugs during the ports*
The takeaway isn’t that AI wrote the code. It’s the ability to capture what was learned into something the model can reuse next time. A prompt can say, “Be careful with indexing.” A skill can say, “Here are the 17 specific things that go wrong, here’s how to check for them, and here’s a script that catches them automatically.”
Now, future ports can start from a repo that already has working examples, a tested API mapping, a static validator, and a debugging guide. Each one takes less effort than the last.
A broader takeaway is that the challenge in using AI for systems work isn’t code generation—it’s producing correct code in domains where the compiler won’t catch semantic mistakes. Encoding domain rules in version control, alongside the code they describe, is one way to address this.
Get started using agent skills to translate Python kernels to Julia
Use the following code to try the Julia subproject and the conversion skill:
```shell
cd TileGym

# Explore the Julia kernels
ls julia/kernels/    # add.jl, matmul.jl, softmax.jl

# Explore the conversion skill
ls .claude/skills/converting-cutile-to-julia/

# Install Julia dependencies (requires Julia 1.12+, CUDA 13.1+ driver)
julia --project=julia/ -e 'using Pkg; Pkg.instantiate()'

# Run the Julia kernel tests
julia --project=julia/ julia/test/runtests.jl
```
Requirements:
- Julia 1.12+ and NVIDIA CUDA 13.1+ driver
- NVIDIA Ampere, NVIDIA Ada, or NVIDIA Blackwell GPU (compute capability 8.x, 10.x, 11.x, 12.x)
- An LLM agent with file system access (for example, Claude Code). To use the conversion skill for your own kernels, point your LLM agent at `.claude/skills/converting-cutile-to-julia/SKILL.md`, provide a cuTile Python kernel as input, and start translating.