Posts by Tejash Shah
Agentic AI / Generative AI
Feb 03, 2026
Accelerating Long-Context Model Training in JAX and XLA
Large language models (LLMs) are rapidly expanding their context windows, with recent models supporting sequences of 128K tokens, 256K tokens, and beyond....
9 MIN READ
Developer Tools & Techniques
Nov 13, 2025
Achieve CUTLASS C++ Performance with Python APIs Using CuTe DSL
CuTe, a core component of CUTLASS 3.x, provides a unified algebra for describing data layouts and thread mappings, and abstracts complex memory access patterns...
9 MIN READ
Data Center / Cloud
Jul 18, 2025
Optimizing for Low-Latency Communication in Inference Workloads with JAX and XLA
Running inference with large language models (LLMs) in production requires meeting stringent latency constraints. A critical stage in the process is LLM decode,...
6 MIN READ
Developer Tools & Techniques
Jul 16, 2025
CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design
GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions, and...
12 MIN READ
Agentic AI / Generative AI
Jul 16, 2025
CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels
In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models...
12 MIN READ