Next Generation of FlashAttention

Jul 11, 2024

By Vijay Thakkar and Fred Oh

NVIDIA is excited to have collaborated with Colfax, Together.ai, Meta, and Princeton University on their recent work, which exploits the Hopper GPU architecture and Tensor Cores to accelerate key fused attention kernels using CUTLASS 3.

FlashAttention-3 incorporates key techniques that make it 1.5–2.0x faster than FlashAttention-2 with FP16, reaching up to 740 TFLOPS. With FP8, FlashAttention-3 reaches up to 1.2 PFLOPS, with 2.6x smaller error than baseline FP8 attention.

CUTLASS is an open-source CUDA library intended to help deep learning and HPC practitioners achieve speed-of-light performance on NVIDIA Tensor Core GPUs, both for custom algorithms and for research and production workloads.

For more information about the collaboration, see the FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision post and research paper.

About the Authors

About Vijay Thakkar
Vijay Thakkar is a senior compute architect at NVIDIA and the primary author of CUTLASS 3. In addition to his work on CUTLASS, he is involved in the development of Tensor Core architecture, PTX exposure, and programming model across the GPU architecture, compiler, and CUDA engineering teams.

About Fred Oh
Fred is a senior product marketing manager for CUDA, CUDA on WSL, and CUDA Python. Fred has a B.S. in Computer Science and Math from UC Davis. He began his career as a UNIX software engineer porting kernel services and device drivers to x86 architectures. He loves Star Wars, Star Trek and the NBA Warriors.

