Computer-aided engineering (CAE) forms the backbone of modern product development across industries, from designing safer aircraft to optimizing renewable energy systems. Computational speed and accuracy are therefore critical to engineering decisions that demand rapid prototyping and innovation. However, the barrier to entry for the current CAE ecosystem and its solvers has been high.
Traditional CAE applications have relied on low-level languages like C++ and Fortran to meet their demanding throughput and scalability requirements. Python, meanwhile, has emerged as the de facto language for AI/ML development, but it historically hasn't been used for large-scale CFD applications because of performance limitations tied to its high-level, interpreted nature. The rise of physics-based machine learning has created growing demand for Python-based CAE solvers that integrate seamlessly with the AI/ML ecosystem while maintaining the performance of low-level languages.
Autodesk Research developed the Accelerated Lattice Boltzmann (XLB) library to bridge this gap between CAE solvers and the AI/ML ecosystem. XLB is a performant, open source computational fluid dynamics (CFD) solver based on the Lattice Boltzmann Method (LBM), and the Autodesk Research team uses it for ongoing research exploration and experimentation. Its Python-native implementation makes it highly accessible to developers, and its differentiable architecture enables natural integration with modern AI-physics modeling frameworks, a rapidly growing subfield within CAE and scientific computing.
By leveraging NVIDIA Warp in conjunction with the GH200 Grace Hopper Superchip, XLB achieved an ~8x speedup compared to its GPU-accelerated JAX backend on specific hardware configurations and benchmark cases defined by the Autodesk Research team. Using an out-of-core computation strategy, Autodesk Research also scaled XLB's Warp backend solver to roughly 50 billion computational cells.
NVIDIA Warp, an open source Python framework for high-performance simulation and spatial computing, combines Python’s accessibility with CUDA’s high-performance computing (HPC) capabilities. At the same time, the GH200 Superchip addresses a core CAE requirement: running high-fidelity simulations at maximum throughput and scale.
XLB by Autodesk Research: scaling CFD purely in Python
While Python-implemented CFD codes are traditionally viewed as performance-compromised, XLB shows that Warp can deliver transformative CFD workflow performance in practice. Figure 1 below compares performance, measured in million lattice updates per second (MLUPS), between the OpenCL-based FluidX3D solver and XLB's Warp backend for a 512³ lid-driven cavity flow simulation. These numbers are based on publicly available data and internal benchmarking conducted by the Autodesk Research team. The results show that the Warp-accelerated XLB Python code achieves approximately 95% of the performance of the C++ and OpenCL-based FluidX3D solver on this benchmark.
At the same time, Warp offers excellent readability and rapid prototyping capabilities through its Python interface, in contrast to the C++ and OpenCL-based backend of the FluidX3D code.
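For reference, MLUPS counts how many million lattice cells a solver updates per second of wall-clock time. A minimal sketch of the calculation, using illustrative numbers rather than measurements from either solver:

```python
# MLUPS = (lattice cells * timesteps) / (wall-clock seconds * 1e6).
# Illustrative values only; not measurements from the XLB or FluidX3D runs.
resolution = 512                # 512^3 lid-driven cavity
cells = resolution ** 3         # lattice cells updated each timestep
steps = 1000                    # timesteps in the measured window
elapsed_s = 10.0                # hypothetical wall-clock time in seconds

mlups = cells * steps / (elapsed_s * 1e6)
print(f"{mlups:,.0f} MLUPS")    # ~13,422 MLUPS for these numbers
```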

This performance equivalence addresses a longstanding challenge in CFD research. Traditionally, researchers faced a tradeoff between development productivity and computational performance, choosing between Python's accessibility and the efficiency of optimized implementations in languages like C++ or Fortran. By utilizing Warp, XLB enables researchers to leverage Python's ecosystem of numerical libraries, visualization tools, and machine-learning frameworks while maintaining high-throughput performance.

Warp can also take advantage of the GH200 Grace Hopper Superchip architecture. In a collaboration between Autodesk Research and NVIDIA, the XLB team scaled XLB's Warp backend solver to a multi-node configuration capable of CFD simulations with approximately 50 billion computational elements.
The team implemented an out-of-core computation strategy in which the computational domain and associated flow variables reside primarily in CPU memory and are systematically transferred to the GPUs for processing. This was enabled by the GH200's NVLink-C2C interconnect, whose 900 GB/s of CPU-GPU bandwidth makes out-of-core strategies practical by streaming data rapidly as computational tiles are swapped in and out of GPU memory. NVLink-C2C memory coherency further supports seamless data transfers, eliminating traditional CPU-GPU bottlenecks in large-scale simulations.
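The pattern itself can be expressed compactly in Warp. The following is a simplified sketch of the general idea only, not XLB's actual implementation: the full domain lives in pinned host memory, and fixed-size tiles are streamed to the GPU, updated, and streamed back. The kernel, tile size, and array names are all illustrative.

```python
import numpy as np
import warp as wp

wp.init()

TILE = 1_000_000   # cells per GPU-resident tile (illustrative size)
NUM_TILES = 8      # pretend the domain is 8x larger than GPU memory allows

@wp.kernel
def relax(f: wp.array(dtype=float), omega: float):
    # Placeholder update standing in for an LBM collision step.
    i = wp.tid()
    f[i] = f[i] - omega * (f[i] - 1.0)

# The full domain resides in pinned CPU memory; only one tile
# at a time is resident on the GPU.
domain = wp.array(np.ones(NUM_TILES * TILE, dtype=np.float32),
                  dtype=float, device="cpu", pinned=True)
tile = wp.zeros(TILE, dtype=float, device="cuda")

for t in range(NUM_TILES):
    # Stream a tile in over the CPU-GPU link, process it, stream it back.
    wp.copy(tile, domain, src_offset=t * TILE, count=TILE)
    wp.launch(relax, dim=TILE, inputs=[tile, 0.5], device="cuda")
    wp.copy(domain, tile, dest_offset=t * TILE, count=TILE)

wp.synchronize()
```

A production implementation would also overlap transfers with kernel execution using multiple CUDA streams and double-buffered tiles, which is precisely where the GH200's coherent 900 GB/s NVLink-C2C link pays off.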
Figure 2 above shows near-linear scaling of XLB's Warp backend solver with increasing node count, for both maximum simulation size (left) and computational throughput (right). An eight-node GH200 cluster enabled simulations with approximately 50 billion lattice cells while achieving an ~8x speedup over a single-node GH200 system.
This achievement marks a turning point where Python-native CFD is no longer a compromise but an innovation advantage.
“XLB, powered by NVIDIA Warp, helps researchers rapidly prototype and test new ideas without being slowed down by performance bottlenecks,” said Mehdi Ataei, principal AI research scientist at Autodesk Research. “This agility has led to the development of multiple research prototypes we are in the process of publishing.”
NVIDIA Warp: write solvers at warp speed
In this section, we explore some key features that make Warp uniquely suited for developing scalable CAE simulation tools.

NVIDIA Warp provides a powerful bridge between CUDA and Python for simulation developers (Figure 3). It enables developers to write GPU kernels directly in Python that are just-in-time (JIT)-compiled to native CUDA code. Warp offers rich simulation capabilities, such as warp.fem for finite element analysis, distinguishing it from existing Python-CUDA libraries like Numba. Importantly, Warp kernels are differentiable by design, enabling seamless integration with deep learning frameworks like PyTorch and JAX. Warp also maintains interoperability with numerous existing frameworks, including NumPy, CuPy, and JAX, allowing users to leverage the strengths of each respective framework.
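As a flavor of the programming model, consider the minimal sketch below. The SAXPY-style kernel is a hypothetical example, not code from XLB: a kernel written in Python is JIT-compiled to CUDA, and recording its launch on a tape makes it differentiable.

```python
import numpy as np
import warp as wp

wp.init()

@wp.kernel
def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float),
          a: float, out: wp.array(dtype=float)):
    # One thread per element; Warp JIT-compiles this to a native CUDA kernel.
    i = wp.tid()
    out[i] = a * x[i] + y[i]

n = 1024
x = wp.array(np.linspace(0.0, 1.0, n, dtype=np.float32),
             dtype=float, device="cuda", requires_grad=True)
y = wp.zeros(n, dtype=float, device="cuda")
out = wp.zeros(n, dtype=float, device="cuda", requires_grad=True)

# Record the launch on a tape so gradients can flow back through the kernel.
tape = wp.Tape()
with tape:
    wp.launch(saxpy, dim=n, inputs=[x, y, 2.0, out], device="cuda")

# Seed the output adjoint and backpropagate; d(out)/d(x) = a = 2.0.
tape.backward(grads={out: wp.ones(n, dtype=float, device="cuda")})
print(x.grad.numpy()[:4])  # [2. 2. 2. 2.]
```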

Figure 4 above compares computational throughput (in MLUPS) and memory consumption between XLB's Warp and JAX backends. For the hardware configurations employed by the Autodesk Research team and the lid-driven cavity flow benchmark, Warp outperformed JAX in both throughput and memory usage. XLB's Warp backend solver delivered an ~8x speedup over the JAX backend solver on a single A100 GPU while achieving two to three times better memory efficiency on the same GPU.
The substantial performance improvement (the left plot in Figure 4) stems from Warp's simulation-optimized design and explicit kernel programming model. Warp allows developers to write domain-specific CUDA kernels and device functions directly in Python. This explicit approach eliminates the overhead of generic, high-level array operations and makes performance more predictable for CAE simulations. Additionally, Warp's JIT compiler performs aggressive optimizations, including loop unrolling and branch elimination, further boosting execution speed.
The memory efficiency gains (the right plot in Figure 4) reflect Warp’s explicit memory management philosophy. Warp requires developers to pre-allocate input and output arrays, eliminating hidden memory allocations and intermediate buffers. This hands-on approach, while requiring developer attention, results in a leaner memory footprint that scales predictably with problem size.
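In practice, this means allocating every buffer once, up front, and reusing it across timesteps. A hypothetical double-buffered stepping loop, again illustrative rather than XLB's internals, might look like this:

```python
import warp as wp

wp.init()

@wp.kernel
def step(f_src: wp.array(dtype=float), f_dst: wp.array(dtype=float)):
    # Placeholder update: read from one buffer, write to the other.
    i = wp.tid()
    f_dst[i] = 0.5 * (f_src[i] + 1.0)

n = 256 ** 3  # ~16.8M cells

# All memory is allocated once, before the loop; the footprint is fixed
# and scales predictably with n. No hidden intermediates are created.
f_src = wp.zeros(n, dtype=float, device="cuda")
f_dst = wp.zeros(n, dtype=float, device="cuda")

for _ in range(100):
    wp.launch(step, dim=n, inputs=[f_src, f_dst], device="cuda")
    f_src, f_dst = f_dst, f_src  # swap references instead of allocating

wp.synchronize()
```

Swapping the two buffer references each step keeps GPU memory usage constant for the entire run, the kind of predictability the memory results in Figure 4 reflect.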
Bridging the gap between performance and productivity
The historic tradeoff between development productivity and raw computational power is no longer a necessary compromise. The XLB library—developed by Autodesk Research and accelerated by NVIDIA Warp—exemplifies this new paradigm. It proves that a Python-native framework can deliver performance on par with highly optimized, low-level code while retaining the accessibility and rapid development cycle of the Python ecosystem.
Explore more about XLB & NVIDIA Warp
Get started with Autodesk Research XLB and NVIDIA Warp for your CFD projects using the links below: