nvmath-python (Beta) is an open source library that gives Python applications high-performance pythonic access to the core mathematical operations implemented in the NVIDIA CUDA-X™ Math Libraries for accelerated library, framework, deep learning compiler, and application development.

Install From Pip:

Learn More

Other Links:

Build From SourceGitHub

Key Features

Truly Pythonic Accelerated Math

nvmath-python provides pythonic host and device APIs for using the highly optimized NVIDIA math libraries in Python applications, without the need for intermediary C or C++ bindings. This allows Python applications across deep learning, data processing, and more to leverage the power of NVIDIA hardware for computations out-of-the-box.

nvmath-python provides freedom of using either stateless (function-style) or stateful (class-style) APIs, depending on how much control of expressiveness a user wants. The library also provides integration with the standard logger library to offer visibility into computational details.

Interoperability With Python Ecosystem

nvmath-python works in conjunction with popular Python packages. This includes GPU-based packages like CuPy, PyTorch, and RAPIDS and CPU-based packages like NumPy, SciPy, and scikit-learn. You can keep using familiar data structures and workflows while benefiting from accelerated math through nvmath-python.

In combination with Python compilers, such as Numba, you can implement device callback functions to customize nvmath-python behavior.

High Performance

nvmath-python delivers performance comparable to the underlying C libraries. The APIs are designed for minimal overhead, allowing near-native performance for Python applications. Callbacks allow device kernels fusion without the extra overheads of a host.

nvmath-python Supported Operations

Linear Algebra

Advanced matrix-matrix multiplication allows performing fused kernel matrix-matrix multiplications with a bias, different scaling factors, and epilog functions.

The operation is available in a variety of precisions, both as a host API and as a device API.


The nvmath-python library offers a specialized matrix multiplication interface to perform scaled matrix-matrix multiplication with predefined epilogues as a single fused kernel. This can potentially lead to significantly better efficiency. On top of that, nvmath-python’s stateful APIs decompose such operations into planning, autotuning, and execution phases, which enables amortization of one-time preparatory costs across multiple executions.

nvmath-python linear algebra performance
Advanced matmul performance is shown on H100 PCIe for matrices A[m×n], B[n×k], bias[m], where m=65536, n=16384, k=8192. The data type of the operands and result is bfloat16, float32 type is used for compute.

Fast Fourier Transforms

Backed by the NVIDIA cuFFT library, nvmath-python provides a powerful set of APIs to perform N-dimensional discrete Fourier Transformations. These include forward and inverse transformations for complex-to-complex, complex-to-real, and real-to-complex cases. The operations are available in a variety of precisions, both as host and device APIs.

nvmath-python also provides a set of utility functions APIs to facilitate an implementation of user-specific caching for FFT plans.

nvmath-python FFT performance
Fast fourier transform performance is shown on H100 PCIe for FFTs of size 512 computed in 1048576 (220) batches using complex64 data type.

The user can provide callback functions written in Python to selected nvmath-python operations like FFT, which results in a fused kernel and can lead to significantly better performance. Advanced users may benefit from nvmath-python device APIs that enable fusing core mathematical operations like FFT and matrix multiplication into a single kernel, bringing performance close to the theoretical maximum.


Get started with nvmath-python

Download Now