Simplify Sparse Deep Learning with Universal Sparse Tensor in nvmath-python

In a previous post, we introduced the Universal Sparse Tensor (UST), enabling developers to decouple a tensor’s sparsity from its memory layout for greater flexibility and performance. We’re excited to announce the integration of the UST into nvmath-python v0.9.0 to accelerate sparse scientific and deep learning applications.

This post provides a walkthrough of key UST features, implementation details, and a performance overview, including:

  • Zero-cost interoperability: Data-movement-free conversion with PyTorch, SciPy, and CuPy.
  • Custom formats: Define novel sparsity schemes.
  • Polymorphic operations: Sparsity-agnostic functions automatically use optimized kernels or generate custom sparse code—eliminating the need for manual coding of new formats.
  • PyTorch injection: Easily inject UST performance benefits into existing PyTorch models.
  • Transparent caching: Avoid JIT/LTO recompilation and replanning—amortizing overhead over subsequent repeated execution of the same operation.

Tensor format DSL

The UST describes common (e.g., COO, CSR, CSC, and BSR) and less-common sparse tensor storage formats using a simple domain-specific language (DSL). For example, a CSC format is described as follows.

(i, j) -> (j : dense, i : compressed)     	# CSC

Integrating this DSL within a programming language requires design decisions with various trade-offs. 

In the C++ library MatX, for instance, the DSL is implemented purely through types and templates, which enables interesting compile-time optimizations for specific storage formats. In the following dispatching method foo(), for example, format-specific tests are evaluated at compile time, which means that only the relevant branch is compiled and shipped, trading slightly higher compile times for smaller binary sizes.

template <typename TensorType>
void foo(TensorType a) {
  if constexpr (TensorType::Format::isCOO()) {
    // ... dispatch to COO code ...
  } else if constexpr (TensorType::Format::isCSR()) {
    // ... dispatch to CSR code ...
  } else {
    // ...
  }
}

In nvmath-python, we adopt a more Pythonic way of presenting the DSL, trading some performance when inspecting format objects for runtime flexibility. The CSC format shown above, for instance, is expressed as follows.

i, j = ( Dimension(dimension_name="i"), Dimension(dimension_name="j") )
CSC = TensorFormat( [i, j], {j: LevelFormat.DENSE, i: LevelFormat.COMPRESSED} )

A major advantage of these objects is that everything can be constructed dynamically at runtime (including parsing from strings). Performing format-specific tasks, however, requires inspecting the actual contents at runtime. Since such decisions generally happen outside performance-critical paths, trading some performance for generality is an acceptable choice.
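To make this trade-off concrete, the following sketch mirrors the C++ `if constexpr` dispatch with a runtime check. The classes and the `is_csr_like()` helper are minimal stand-ins for illustration, not the actual nvmath-python objects.

```python
from enum import Enum

# Minimal stand-in for nvmath-python's LevelFormat, for illustration only.
class LevelFormat(Enum):
    DENSE = 1
    COMPRESSED = 2
    SINGLETON = 3

def is_csr_like(dims, levels):
    """Runtime counterpart of the compile-time C++ dispatch:
    inspect the format contents instead of the type."""
    return (len(dims) == 2
            and levels[dims[0]] == LevelFormat.DENSE
            and levels[dims[1]] == LevelFormat.COMPRESSED)

dims = ("i", "j")
csr_levels = {"i": LevelFormat.DENSE, "j": LevelFormat.COMPRESSED}
csc_levels = {"i": LevelFormat.COMPRESSED, "j": LevelFormat.DENSE}

print(is_csr_like(dims, csr_levels))  # True
print(is_csr_like(dims, csc_levels))  # False
```

The check happens once, when an operation is planned, so its cost is negligible compared to the kernels it selects.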

Interoperability with PyTorch, SciPy, CuPy

The nvmath-python UST implementation provides interoperability with tensors of PyTorch, SciPy, CuPy, and NumPy. The conversion is mostly zero-cost, which means that converting dense or sparse formats like COO, CSR, CSC, BSR, BSC, and DIA to a UST object or back is done without data movement or copying. Instead, the UST object references the storage buffers of the original data structure. 

For example, the following conversion from a SciPy sparse matrix in COO format to an nvmath-python UST object uses the row, column, and value arrays to form the positions, coordinates, and values arrays of the UST.

row = np.array([0, 0, 1, 1, 2, 3], dtype=np.int32)
col = np.array([0, 1, 1, 3, 2, 3], dtype=np.int32)
val = np.array([1, 2, 3, 4, 5, 6], dtype=np.float32)
coo = sps.coo_array((val, (row, col)), shape=(4, 8))  # SciPy
ust = Tensor.from_package(coo)

This conversion automatically generates the DSL for COO, illustrated below. Converting other formats will result in the corresponding DSL for those formats.

 TensorFormat( [i, j] -> {i: (<LevelFormat.COMPRESSED>, <LevelProperty.NONUNIQUE>),
                            j: <LevelFormat.SINGLETON>} )

The UST object is converted back to a sparse tensor in the originating package as follows. This too is a zero-cost conversion by simply passing references to constituent buffers.

coo = ust.to_package()  # this yields a SciPy coo format again

Less common formats

Developers can also define their own novel sparsity format using the DSL. For example, the following code converts a dense PyTorch tensor into a 2-bit delta-compressed format (similar to CSR, but storing differences between consecutive coordinates rather than absolute column indices), which is not natively supported in PyTorch.

A = torch.tensor([[1, 0, 0, 0, 0, 0, 0, 2],
                  [0, 0, 3, 4, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 0, 5]], dtype=torch.float64)
U = Tensor.from_package(A)  # dense UST
delta_format = TensorFormat([i, j], {i: LevelFormat.DENSE,
                                     j: (LevelFormat.DELTA, 2)})  # use 2-bit
S = U.convert(tensor_format=delta_format)  # sparse UST

This yields the sparse UST storage shown in Figure 1, where each coordinate uses a distance to the next stored element rather than a direct column index. Note that padding was required (the zeros shown in red) for cases where that distance would exceed 3 (maximum using 2 bits).
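To illustrate the padding rule, here is one plausible (hypothetical) per-row encoder for such a delta format: each stored entry records the column distance from the previous entry, and gaps larger than 3 (the 2-bit maximum) are bridged with explicit zero padding. The function name and exact convention are illustrative, not the nvmath-python implementation.

```python
def delta_encode_row(cols, vals, bits=2):
    """Encode one row's (column, value) pairs as (delta, value) pairs,
    inserting zero padding when a gap exceeds the representable maximum."""
    max_step = (1 << bits) - 1  # 3 for 2 bits
    deltas, stored = [], []
    prev = 0
    for c, v in zip(cols, vals):
        gap = c - prev
        while gap > max_step:      # bridge the gap with explicit zeros
            deltas.append(max_step)
            stored.append(0)
            prev += max_step
            gap -= max_step
        deltas.append(gap)
        stored.append(v)
        prev = c
    return deltas, stored

# Row 0 of the example matrix A: nonzeros 1 and 2 at columns 0 and 7.
print(delta_encode_row([0, 7], [1, 2]))  # ([0, 3, 3, 1], [1, 0, 0, 2])
```

The gap of 7 between columns 0 and 7 cannot be expressed in 2 bits, so two padding zeros are stored, matching the red zeros in Figure 1.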

Printing and drawing utilities

Various utilities are provided to help users debug, test, and become familiar with the nvmath-python UST implementation. First, the dunder method __repr__ provides an unambiguous string representation that can be used to print tensor storage contents.

A = torch.arange(4 * 5).reshape(4, 5).cuda().to_sparse_csr()
U = Tensor.from_package(A)
print(U)

This snippet yields the following output, which shows the types used for the values (int64) and for the positions and coordinate arrays in the UST storage (also int64), the device (cuda), the dimension and level extents, the stored data, a summary of the number of bytes used for storage (5×8 + 19×8 + 19×8 = 344), and the sparsity ((1 − 19/20) × 100%).

---- Sparse Tensor<VAL=int64,POS=int64,CRD=int64,DIM=2,LVL=2>
format   : [i, j] -> (i: <LevelFormat.DENSE>, j: <LevelFormat.COMPRESSED>)
device   : cuda
dim      : [4, 5]
lvl      : [4, 5]
nse      : 19
pos[1]   : [0, 4, 9, 14, 19] #5
crd[1]   : [1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4] #19
values   : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] #19
data     : 344 bytes
sparsity : 5.00%
----
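The storage summary in the printout follows directly from the CSR-like layout: one pos entry per row plus one, and one coordinate and one value per stored element. A quick arithmetic check:

```python
# Reproduce the storage-summary arithmetic from the printout above.
nse, nrows, ncols, itemsize = 19, 4, 5, 8  # int64 values, pos, and crd

pos_bytes = (nrows + 1) * itemsize  # pos[1] has 5 entries
crd_bytes = nse * itemsize          # one column index per stored element
val_bytes = nse * itemsize          # one value per stored element
total = pos_bytes + crd_bytes + val_bytes

sparsity = (1 - nse / (nrows * ncols)) * 100

print(total)               # 344
print(f"{sparsity:.2f}%")  # 5.00%
```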

The methods draw_storage() and draw() generate an image of the tensor storage (any dimensionality) or tensor contents (supported up to 3D). The former method was used to generate Figure 1. Calling the latter method on the previous UST example yields the image shown in Figure 2, where the grey color denotes stored nonzero elements (almost all elements in this case). 

Both methods are intended for relatively small examples, since they don't scale to very large sizes. For larger tensors, the draw_raw() method (2D or 3D) visualizes just the nonzero structure of the tensor, as illustrated in Figure 3 for a matrix taken from the SuiteSparse Matrix Collection.

Polymorphic operations

The tensor format DSL of the UST enables developers to focus on tensor sparsity. Defining the tensors' storage is only part of the requirement, however; developers must also specify the operations to be performed.

The nvmath-python UST implementation enables developers to define computations using sparsity-agnostic operations like matmul() and solve() (unlike alternatives such as tensor index notation). This approach is powerful because the system inspects the operand formats to determine the execution path: either dispatching to an optimized, handwritten library or kernel for common formats or relying on automatic sparse code generation when no optimized solution exists.
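The dispatch decision can be sketched as a lookup over operand formats, falling back to code generation when no handwritten kernel is registered. All names below are hypothetical stand-ins, not the actual nvmath-python internals.

```python
# Hypothetical registry of handwritten/library kernels, keyed by operand formats.
OPTIMIZED_KERNELS = {
    ("CSR", "DENSE"): "cusparse_spmm",
    ("COO", "DENSE"): "cusparse_spmm",
}

def plan_matmul(fmt_a, fmt_b):
    """Pick an execution path by inspecting operand formats:
    dispatch to an optimized kernel if one is registered,
    otherwise fall back to JIT/LTO sparse code generation."""
    key = (fmt_a, fmt_b)
    if key in OPTIMIZED_KERNELS:
        return ("dispatch", OPTIMIZED_KERNELS[key])
    return ("codegen", f"jit_lto_kernel_{fmt_a.lower()}_{fmt_b.lower()}")

print(plan_matmul("CSR", "DENSE"))    # ('dispatch', 'cusparse_spmm')
print(plan_matmul("DELTA", "DENSE"))  # ('codegen', 'jit_lto_kernel_delta_dense')
```

A novel format like the delta-compressed one simply takes the codegen path until (or unless) a specialized kernel is registered for it.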

This design enables the use of novel sparse formats without explicit manual coding, while still achieving high performance by using existing library solutions for standard formats. The syntax for planning and performing a matmul() operation is shown below. Following nvmath-python conventions, the operation can be split into a planning phase (deciding between dispatch or JIT LTO code generation, cached for future re-use, as well as setting up all required parameters) and an execution phase (performing the actual operation).

# c := a @ b + c
with Matmul(a, b, c, beta=1.0) as mm:
    mm.plan()
    mm.execute()

This approach amortizes the initial planning cost over many executions of the same matmul() invocation, which really pays off in, for instance, iterative sparse solvers, or (pruned) linear layers in deep learning.

UST injection into PyTorch

The interoperability between PyTorch sparse tensors and nvmath-python enables developers to experiment with novel sparse storage schemes inside PyTorch models. Although this is powerful, AI researchers are reluctant to rewrite their model code to replace the GEMM operations performed inside linear layers with the UST polymorphic operations shown above, even if doing so yields much better performance.

The nvmath-python implementation of the UST provides a way to “inject” the UST into existing models without rewriting the original model code. This is achieved by wrapping the UST into a subclass of torch.Tensor and providing wrappers to polymorphic operations of the UST. By providing wrappers for common PyTorch tensor operations (e.g., dot, mv, mm, and linear), researchers can use the UST without learning much of the internal workings. 

These wrappers also add another caching layer (on top of the already cached JIT/LTO lookup) for reuse of planned Matmul instances across similar operations, reducing subsequent runtime to just the execution phase.
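The wrapper-level cache can be pictured as a dictionary keyed by an operand signature, so that repeated calls with matching shapes, formats, and dtypes skip straight to execution. This is an illustrative sketch with hypothetical names, not the actual caching code.

```python
# Illustrative plan cache: planning runs once per distinct operand signature.
_plan_cache = {}
plan_count = 0

def cached_plan(shape_a, shape_b, fmt, dtype):
    """Return a cached plan for this operand signature, planning on first use."""
    global plan_count
    key = (shape_a, shape_b, fmt, dtype)
    if key not in _plan_cache:
        plan_count += 1                    # expensive planning happens here, once
        _plan_cache[key] = f"plan#{plan_count}"
    return _plan_cache[key]

cached_plan((8192, 8192), (8192, 1), "CSR", "float32")
cached_plan((8192, 8192), (8192, 1), "CSR", "float32")  # cache hit, no replanning
print(plan_count)  # 1
```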

A reformat_model() method is also available; it iterates through the weights of all linear layers and lets a user-defined function inspect, potentially prune, and sparsify (or skip) each weight matrix.

weights = torchvision.models.get_model_weights(model_name).DEFAULT
model = torchvision.models.get_model(model_name, weights=weights)
model.to(device)
model.eval()
...
reformat_model(model, func=reformat)
...                                       
with torch.inference_mode():
    prediction = model(batch)

The general form of the user-provided reformatting method is shown below. Returning None signals leaving the weight matrix unchanged. After this, inference will use the UST for all linear layer applications that have been modified.

def reformat(weight):
    ... inspect (possibly even prune weight), then decide on storage ...
    if ...:
        return TorchUST.from_torch(weight.to_sparse_csr())
    return None
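A callback might, for example, apply magnitude pruning and only sparsify when the resulting sparsity is high enough to pay off. The numpy sketch below shows such a decision policy; the threshold values are arbitrary, and a real callback would return a TorchUST-wrapped sparse tensor rather than a label.

```python
import numpy as np

def prune_decision(weight, threshold=0.1, min_sparsity=0.5):
    """Zero out small-magnitude weights; sparsify only if enough is zero.
    Hypothetical policy for illustration, not part of nvmath-python."""
    pruned = np.where(np.abs(weight) < threshold, 0.0, weight)
    sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size
    return ("sparsify", pruned) if sparsity >= min_sparsity else ("keep", None)

w = np.array([[0.9,  0.02, 0.0,  0.01],
              [0.03, 0.0,  0.8,  0.0 ],
              [0.0,  0.04, 0.0,  0.7 ],
              [0.05, 0.0,  0.0,  0.0 ]])

decision, pruned = prune_decision(w)
print(decision)  # sparsify (13 of 16 entries become zero)
```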

Performance examples

First, we compare SpMV performance using CuPy's @-operation across natively supported formats (COO, CSR, and DIA), along with PyTorch mv() for COO and CSR, against UST's matmul() on the 1,489,752 × 1,489,752 matrix atmosmodl from the Matrix Market. For the latter, we measure the execute() runtime only, which is a fair comparison for applications with repeated multiplication (since neither CuPy nor PyTorch provides a planning setup). The resulting runtimes are shown in Figure 4 (using a logarithmic scale for the vertical axis).

UST achieved speedups ranging from 1.1× to 444× in this scenario. The poor performance of the CuPy and PyTorch COO implementations stands out. For the CSR format, all versions use the cuSPARSE implementation, with UST benefiting from reusing the planning phase. The speedup seen with the DIA format is due to UST's use of a specialized kernel, in contrast to CuPy's approach of first converting to CSR (and PyTorch's lack of DIA support).

The second experiment is based on the publication MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity, which provides an efficient implementation of the SpMV operation, an important step for single-token inference in deep learning. The authors introduce the MACKO format, which is essentially the delta-compressed format of Figure 1, and provide an impressive NVIDIA CUDA implementation that already delivers speedups over dense implementations for 50% unstructured sparsity and up.

Since nvmath-python UST provides an easy way to incorporate new handwritten kernels in the lookup mechanism, we experimented with using their implementation as a backend kernel for the delta-compressed format. Figure 5 compares the runtime of a dense implementation (GEMV), the cuSPARSE implementation of SpMV, the MACKO UST, and the original MACKO implementation for 8192×8192 uniformly random sparse matrices, ranging from 0% sparse (dense) to 100% sparse (zero). MACKO UST performs slightly better because padding stops at the end of each row (an optimization the original MACKO could easily incorporate as well, of course).

Learn more

For a deeper understanding of the nvmath-python UST implementation, please refer to the nvmath-python online documentation.
