Solving large-scale problems in Electronic Design Automation (EDA), Computational Fluid Dynamics (CFD), and advanced optimization workflows has become the norm as chip designs, manufacturing, and multi-physics simulations have grown in complexity. These workloads push traditional solvers to their limits and demand unprecedented scalability and performance. The NVIDIA CUDA Direct Sparse Solver (cuDSS) is built to let users run sparse solvers at massive scale with minimal code changes, unlocking breakthrough speed and efficiency for next-generation engineering and design.
You can leverage your CPU and GPU together with hybrid memory mode to run larger problems that would otherwise not fit in a single GPU's memory, run a workload across multiple GPUs, or even scale to multiple nodes with little effort. This blog discusses strategies cuDSS users can apply to solve large-scale problems.
Getting started
To get started, this blog assumes you already have a working code that uses cuDSS. You may have also explored the introductory examples on GitHub (here and here) that demonstrate running cuDSS on a single GPU and adjusting default solution parameters using Get and Set functions. These examples cover creating matrices and main cuDSS objects, and executing the three core phases of cuDSS: analysis, numerical factorization, and solution.
Thanks to the increased memory capacity of recent GPU generations, even a single GPU can handle fairly large sparse problems. However, when tackling truly massive problems—on the order of over 10 million rows and over a billion nonzeros—there are effective strategies to make cuDSS run fast and efficiently. The first approach still uses a single GPU but introduces techniques to address these bigger challenges without major code changes.
Rethink your data types: Why INT64 matters now
When you create a dense or sparse matrix for cuDSS, you are likely to use one or both of two functions: cudssMatrixCreateDn() and cudssMatrixCreateCsr(). From the documentation, these functions are declared as follows.
cudssMatrixCreateDn
cudssStatus_t cudssMatrixCreateDn(
    cudssMatrix_t  *matrix,
    int64_t         nrows,
    int64_t         ncols,
    int64_t         ld,
    void           *values,
    cudaDataType_t  valueType,
    cudssLayout_t   layout
)
The second function, cudssMatrixCreateCsr(), is shown next.
cudssMatrixCreateCsr
cudssStatus_t cudssMatrixCreateCsr(
    cudssMatrix_t         *matrix,
    int64_t                nrows,
    int64_t                ncols,
    int64_t                nnz,
    void                  *rowStart,
    void                  *rowEnd,
    void                  *colIndices,
    void                  *values,
    cudaDataType_t         indexType,
    cudaDataType_t         valueType,
    cudssMatrixType_t      mtype,
    cudssMatrixViewType_t  mview,
    cudssIndexBase_t       indexBase
)
In cuDSS versions before 0.7.0, the indices of sparse matrices could only use 32-bit integers. Specifically, the underlying data type for rowStart, rowEnd, and colIndices could only be int, and the indexType parameter could only be CUDA_R_32I. From cuDSS 0.7.0 onward, you can solve bigger problems by using 64-bit indexing arrays of type int64_t with CUDA_R_64I for the indexType.
Note: The input matrix is still limited to fewer than 2^31 rows and columns, but with 64-bit indices it can contain many more nonzeros.
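As a minimal sketch (not the exact code from the cuDSS examples), creating a CSR matrix with 64-bit index arrays could look like the snippet below. The sizes, array names, and matrix properties are placeholders, and the index and value arrays are assumed to already reside in device memory.
/* Illustrative sketch: a CSR matrix with 64-bit index arrays (cuDSS >= 0.7.0).
   Sizes, names, and matrix properties below are placeholders. */
int64_t n   = 12000000;          /* number of rows and columns              */
int64_t nnz = 2500000000LL;      /* ~2.5 billion nonzeros (exceeds 2^31 - 1) */

int64_t *csr_offsets_d = NULL;   /* length n + 1, allocated on the GPU       */
int64_t *csr_columns_d = NULL;   /* length nnz,   allocated on the GPU       */
double  *csr_values_d  = NULL;   /* length nnz,   allocated on the GPU       */
/* ... cudaMalloc and fill the three arrays here ... */

cudssMatrix_t A;
CUDSS_CALL_AND_CHECK(cudssMatrixCreateCsr(&A, n, n, nnz,
                         csr_offsets_d, NULL, csr_columns_d, csr_values_d,
                         CUDA_R_64I,        /* 64-bit indexType        */
                         CUDA_R_64F,        /* double-precision values */
                         CUDSS_MTYPE_SPD,   /* example matrix type     */
                         CUDSS_MVIEW_UPPER, /* example matrix view     */
                         CUDSS_BASE_ZERO),
                     status, "cudssMatrixCreateCsr with 64-bit indices");
The only changes relative to 32-bit indexing are the int64_t index arrays and passing CUDA_R_64I as the indexType.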
Hybrid memory mode—blurring the line between CPU and GPU
cuDSS hybrid memory mode is designed to overcome the memory limitations of a single GPU when solving extremely large sparse linear problems by using the GPU and CPU memories.
However, there’s a tradeoff: Data transfer between CPU and GPU takes time and is governed by bus bandwidth. While you get to tackle bigger problems, you should expect some performance hit due to these transfers. That said, thanks to modern NVIDIA driver optimizations and fast CPU/GPU interconnects (such as those in NVIDIA Grace Blackwell nodes), the penalty is manageable—and for certain problem sizes, hybrid memory performance scales impressively.
Hybrid memory mode is off by default. To enable it, call cudssConfigSet() with CUDSS_CONFIG_HYBRID_MODE, which tells cuDSS to use hybrid memory mode. Note that this has to be done before the first call to cudssExecute().
By default, cuDSS manages device memory automatically and may use as much as the entire GPU provides. Alternatively, users can specify a smaller memory footprint by setting a user-defined limit anywhere between the value of CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN and the device memory available after the analysis (symbolic factorization) phase, which can be queried via the NVIDIA CUDA Runtime API cudaMemGetInfo(). A few highlights to note:
- Even when hybrid memory mode is on, cuDSS first attempts to use device memory (and avoids CPU memory where possible) to achieve the best performance.
- Best performance is achieved by allowing the maximum amount of GPU memory, which results in fewer memory transfers between the CPU and the GPU.
- The hybrid memory limit can be set per device (as shown in the next code block).
The example code walks you through fetching minimum device memory requirements and setting your memory limits accordingly, giving you fine control over memory footprints.
...
/* Enable hybrid memory mode, where factors can be stored in host memory.
   Note: it must be set before the first call to the ANALYSIS step. */
int hybrid_mode = 1;
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
                                    &hybrid_mode, sizeof(hybrid_mode)),
                     status, "cudssConfigSet CUDSS_CONFIG_HYBRID_MODE");

/* Symbolic factorization */
...

/* (optional) User can query the minimal amount of device memory sufficient
   for the hybrid memory mode.
   Note: by default, cuDSS attempts to use all available device memory
   if needed. */
size_t sizeWritten;
int64_t device_memory_min;
CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                                  CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                                  &device_memory_min,
                                  sizeof(device_memory_min),
                                  &sizeWritten),
                     status,
                     "cudssDataGet for CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN");
printf("cuDSS example: minimum amount of device memory\n"
       "for the hybrid memory mode is %ld bytes\n",
       device_memory_min);

/* (optional) User can specify how much device memory is available for cuDSS.
   Note: by default, cuDSS attempts to use all available device memory
   if needed. */
int64_t hybrid_device_memory_limit = 40 * 1024; // in bytes = 40 KB
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                    &hybrid_device_memory_limit,
                                    sizeof(hybrid_device_memory_limit)),
                     status,
                     "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");
printf("cuDSS example: set the upper limit on device memory\n"
       "for the hybrid memory mode to %ld bytes\n",
       hybrid_device_memory_limit);

/* Factorization */
...

/* Solving */
...
The first call to cudssConfigSet() enables hybrid memory mode before the first analysis step, symbolic factorization. Next, cudssDataGet() queries the minimal amount of device memory sufficient for hybrid memory mode, and a second cudssConfigSet() call sets the upper limit on the device memory cuDSS may use. Note that the automatic memory management can sometimes result in out-of-memory (OOM) errors.
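As an illustration of how the two bounds mentioned earlier (the queried minimum and the free device memory reported by cudaMemGetInfo() after analysis) might be combined, here is a hedged sketch. The 90% headroom factor and the variable names are arbitrary choices for the example, not cuDSS recommendations.
/* Illustrative sketch: choose a device memory limit between the minimum
   required for hybrid memory mode and the memory currently free on the
   device (queried after the analysis phase). */
size_t free_mem = 0, total_mem = 0;
cudaMemGetInfo(&free_mem, &total_mem);

/* Keep some headroom for other allocations; 90% is an arbitrary margin. */
int64_t upper_bound = (int64_t)(0.9 * (double)free_mem);
int64_t limit = (device_memory_min > upper_bound) ? device_memory_min
                                                  : upper_bound;

CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                    &limit, sizeof(limit)),
                     status,
                     "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");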
For developers integrating this, the documentation's debugging tips are gold; save yourself some headaches by giving them a read.
Hybrid memory mode performance depends on the CPU/GPU memory bandwidth available to move data between the CPU and the GPU. To illustrate this, Figure 1 below shows the factorization and solve speed-up for matrix sizes ranging from 1 million to 18 million that are solved using cuDSS's hybrid memory mode. The baseline is a single NVIDIA B200 GPU node. The observed speed-up compares the same model executed on a Grace Blackwell node to an x86 Blackwell node, reflecting the memory bandwidth ratio between the two nodes.

With the INT64 and hybrid memory mode strategies, cuDSS can accommodate large problem sizes and make use of all the memory available on the node when needed. But we are still limited to a single GPU. The next strategy allows us to use more GPUs to accommodate larger problems, and it also allows us to solve a problem of a fixed size faster by using more GPUs.
Multiply your muscle: multi-GPU mode (MG mode)
cuDSS multi-GPU mode (MG mode) allows you to use all of the GPUs in a single node without having to specify any distributed communication layer; cuDSS handles all of the communication between GPUs internally. It is helpful in three scenarios:
- When the problem is too large to fit on a single device (with or without hybrid memory).
- When the user wants to avoid the performance penalty of hybrid memory mode.
- When the user is focused on strong scaling—solving the problem across more GPUs to reach a solution faster.
The highlight of MG mode is that the developer does not need to specify a communication layer: no MPI, no NCCL, nor any other communication library is required. cuDSS does all of this for you.
Additionally, because CUDA-aware MPI communication is limited on Windows nodes, MG mode is particularly valuable for applications running on Windows.
Figure 2 below illustrates the time (in seconds) required to solve an approximately 30-million-row matrix on an NVIDIA DGX H200 node across one-, two-, and four-GPU configurations with the factorization time on the top chart and the solve time on the bottom chart. The initial computation was performed on a single GPU, followed by runs using two and four GPUs with MG mode. As shown, solving the model with two GPUs significantly reduces computation time compared to a single GPU, albeit at the cost of increased GPU resource usage.

Figure 2. Factorization and solve time on H200 for one-, two-, and four-GPU configurations using Cadence’s MCAE applications. The matrix has approximately 31M rows and columns and approximately 1B nonzeros.
This example shows you how to utilize MG mode. The relevant parts of the code are summarized below. Note that this includes code for using hybrid memory mode. This is important because if you use hybrid memory, you have to set the device memory limits on all of the devices that will be used.
...
/* Creating the cuDSS library handle */
cudssHandle_t handle;

/* Query the actual number of available devices */
int device_count = 0;
cuda_error = cudaGetDeviceCount(&device_count);
if (cuda_error != cudaSuccess || device_count <= 0) {
    printf("ERROR: no GPU devices found\n");
    fflush(0);
    return -1;
}

/* device_indices can be set to NULL. In that case, cuDSS will take devices
 * from 0 to (device_count - 1).
 */
int *device_indices = NULL;
device_indices = (int *)malloc(device_count * sizeof(int));
if (device_indices == NULL) {
    printf("ERROR: failed to allocate host memory\n");
    fflush(0);
    return -1;
}
for (int i = 0; i < device_count; i++)
    device_indices[i] = i;
...
/* Initialize the cuDSS handle for multiple devices */
CUDSS_CALL_AND_CHECK(cudssCreateMg(&handle, device_count, device_indices),
                     status, "cudssCreateMg");
...
/* Creating cuDSS solver configuration and data objects */
cudssConfig_t solverConfig;
cudssData_t solverData;
CUDSS_CALL_AND_CHECK(cudssConfigCreate(&solverConfig), status,
                     "cudssConfigCreate");

/* Pass the same device_count and device_indices to solverConfig */
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_DEVICE_COUNT, &device_count,
                                    sizeof(device_count)), status,
                     "cudssConfigSet for device_count");
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_DEVICE_INDICES, device_indices,
                                    device_count * sizeof(int)), status,
                     "cudssConfigSet for device_indices");
CUDSS_CALL_AND_CHECK(cudssDataCreate(handle, &solverData), status,
                     "cudssDataCreate");
...
/* Symbolic factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_ANALYSIS,
                                  solverConfig, solverData, A, x, b),
                     status, "cudssExecute for analysis");
...
/* Querying CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN should be done for each device
 * separately by calling cudaSetDevice() prior to cudssDataGet().
 * The same applies to getting CUDSS_DATA_MEMORY_ESTIMATES and to setting
 * CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT with cudssConfigSet().
 */
int default_device = 0;
cudaGetDevice(&default_device);
for (int dev_id = 0; dev_id < device_count; dev_id++) {
    cudaSetDevice(device_indices[dev_id]);
    int64_t hybrid_device_memory_limit = 0;
    size_t sizeWritten;
    CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                                      CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                                      &hybrid_device_memory_limit,
                                      sizeof(hybrid_device_memory_limit),
                                      &sizeWritten),
                         status, "cudssDataGet for the memory estimates");
    printf("dev_id = %d CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN %ld bytes\n",
           device_indices[dev_id], hybrid_device_memory_limit);
}
/* cuDSS requires all API calls to be made on the default device, so
 * reset the device context.
 */
cudaSetDevice(default_device);

/* Factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_FACTORIZATION,
                                  solverConfig, solverData, A, x, b),
                     status, "cudssExecute for factorization");

/* Solving */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE, solverConfig,
                                  solverData, A, x, b), status,
                     "cudssExecute for solve");
...
Setting up MG mode is straightforward. It begins by querying the number of devices on the node and using all of them, or just the specific number of devices you want. The device indices are then filled in starting with device 0 (the code will use the first device_count devices; you can change this to a specific list of device numbers if you want). You could easily take the number of devices and the device index list from the command line or a file to make your code more flexible.
After this, the MG-specific coding begins by calling cudssCreateMg() to initialize the cuDSS handle for multiple devices. Before you call a solution phase, you also need to initialize the cuDSS configuration with the device information. Specifically, after creating a cuDSS solver configuration object using cudssConfigCreate(), you should set the configuration details for MG mode using cudssConfigSet() for the following:
- CUDSS_CONFIG_DEVICE_COUNT, using device_count.
- CUDSS_CONFIG_DEVICE_INDICES, using the array device_indices.
Then you use the function cudssDataCreate() to create the solverData for cuDSS and perform the analysis stage next.
If you are using hybrid memory mode, you might want to set device memory limits for each of the devices separately prior to the factorization. The per-device query loop is shown in the code above; setting the limit follows the same pattern, as sketched below. Once completed, you can factorize the matrix and solve the problem.
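As a hedged sketch, setting the same limit on every device might look like the loop below. The 32 GB value is a placeholder, and, as in the query loop above, the device context must be switched with cudaSetDevice() before each call and restored afterwards.
/* Illustrative sketch: set a per-device hybrid memory limit on each GPU
   used in MG mode. The 32 GB value is a placeholder. */
int64_t per_device_limit = 32LL * 1024 * 1024 * 1024;  /* in bytes */
for (int dev_id = 0; dev_id < device_count; dev_id++) {
    cudaSetDevice(device_indices[dev_id]);
    CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                        CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                        &per_device_limit,
                                        sizeof(per_device_limit)),
                         status, "cudssConfigSet for per-device memory limit");
}
cudaSetDevice(default_device);  /* restore the default device context */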
A highlight of MG mode is that you don’t need to code for communications between the GPUs. cuDSS does all of that for you. However, there are some current limitations to using MG mode:
- Using MG mode jointly with multi-GPU multi-node (MGMN) mode is not supported (the next section talks about MGMN).
- Distributed input is not currently supported.
- MG mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MG mode does not support matrix batches.
- All phases in MG mode are synchronous.
- All data must be on the first device (rank 0) before calling cudssExecute(). cuDSS will then distribute the data to the other devices as needed.
Going big: Multi-GPU Multi-Node (MGMN) mode for distributed power
Now, what if one node isn’t enough and you want to spread your computations across multiple nodes? That’s where MGMN mode comes in. It requires a communication layer; once that is added, you can use any or all of the GPUs in a node, as well as GPUs across multiple nodes. This enables users to solve massive problems, or to use more GPUs to solve a problem faster.
cuDSS uses an abstraction—a small communication “shim” layer that can be tailored to CUDA-aware Open MPI, NVIDIA NCCL, or even a custom one, if you want.
This example for MGMN has the code for both Open MPI and NCCL. If you wish to use your own communication layer, there is an explanation on how to do that.
To illustrate how the communication layer is used, an ifdef code block from the example is presented below for both the MPI and NCCL code paths. Some constants defined at compile time are important for this example but aren’t shown in the code block: USE_MPI, USE_OPENMPI, and USE_NCCL, which select the code paths to be built.
This ifdef code block is for lines 520-535 in the sample code (these line numbers could change with subsequent versions so check them carefully).
#ifdef USE_MPI
#if USE_OPENMPI
    if (strcmp(comm_backend_name, "openmpi") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                                          mpi_comm, sizeof(MPI_Comm*)),
                             status, "cudssDataSet for OpenMPI comm");
    }
#endif
#if USE_NCCL
    if (strcmp(comm_backend_name, "nccl") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                                          nccl_comm, sizeof(ncclComm_t*)),
                             status, "cudssDataSet for NCCL comm");
    }
#endif
#endif
Note that the code changes needed to select MPI or NCCL are minimal, and the differences between the two paths are small. You can use your own communication layer in a very similar manner.
Once you define the communicator pointer and pass it to cuDSS via CUDSS_DATA_COMM, as shown in the previous code snippet, there is no need to call any other communication-layer function unless your code specifically requires it. cuDSS uses the defined communication layer under the covers, so you don’t need to code for it. Look through the example code to see how more than one node is used.
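To make the surrounding setup concrete, here is a rough sketch of the per-rank boilerplate that could precede the snippet above when the MPI path is compiled in. The round-robin device selection and the mpi_comm allocation are illustrative conventions, not cuDSS requirements.
#ifdef USE_MPI
    /* Illustrative sketch of per-rank setup (inside main) preceding the
       cudssDataSet() call shown above. */
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Bind each rank to one GPU on its node (simple round-robin policy). */
    int local_device_count = 0;
    cudaGetDeviceCount(&local_device_count);
    cudaSetDevice(rank % local_device_count);

    /* Communicator pointer later passed to cuDSS via CUDSS_DATA_COMM. */
    MPI_Comm *mpi_comm = (MPI_Comm *)malloc(sizeof(MPI_Comm));
    *mpi_comm = MPI_COMM_WORLD;
#endif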
For implementing your own communication layer, a good introductory discussion can be found in the cuDSS documentation under advanced topics.
A high-level overview of the communication layer requirements is below:
- MGMN mode is enabled by abstracting away all communication-specific primitives into a small, separately built shim communication layer.
- Users can provide their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).
- Enabling MGMN execution in cuDSS does not require any changes for applications that do not make use of MGMN mode.
- MGMN mode supports 1D row-wise distribution (with overlapping) for the input CSR matrix, dense right-hand side, or solution, using the cudssMatrixSetDistributedRow1d() function (see the next paragraph).
cuDSS MGMN mode optionally accepts pre-distributed input and can optionally create distributed output. You can keep both A and b on the rank 0 device, in which case cuDSS will distribute them, or you can tell cuDSS how the data is distributed across the devices and nodes with the cudssMatrixSetDistributedRow1d() function. In the latter case, the developer has to make sure the data is in the proper location on the proper node and device.
A critical step for good performance is to carefully choose your CPU:GPU:NIC bindings. This is not discussed here but is documented elsewhere.
There are some current limitations to MGMN mode:
- MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MGMN mode does not support matrix batches.
- All phases in MGMN mode are synchronous.
Takeaways
Sparse linear systems appear in many disciplines. Fueled by the need to solve real-life problems, the overall size of these systems is growing at a rapid rate, and developers must find ways to solve them both efficiently and quickly. NVIDIA cuDSS provides an easy-to-use library for solving increasingly massive problems on NVIDIA GPUs.
For more features you can use with cuDSS, it is recommended that you read through the advanced features section of the documentation. It contains more information about the features presented here, as well as other capabilities to help you solve your large sparse linear problems. There is also a section that explains how to enable logging in cuDSS as you develop your code. Since debugging parallel code can be challenging, the log information cuDSS produces as your code executes is a valuable resource.
Subscribe to cuDSS on the customer page to stay updated on the most recent innovations.