Solving large-scale problems in Electronic Design Automation (EDA), Computational Fluid Dynamics (CFD), and advanced optimization workflows has become the norm as chip designs, manufacturing, and multi-physics simulations have grown in complexity. These workloads push traditional solvers to their limits and demand unprecedented scalability and performance. The NVIDIA CUDA Direct Sparse Solver (cuDSS) is built to let users run sparse solvers at massive scale with minimal code changes, unlocking breakthrough speed and efficiency for next-generation engineering and design.
You can leverage your CPU and GPU together with hybrid memory mode to run larger problems that would otherwise not fit in a single GPU's memory, run a workload across multiple GPUs, or even scale to multiple nodes with little effort. This blog discusses strategies cuDSS users can apply to solve large-scale problems.
Getting started
To get started, this blog assumes you already have a working code that uses cuDSS. You may have also explored the introductory examples on GitHub (here and here) that demonstrate running cuDSS on a single GPU and adjusting default solution parameters using Get and Set functions. These examples cover creating matrices and main cuDSS objects, and executing the three core phases of cuDSS: analysis, numerical factorization, and solution.
Thanks to the increased memory capacity of recent GPU generations, even a single GPU can handle fairly large sparse problems. However, when tackling truly massive problems—on the order of over 10 million rows and over a billion nonzeros—there are effective strategies to make cuDSS run fast and efficiently. The first approach still uses a single GPU but introduces techniques to address these bigger challenges without major code changes.
Rethink your data types: Why INT64 matters now
When you create a dense or sparse matrix for cuDSS, you are likely to use one or both of two functions: cudssMatrixCreateDn() and cudssMatrixCreateCsr(). From the documentation, these functions are declared as follows.
cudssMatrixCreateDn
cudssStatus_t cudssMatrixCreateDn(
    cudssMatrix_t  *matrix,
    int64_t         nrows,
    int64_t         ncols,
    int64_t         ld,
    void           *values,
    cudaDataType_t  valueType,
    cudssLayout_t   layout
)
The second function, cudssMatrixCreateCsr(), is shown next.
cudssMatrixCreateCsr
cudssStatus_t cudssMatrixCreateCsr(
    cudssMatrix_t         *matrix,
    int64_t                nrows,
    int64_t                ncols,
    int64_t                nnz,
    void                  *rowStart,
    void                  *rowEnd,
    void                  *colIndices,
    void                  *values,
    cudaDataType_t         indexType,
    cudaDataType_t         valueType,
    cudssMatrixType_t      mtype,
    cudssMatrixViewType_t  mview,
    cudssIndexBase_t       indexBase
)
In cuDSS versions before 0.7.0, the indices of sparse matrices could only use 32-bit integers. Specifically, the underlying data type for rowStart, rowEnd, and colIndices could only be int, and the indexType parameter could only be CUDA_R_32I. From cuDSS 0.7.0 onward, you can solve bigger problems by using 64-bit indexing arrays of type int64_t with CUDA_R_64I for the indexType.
Note: The input matrix is still limited to fewer than 2^31 rows and columns, but with 64-bit indices it can contain many more nonzeros.
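As a minimal sketch (not the exact code from the cuDSS examples), creating a CSR matrix with 64-bit index arrays could look like the snippet below. The sizes, array names, and matrix properties are placeholders, and the index and value arrays are assumed to already reside in device memory.
/* Illustrative sketch: a CSR matrix with 64-bit index arrays (cuDSS >= 0.7.0).
   Sizes, names, and matrix properties below are placeholders. */
int64_t n   = 12000000;          /* number of rows and columns              */
int64_t nnz = 2500000000LL;      /* ~2.5 billion nonzeros (exceeds 2^31 - 1) */

int64_t *csr_offsets_d = NULL;   /* length n + 1, allocated on the GPU       */
int64_t *csr_columns_d = NULL;   /* length nnz,   allocated on the GPU       */
double  *csr_values_d  = NULL;   /* length nnz,   allocated on the GPU       */
/* ... cudaMalloc and fill the three arrays here ... */

cudssMatrix_t A;
CUDSS_CALL_AND_CHECK(cudssMatrixCreateCsr(&A, n, n, nnz,
                         csr_offsets_d, NULL, csr_columns_d, csr_values_d,
                         CUDA_R_64I,        /* 64-bit indexType        */
                         CUDA_R_64F,        /* double-precision values */
                         CUDSS_MTYPE_SPD,   /* example matrix type     */
                         CUDSS_MVIEW_UPPER, /* example matrix view     */
                         CUDSS_BASE_ZERO),
                     status, "cudssMatrixCreateCsr with 64-bit indices");
The only changes relative to 32-bit indexing are the int64_t index arrays and passing CUDA_R_64I as the indexType.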
Hybrid memory mode—blurring the line between CPU and GPU
cuDSS hybrid memory mode is designed to overcome the memory limitations of a single GPU when solving extremely large sparse linear problems by using the GPU and CPU memories.
However, there’s a tradeoff: Data transfer between CPU and GPU takes time and is governed by bus bandwidth. While you get to tackle bigger problems, you should expect some performance hit due to these transfers. That said, thanks to modern NVIDIA driver optimizations and fast CPU/GPU interconnects (such as those in NVIDIA Grace Blackwell nodes), the penalty is manageable—and for certain problem sizes, hybrid memory performance scales impressively.
Hybrid memory mode is off by default. To enable it, call cudssConfigSet() with CUDSS_CONFIG_HYBRID_MODE, which tells cuDSS to use hybrid memory mode. Note that this has to be done before the first call to cudssExecute().
By default, cuDSS manages device memory automatically and may use as much as the entire GPU provides. Alternatively, users can specify a smaller memory footprint by setting a user-defined limit anywhere between the value of CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN and the device memory available after the analysis (symbolic factorization) phase, which can be queried via the NVIDIA CUDA Runtime API cudaMemGetInfo(). A few highlights to note:
- Even when hybrid memory mode is on, cuDSS first attempts to use device memory (and avoids CPU memory where possible) to achieve the best performance.
- Best performance is achieved by allowing the maximum amount of GPU memory, which results in fewer memory transfers between the CPU and the GPU.
- The hybrid memory limit can be set per device (as shown in the next code block).
The example code walks you through fetching minimum device memory requirements and setting your memory limits accordingly, giving you fine control over memory footprints.
...
/* Enable hybrid memory mode, where factors can be stored in host memory.
   Note: it must be set before the first call to the ANALYSIS step. */
int hybrid_mode = 1;
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
                                    &hybrid_mode, sizeof(hybrid_mode)),
                     status, "cudssConfigSet CUDSS_CONFIG_HYBRID_MODE");

/* Symbolic factorization */
...

/* (optional) User can query the minimal amount of device memory sufficient
   for the hybrid memory mode.
   Note: by default, cuDSS attempts to use all available device memory
   if needed. */
size_t sizeWritten;
int64_t device_memory_min;
CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                                  CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                                  &device_memory_min,
                                  sizeof(device_memory_min),
                                  &sizeWritten),
                     status,
                     "cudssDataGet for CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN");
printf("cuDSS example: minimum amount of device memory\n"
       "for the hybrid memory mode is %ld bytes\n",
       device_memory_min);

/* (optional) User can specify how much device memory is available for cuDSS.
   Note: by default, cuDSS attempts to use all available device memory
   if needed. */
int64_t hybrid_device_memory_limit = 40 * 1024; // in bytes = 40 KB
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                    &hybrid_device_memory_limit,
                                    sizeof(hybrid_device_memory_limit)),
                     status,
                     "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");
printf("cuDSS example: set the upper limit on device memory\n"
       "for the hybrid memory mode to %ld bytes\n",
       hybrid_device_memory_limit);

/* Factorization */
...

/* Solving */
...
The first call to cudssConfigSet() enables hybrid memory mode before the first analysis step, symbolic factorization. Next, cudssDataGet() queries the minimal amount of device memory sufficient for hybrid memory mode, and a second cudssConfigSet() call sets the upper limit on the device memory cuDSS may use. Note that the automatic memory management can sometimes result in out-of-memory (OOM) errors.
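As an illustration of how the two bounds mentioned earlier (the queried minimum and the free device memory reported by cudaMemGetInfo() after analysis) might be combined, here is a hedged sketch. The 90% headroom factor and the variable names are arbitrary choices for the example, not cuDSS recommendations.
/* Illustrative sketch: choose a device memory limit between the minimum
   required for hybrid memory mode and the memory currently free on the
   device (queried after the analysis phase). */
size_t free_mem = 0, total_mem = 0;
cudaMemGetInfo(&free_mem, &total_mem);

/* Keep some headroom for other allocations; 90% is an arbitrary margin. */
int64_t upper_bound = (int64_t)(0.9 * (double)free_mem);
int64_t limit = (device_memory_min > upper_bound) ? device_memory_min
                                                  : upper_bound;

CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                    &limit, sizeof(limit)),
                     status,
                     "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");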
For developers integrating this, the documentation's debugging tips are gold; save yourself some headaches by giving them a read.
Hybrid memory mode performance depends on the CPU/GPU memory bandwidth available to move data between the CPU and the GPU. To illustrate this, Figure 1 below shows the factorization and solve speed-up for matrix sizes ranging from 1 million to 18 million that are solved using cuDSS's hybrid memory mode. The baseline is a single NVIDIA B200 GPU node. The observed speed-up compares the same model executed on a Grace Blackwell node to an x86 Blackwell node, reflecting the memory bandwidth ratio between the two nodes.

With the INT64 and hybrid memory mode strategies, cuDSS can accommodate large problem sizes and make use of all the memory available on the node when needed. But we are still limited to a single GPU. The next strategy allows us to use more GPUs to accommodate larger problems, and it also allows us to solve a problem of a fixed size faster by using more GPUs.
Multiply your muscle: multi-GPU mode (MG mode)
cuDSS multi-GPU mode (MG mode) allows you to use all of the GPUs in a single node without having to specify any distributed communication layer; cuDSS handles all of the communication between GPUs internally. It is helpful in three scenarios:
- When the problem is too large to fit on a single device (with or without hybrid memory).
- When the user wants to avoid the performance penalty of hybrid memory mode.
- When the user is focused on strong scaling—solving the problem across more GPUs to reach a solution faster.
The highlight of MG mode is that the developer does not need to specify a communication layer: no MPI, no NCCL, nor any other communication library is required. cuDSS does all of this for you.
Additionally, because CUDA-aware MPI communication is limited on Windows nodes, MG mode is particularly valuable for applications running on Windows.
Figure 2 below illustrates the time (in seconds) required to solve an approximately 30-million-row matrix on an NVIDIA DGX H200 node across one-, two-, and four-GPU configurations with the factorization time on the top chart and the solve time on the bottom chart. The initial computation was performed on a single GPU, followed by runs using two and four GPUs with MG mode. As shown, solving the model with two GPUs significantly reduces computation time compared to a single GPU, albeit at the cost of increased GPU resource usage.

Figure 2. Factorization and solve time on H200 for one-, two-, and four-GPU configurations using Cadence’s MCAE applications. The matrix has approximately 31M rows and columns and approximately 1B nonzeros.
This example shows you how to utilize MG mode. The relevant parts of the code are summarized below. Note that this includes code for using hybrid memory mode. This is important because if you use hybrid memory, you have to set the device memory limits on all of the devices that will be used.
...
/* Creating the cuDSS library handle */
cudssHandle_t handle;

/* Query the actual number of available devices */
int device_count = 0;
cuda_error = cudaGetDeviceCount(&device_count);
if (cuda_error != cudaSuccess || device_count <= 0) {
    printf("ERROR: no GPU devices found\n");
    fflush(0);
    return -1;
}

/* device_indices can be set to NULL. In that case, cuDSS will take devices
 * from 0 to (device_count - 1).
 */
int *device_indices = NULL;
device_indices = (int *)malloc(device_count * sizeof(int));
if (device_indices == NULL) {
    printf("ERROR: failed to allocate host memory\n");
    fflush(0);
    return -1;
}
for (int i = 0; i < device_count; i++)
    device_indices[i] = i;
...
/* Initialize the cuDSS handle for multiple devices */
CUDSS_CALL_AND_CHECK(cudssCreateMg(&handle, device_count, device_indices),
                     status, "cudssCreateMg");
...
/* Creating cuDSS solver configuration and data objects */
cudssConfig_t solverConfig;
cudssData_t solverData;
CUDSS_CALL_AND_CHECK(cudssConfigCreate(&solverConfig), status,
                     "cudssConfigCreate");

/* Pass the same device_count and device_indices to solverConfig */
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_DEVICE_COUNT, &device_count,
                                    sizeof(device_count)), status,
                     "cudssConfigSet for device_count");
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_DEVICE_INDICES, device_indices,
                                    device_count * sizeof(int)), status,
                     "cudssConfigSet for device_indices");
CUDSS_CALL_AND_CHECK(cudssDataCreate(handle, &solverData), status,
                     "cudssDataCreate");
...
/* Symbolic factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_ANALYSIS,
                                  solverConfig, solverData, A, x, b),
                     status, "cudssExecute for analysis");
...
/* Querying CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN should be done for each device
 * separately by calling cudaSetDevice() prior to cudssDataGet().
 * The same applies to getting CUDSS_DATA_MEMORY_ESTIMATES and to setting
 * CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT with cudssConfigSet().
 */
int default_device = 0;
cudaGetDevice(&default_device);
for (int dev_id = 0; dev_id < device_count; dev_id++) {
    cudaSetDevice(device_indices[dev_id]);
    int64_t hybrid_device_memory_limit = 0;
    size_t sizeWritten;
    CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                                      CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                                      &hybrid_device_memory_limit,
                                      sizeof(hybrid_device_memory_limit),
                                      &sizeWritten),
                         status, "cudssDataGet for the memory estimates");
    printf("dev_id = %d CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN %ld bytes\n",
           device_indices[dev_id], hybrid_device_memory_limit);
}
/* cuDSS requires all API calls to be made on the default device, so
 * reset the device context.
 */
cudaSetDevice(default_device);

/* Factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_FACTORIZATION,
                                  solverConfig, solverData, A, x, b),
                     status, "cudssExecute for factorization");

/* Solving */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE, solverConfig,
                                  solverData, A, x, b), status,
                     "cudssExecute for solve");
...
Setting up MG mode is straightforward. It begins by querying the number of devices on the node and using all of them, or just the specific number of devices you want. The device indices are then filled in starting with device 0 (the code will use the first device_count devices; you can change this to a specific list of device numbers if you want). You could easily take the number of devices and the device index list from the command line or a file to make your code more flexible.
After this, the MG-specific coding begins by calling cudssCreateMg() to initialize the cuDSS handle for multiple devices. Before you call a solution phase, you also need to initialize the cuDSS configuration with the device information. Specifically, after creating a cuDSS solver configuration object using cudssConfigCreate(), you should set the configuration details for MG mode using cudssConfigSet() for the following:
- CUDSS_CONFIG_DEVICE_COUNT, using device_count.
- CUDSS_CONFIG_DEVICE_INDICES, using the array device_indices.
Then you use the function cudssDataCreate() to create the solverData for cuDSS and perform the analysis stage next.
If you are using hybrid memory mode, you might want to set device memory limits for each of the devices separately prior to the factorization. The per-device query loop is shown in the code above; setting the limit follows the same pattern, as sketched below. Once completed, you can factorize the matrix and solve the problem.
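As a hedged sketch, setting the same limit on every device might look like the loop below. The 32 GB value is a placeholder, and, as in the query loop above, the device context must be switched with cudaSetDevice() before each call and restored afterwards.
/* Illustrative sketch: set a per-device hybrid memory limit on each GPU
   used in MG mode. The 32 GB value is a placeholder. */
int64_t per_device_limit = 32LL * 1024 * 1024 * 1024;  /* in bytes */
for (int dev_id = 0; dev_id < device_count; dev_id++) {
    cudaSetDevice(device_indices[dev_id]);
    CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                        CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                        &per_device_limit,
                                        sizeof(per_device_limit)),
                         status, "cudssConfigSet for per-device memory limit");
}
cudaSetDevice(default_device);  /* restore the default device context */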
A highlight of MG mode is that you don’t need to code for communications between the GPUs. cuDSS does all of that for you. However, there are some current limitations to using MG mode:
- Using MG mode jointly with multi-GPU multi-node (MGMN) mode is not supported (the next section talks about MGMN).
- Distributed input is not currently supported.
- MG mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MG mode does not support matrix batches.
- All phases in MG mode are synchronous.
- All data must be on the first device (rank 0) before calling cudssExecute(). cuDSS will then distribute the data to the other devices as needed.
Going big: Multi-GPU Multi-Node (MGMN) mode for distributed power
Now, what if one node isn’t enough and you want to spread your computations across multiple nodes? That’s where MGMN mode comes in. It requires a communication layer; once that is added, you can use any or all of the GPUs in a node, as well as GPUs across multiple nodes. This enables users to solve massive problems, or to use more GPUs to solve a problem faster.
cuDSS uses an abstraction—a small communication “shim” layer that can be tailored to CUDA-aware Open MPI, NVIDIA NCCL, or even a custom one, if you want.
This example for MGMN has the code for both Open MPI and NCCL. If you wish to use your own communication layer, there is an explanation on how to do that.
To illustrate how the communication layer is used, an ifdef code block from the example is presented below for both the MPI and NCCL code paths. Some constants defined at compile time are important for this example but aren’t shown in the code block: USE_MPI, USE_OPENMPI, and USE_NCCL, which select the code paths to be built.
This ifdef code block is for lines 520-535 in the sample code (these line numbers could change with subsequent versions so check them carefully).
#ifdef USE_MPI
#if USE_OPENMPI
    if (strcmp(comm_backend_name, "openmpi") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                                          mpi_comm, sizeof(MPI_Comm*)),
                             status, "cudssDataSet for OpenMPI comm");
    }
#endif
#if USE_NCCL
    if (strcmp(comm_backend_name, "nccl") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                                          nccl_comm, sizeof(ncclComm_t*)),
                             status, "cudssDataSet for NCCL comm");
    }
#endif
#endif
Note that the code changes needed to select MPI or NCCL are minimal, and the differences between the two paths are small. You can use your own communication layer in a very similar manner.
Once you define the communicator pointer and pass it to cuDSS via CUDSS_DATA_COMM, as shown in the previous code snippet, there is no need to call any other communication-layer function unless your code specifically requires it. cuDSS uses the defined communication layer under the covers, so you don’t need to code for it. Look through the example code to see how more than one node is used.
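To make the surrounding setup concrete, here is a rough sketch of the per-rank boilerplate that could precede the snippet above when the MPI path is compiled in. The round-robin device selection and the mpi_comm allocation are illustrative conventions, not cuDSS requirements.
#ifdef USE_MPI
    /* Illustrative sketch of per-rank setup (inside main) preceding the
       cudssDataSet() call shown above. */
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Bind each rank to one GPU on its node (simple round-robin policy). */
    int local_device_count = 0;
    cudaGetDeviceCount(&local_device_count);
    cudaSetDevice(rank % local_device_count);

    /* Communicator pointer later passed to cuDSS via CUDSS_DATA_COMM. */
    MPI_Comm *mpi_comm = (MPI_Comm *)malloc(sizeof(MPI_Comm));
    *mpi_comm = MPI_COMM_WORLD;
#endif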
For implementing your own communication layer, a good introductory discussion can be found in the cuDSS documentation under advanced topics.
A high-level overview of the communication layer requirements is below:
- MGMN mode is enabled by abstracting away all communication-specific primitives into a small, separately built shim communication layer.
- Users can provide their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).
- Enabling MGMN execution in cuDSS does not require any changes for applications that do not make use of MGMN mode.
- MGMN mode supports 1D row-wise distribution (with overlapping) for the input CSR matrix, dense right-hand side, or solution, using the cudssMatrixSetDistributedRow1d() function (see the next paragraph).
cuDSS MGMN mode optionally accepts pre-distributed input and can optionally create distributed output. You can keep both A and b on the rank 0 device, in which case cuDSS will distribute them, or you can tell cuDSS how the data is distributed across the devices and nodes with the cudssMatrixSetDistributedRow1d() function. In the latter case, the developer has to make sure the data is in the proper location on the proper node and device.
A critical step for good performance is to carefully choose your CPU:GPU:NIC bindings. This is not discussed here but is documented elsewhere.
There are some current limitations to MGMN mode:
- MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MGMN mode does not support matrix batches.
- All phases in MGMN mode are synchronous.
Takeaways
Sparse linear systems appear in many disciplines. Fueled by the need to solve real-life problems, the overall size of these systems is growing at a rapid rate, and developers must find ways to solve them both efficiently and quickly. NVIDIA cuDSS provides an easy-to-use library for solving increasingly massive problems on NVIDIA GPUs.
For more features you can use with cuDSS, it is recommended that you read through the advanced features section of the documentation. It contains more information about the features presented here, as well as other capabilities to help you solve your large sparse linear problems. There is also a section that explains how to enable logging in cuDSS as you develop your code. Since debugging parallel code can be challenging, the log information cuDSS produces as your code executes is a valuable resource.
Subscribe to cuDSS on the customer page to stay updated on the most recent innovations.