Starting with the 25.10 release, pip-installable cuML wheels can now be downloaded directly from PyPI. No more complex installation steps or managing Conda environments—just straightforward pip installation like any other Python package.
The NVIDIA team has been working hard to make cuML more accessible and efficient across the board. One of the biggest challenges has been managing the binary size of our CUDA C++ libraries, which affects both user experience and the ability to pip install from PyPI. Distributing wheels on pypi.org reaches a broader audience and lets users in corporate settings make the wheels available on internal PyPI mirrors.
PyPI limits binary size to keep costs for the Python Software Foundation (PSF) under control and to protect users from downloading unexpectedly large binaries. The complexity of the cuML library has historically required a larger binary than PyPI could host, but we've worked closely with the PSF to overcome this by reducing the binary size.
This post walks you through the new pip install path for cuML and a tutorial on the steps the team used to drop the CUDA C++ library binary size, which enabled the availability of cuML wheels on PyPI.
Installing cuML from PyPI
To install cuML from PyPI, use the following commands based on your system’s CUDA version. These packages have been optimized for compatibility and performance.
CUDA 13
Wheel size: ~250 MB
pip install cuml-cu13
CUDA 12
Wheel size: ~470 MB
pip install cuml-cu12
How the cuML team reduced binary size by ~30%
By applying careful optimization techniques, the NVIDIA team successfully reduced the CUDA 12 libcuml dynamic shared object (DSO) size from approximately 690 MB to 490 MB—a reduction of nearly 200 MB or ~30%.
Smaller binaries provide:
- Faster downloads from PyPI
- Reduced storage requirements for users
- Quicker container builds for deployment
- Lower bandwidth costs for distribution
Reducing binary size required a systematic approach to identifying and eliminating bloat in the CUDA C++ codebase. Later in the post, we share the techniques used to accomplish this, which can benefit any team working with CUDA C++ libraries. We hope these methods help library developers keep their binaries manageable and nudge the broader ecosystem of CUDA C++ libraries toward smaller binary sizes.
Why are CUDA binaries so large?
If you've ever shipped CUDA C++ code as a compiled binary, you've likely noticed that these libraries are significantly larger than equivalent C++ libraries offering similar features. CUDA C++ libraries contain numerous kernels (GPU functions) that account for the bulk of the binary size. The set of compiled kernel instances is essentially the cross product of:
- All template parameter combinations instantiated in the code
- All real GPU architectures the library supports, each compiled to real-ISA machine code, the final binary format executed on the GPU
As you add more features and support newer architectures, binary sizes can quickly become intractable. Even supporting just a single architecture results in binaries considerably larger than CPU-only libraries with the same feature set.
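To make the cross product concrete, here is a minimal sketch with a hypothetical kernel template; the value types, layout flag, and architecture list are illustrative rather than taken from cuML:
Hypothetical kernel template and instantiations (distance_kernel.cu):
template <typename T, bool RowMajor>
__global__ void distance_kernel(const T* in, T* out, int n) {
  // kernel body
}

// Explicit instantiations: 3 value types x 2 layouts = 6 kernels.
template __global__ void distance_kernel<float, true>(const float*, float*, int);
template __global__ void distance_kernel<float, false>(const float*, float*, int);
template __global__ void distance_kernel<double, true>(const double*, double*, int);
template __global__ void distance_kernel<double, false>(const double*, double*, int);
template __global__ void distance_kernel<int, true>(const int*, int*, int);
template __global__ void distance_kernel<int, false>(const int*, int*, int);

// Compiling for five real architectures multiplies this again, for example:
//   nvcc -c distance_kernel.cu \
//     -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 \
//     -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 \
//     -gencode arch=compute_90,code=sm_90
// 6 instantiations x 5 real architectures = 30 copies of kernel machine code.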
Note that the techniques shared here aren't a panacea for all binary size issues, and they don't cover every possible optimization method. We're highlighting some of the better practices that worked for us in cuML and other RAPIDS libraries like RAFT and cuVS. Keep in mind that the examples are somewhat general and that developers must often weigh tradeoffs between binary size and runtime performance.
Understanding CUDA Whole Compilation mode
Before diving into solutions, it’s crucial to understand how CUDA compilation works by default.
CUDA C++ libraries are typically compiled in Whole Compilation mode. This means that every Translation Unit (TU), that is, each .cu source file, that directly launches a kernel with triple chevron syntax (kernel<<<...>>>) includes its own compiled copy of that kernel. While the standard C++ link process removes duplicate symbols from the final binary, the CUDA C++ link process keeps every per-TU copy of the kernel.
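For example, if a utility kernel template lives in a header and two different TUs launch it, each TU gets its own compiled copy of that kernel, and both copies end up in the final shared library. The file and kernel names below are illustrative:
Shared kernel template (utils.cuh):
template <typename T>
__global__ void fill_kernel(T* ptr, T value, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { ptr[i] = value; }
}
First caller (a.cu):
#include "utils.cuh"

void fill_ones(float* ptr, int n) {
  fill_kernel<<<(n + 255) / 256, 256>>>(ptr, 1.0f, n);  // instantiates fill_kernel<float> in this TU
}
Second caller (b.cu):
#include "utils.cuh"

void fill_zeros(float* ptr, int n) {
  fill_kernel<<<(n + 255) / 256, 256>>>(ptr, 0.0f, n);  // instantiates a second copy of fill_kernel<float>
}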

To check whether there are duplicate kernel instantiations in your DSO, you can run the following command:
cuobjdump -symbols libcuml.so | grep STO_ENTRY | sort -b | uniq -c | sort -gb
Note: While enabling CUDA Separable Compilation can filter out duplicate kernels, it’s not a complete solution. In fact, enabling it by default may actually increase binary size and link time in some cases. For more details, see Build CUDA Software at the Speed of Light.
Removing duplicate kernel instances programmatically
The key to solving this problem is to separate the kernel function definition from the declaration, ensuring each kernel is compiled in exactly one TU. Here’s how to structure this:
Function declaration (kernel.hpp):
namespace library {
void kernel_launcher();
}
Function and kernel compilation in one TU only (kernel.cu):
#include <library/kernel.hpp>

namespace library {

__global__ void kernel() {
  // code body
}

void kernel_launcher() {
  kernel<<<...>>>();
}

}  // namespace library
Requesting kernel execution (example.cu):
#include <library/kernel.hpp>

void example() {
  library::kernel_launcher();
}
By separating the kernel definition from its declaration, the kernel is compiled into exactly one TU, and other TUs request its execution through the host-side launcher. This wrapper is necessary because, under Whole Compilation mode, a kernel can only be launched from the TU in which it is defined; you can't define the kernel body in one TU and launch it directly from another TU by including a header.
Optimizing shared kernel function templates in header files
If you’re shipping a header-only CUDA C++ library or a compiled binary with shared utility kernels using function templates, you face a challenge: function templates are instantiated at the call site.
Anti-pattern: Implicit template instantiation
Consider a kernel that supports both row-major and column-major 2D array layouts:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // anonymous namespace

template <typename T>
void kernel_launcher(T* ptr, bool is_row_major) {
  if (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach provides instances of both kernels to every TU that calls kernel_launcher, regardless of whether the user needs both.
Pattern: Explicit template parameters
The solution is to expose compile-time information as template parameters:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // anonymous namespace

template <typename T, bool is_row_major>
void kernel_launcher(T* ptr) {
  if constexpr (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach introduces intentionality. If users require both kernel instances, they can instantiate them explicitly. However, most downstream libraries will generally only need one, significantly reducing binary size.
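For instance, a downstream TU can opt in to exactly the instances it needs. The sketch below assumes the kernel_launcher template above is exposed through a header such as library/kernel.cuh; the header and function names are illustrative:
Downstream usage (consumer.cu):
#include <library/kernel.cuh>

// Only the row-major kernel is instantiated and compiled into this TU.
void run_row_major(float* ptr) {
  library::kernel_launcher<float, true>(ptr);
}

// A TU that genuinely needs both layouts instantiates both explicitly,
// paying the binary-size cost only where it is actually required.
void run_both(float* ptr, bool is_row_major) {
  if (is_row_major) {
    library::kernel_launcher<float, true>(ptr);
  } else {
    library::kernel_launcher<float, false>(ptr);
  }
}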
Note: This method also enables faster compilation and better runtime performance, because only the smallest required form of the kernel function template, with a constrained set of template parameters, is compiled. It also lets you bake in compile-time optimizations based on the instantiated templates.
Optimizing kernel function templates in source files
Even after eliminating duplicate kernel instances, there’s more work to do for massive kernels with multiple template types.
Anti-pattern: Template parameters for runtime arguments
When shipping a precompiled binary, every template parameter you introduce unnecessarily multiplies the number of kernel instances compiled into it. This is the opposite situation from function templates in header files, where templates are desirable because instantiation is deferred to the caller.
Example (detail/kernel.cuh):
namespace {

template <typename T, typename Lambda>
__global__ void kernel(T* ptr, Lambda lambda) {
  lambda(ptr);
}

}  // anonymous namespace
Usage (example.cu):
#include <detail/kernel.cuh>

namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, lambda_type_1<T>{});
  } else {
    kernel<<<...>>>(ptr, lambda_type_2<T>{});
  }
}

}  // namespace library
This approach inevitably creates two instances of the kernel in the precompiled binary.
Pattern: Convert templates to runtime arguments
When writing kernel function templates, always ask: “Can this template argument be converted to a runtime argument?” Whenever the answer is yes, refactor as follows:
Definition (detail/kernel.cuh):
enum class LambdaSelector {
  lambda_type_1,
  lambda_type_2
};

template <typename T>
struct lambda_type_1 {
  __device__ void operator()(T* val) {
    // do some op
  }
};

template <typename T>
struct lambda_type_2 {
  __device__ void operator()(T* val) {
    // do some other op
  }
};

namespace {

template <typename T>
__global__ void kernel(T* ptr, LambdaSelector lambda_selector) {
  if (lambda_selector == LambdaSelector::lambda_type_1) {
    lambda_type_1<T>{}(ptr);
  } else if (lambda_selector == LambdaSelector::lambda_type_2) {
    lambda_type_2<T>{}(ptr);
  }
}

}  // anonymous namespace
Usage (example.cu):
#include <detail/kernel.cuh>

namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_1);
  } else {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_2);
  }
}

}  // namespace library
Now only one kernel instance per value type is shipped, cutting this kernel's footprint in the binary roughly in half. In general, the impact of converting template arguments to runtime arguments scales with the number of instantiations removed: collapsing a cross product of N instantiations into one shrinks that kernel's contribution to the binary by a factor of N.
Note: This method enables faster compilation but may come at the cost of some runtime performance due to added kernel complexity and fewer compile-time optimizations.
Get started with cuML on PyPI
We're excited to bring cuML to PyPI. We hope the techniques shared here help other teams working with CUDA C++ achieve similar results and, when they build Python interfaces, share their work on PyPI.
For more tips as you build libraries on CUDA C++, check out the updated CUDA Programming Guide. To get started with CUDA, see An Even Easier Introduction to CUDA.