Starting with the 25.10 release, pip-installable cuML wheels can now be downloaded directly from PyPI. No more complex installation steps or managing Conda environments—just straightforward pip installation like any other Python package.
The NVIDIA team has been working hard to make cuML more accessible and efficient across the board. One of the biggest challenges has been managing the binary size of our CUDA C++ libraries, which affects both user experience and the ability to pip install from PyPI. Distributing wheels on pypi.org reaches a broader audience and lets users in corporate settings make the wheels available on internal PyPI mirrors.
PyPI limits binary size to keep costs for the Python Software Foundation (PSF) under control and to protect users from downloading unexpectedly large binaries. The complexity of the cuML library has historically required a larger binary than PyPI could host, but we've worked closely with the PSF to overcome this by reducing the binary size.
This post walks you through the new pip install path for cuML and a tutorial on the steps the team used to drop the CUDA C++ library binary size, which enabled the availability of cuML wheels on PyPI.
Installing cuML from PyPI
To install cuML from PyPI, use the following commands based on your system’s CUDA version. These packages have been optimized for compatibility and performance.
CUDA 13
Wheel size: ~250 MB
pip install cuml-cu13
CUDA 12
Wheel size: ~470 MB
pip install cuml-cu12
How the cuML team reduced binary size by ~30%
By applying careful optimization techniques, the NVIDIA team successfully reduced the CUDA 12 libcuml dynamic shared object (DSO) size from approximately 690 MB to 490 MB—a reduction of nearly 200 MB or ~30%.
Smaller binaries provide:
- Faster downloads from PyPI
- Reduced storage requirements for users
- Quicker container builds for deployment
- Lower bandwidth costs for distribution
Reducing binary size required a systematic approach to identifying and eliminating bloat in the CUDA C++ codebase. Later in the post, we share the techniques used to accomplish this, which can benefit any team working with CUDA C++ libraries. We hope these methods help library developers keep their binaries manageable and nudge the broader ecosystem of CUDA C++ libraries toward smaller binary sizes.
Why are CUDA binaries so large?
If you've ever shipped CUDA C++ code as a compiled binary, you've likely noticed that these libraries are significantly larger than equivalent C++ libraries offering similar features. CUDA C++ libraries contain numerous kernels (GPU functions) that account for the bulk of the binary size. The set of compiled kernel instances is essentially the cross product of:
- All template parameter combinations instantiated in the code
- All real GPU architectures the library supports, each compiled to real-ISA machine code, the final binary format executed on the GPU
As you add more features and support newer architectures, binary sizes can quickly become intractable. Even supporting just a single architecture results in binaries considerably larger than CPU-only libraries with the same feature set.
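To make the cross product concrete, here is a minimal sketch with a hypothetical kernel template; the value types, layout flag, and architecture list are illustrative rather than taken from cuML:
Hypothetical kernel template and instantiations (distance_kernel.cu):
template <typename T, bool RowMajor>
__global__ void distance_kernel(const T* in, T* out, int n) {
  // kernel body
}

// Explicit instantiations: 3 value types x 2 layouts = 6 kernels.
template __global__ void distance_kernel<float, true>(const float*, float*, int);
template __global__ void distance_kernel<float, false>(const float*, float*, int);
template __global__ void distance_kernel<double, true>(const double*, double*, int);
template __global__ void distance_kernel<double, false>(const double*, double*, int);
template __global__ void distance_kernel<int, true>(const int*, int*, int);
template __global__ void distance_kernel<int, false>(const int*, int*, int);

// Compiling for five real architectures multiplies this again, for example:
//   nvcc -c distance_kernel.cu \
//     -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 \
//     -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 \
//     -gencode arch=compute_90,code=sm_90
// 6 instantiations x 5 real architectures = 30 copies of kernel machine code.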
Note that the techniques shared here aren't a panacea for all binary size issues, and they don't cover every possible optimization method. We're highlighting some of the better practices that worked for us in cuML and other RAPIDS libraries like RAFT and cuVS. Keep in mind that the examples are somewhat general and that developers must often weigh tradeoffs between binary size and runtime performance.
Understanding CUDA Whole Compilation mode
Before diving into solutions, it’s crucial to understand how CUDA compilation works by default.
CUDA C++ libraries are typically compiled in Whole Compilation mode. This means that every Translation Unit (TU), that is, each .cu source file, that directly launches a kernel with triple chevron syntax (kernel<<<...>>>) includes its own compiled copy of that kernel. While the standard C++ link process removes duplicate symbols from the final binary, the CUDA C++ link process keeps every per-TU copy of the kernel.
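For example, if a utility kernel template lives in a header and two different TUs launch it, each TU gets its own compiled copy of that kernel, and both copies end up in the final shared library. The file and kernel names below are illustrative:
Shared kernel template (utils.cuh):
template <typename T>
__global__ void fill_kernel(T* ptr, T value, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { ptr[i] = value; }
}
First caller (a.cu):
#include "utils.cuh"

void fill_ones(float* ptr, int n) {
  fill_kernel<<<(n + 255) / 256, 256>>>(ptr, 1.0f, n);  // instantiates fill_kernel<float> in this TU
}
Second caller (b.cu):
#include "utils.cuh"

void fill_zeros(float* ptr, int n) {
  fill_kernel<<<(n + 255) / 256, 256>>>(ptr, 0.0f, n);  // instantiates a second copy of fill_kernel<float>
}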

To check whether there are duplicate kernel instantiations in your DSO, you can run the following command:
cuobjdump -symbols libcuml.so | grep STO_ENTRY | sort -b | uniq -c | sort -gb
Note: While enabling CUDA Separable Compilation can filter out duplicate kernels, it’s not a complete solution. In fact, enabling it by default may actually increase binary size and link time in some cases. For more details, see Build CUDA Software at the Speed of Light.
Removing duplicate kernel instances programmatically
The key to solving this problem is to separate the kernel function definition from the declaration, ensuring each kernel is compiled in exactly one TU. Here’s how to structure this:
Function declaration (kernel.hpp):
namespace library {
void kernel_launcher();
}
Function and kernel compilation in one TU only (kernel.cu):
#include <library/kernel.hpp>

namespace library {

__global__ void kernel() {
  // code body
}

void kernel_launcher() {
  kernel<<<...>>>();
}

}  // namespace library
Requesting kernel execution (example.cu):
#include <library/kernel.hpp>

void example() {
  library::kernel_launcher();
}
By separating the kernel definition from its declaration, the kernel is compiled into exactly one TU, and other TUs request its execution through the host-side launcher. This wrapper is necessary because, under Whole Compilation mode, a kernel can only be launched from the TU in which it is defined; you can't define the kernel body in one TU and launch it directly from another TU by including a header.
Optimizing shared kernel function templates in header files
If you’re shipping a header-only CUDA C++ library or a compiled binary with shared utility kernels using function templates, you face a challenge: function templates are instantiated at the call site.
Anti-pattern: Implicit template instantiation
Consider a kernel that supports both row-major and column-major 2D array layouts:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // anonymous namespace

template <typename T>
void kernel_launcher(T* ptr, bool is_row_major) {
  if (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach provides instances of both kernels to every TU that calls kernel_launcher, regardless of whether the user needs both.
Pattern: Explicit template parameters
The solution is to expose compile-time information as template parameters:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // anonymous namespace

template <typename T, bool is_row_major>
void kernel_launcher(T* ptr) {
  if constexpr (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach introduces intentionality. If users require both kernel instances, they can instantiate them explicitly. However, most downstream libraries will generally only need one, significantly reducing binary size.
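For instance, a downstream TU can opt in to exactly the instances it needs. The sketch below assumes the kernel_launcher template above is exposed through a header such as library/kernel.cuh; the header and function names are illustrative:
Downstream usage (consumer.cu):
#include <library/kernel.cuh>

// Only the row-major kernel is instantiated and compiled into this TU.
void run_row_major(float* ptr) {
  library::kernel_launcher<float, true>(ptr);
}

// A TU that genuinely needs both layouts instantiates both explicitly,
// paying the binary-size cost only where it is actually required.
void run_both(float* ptr, bool is_row_major) {
  if (is_row_major) {
    library::kernel_launcher<float, true>(ptr);
  } else {
    library::kernel_launcher<float, false>(ptr);
  }
}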
Note: This method also enables faster compilation and better runtime performance, because only the smallest required form of the kernel function template, with a constrained set of template parameters, is compiled. It also lets you bake in compile-time optimizations based on the instantiated templates.
Optimizing kernel function templates in source files
Even after eliminating duplicate kernel instances, there’s more work to do for massive kernels with multiple template types.
Anti-pattern: Template parameters for runtime arguments
When shipping a precompiled binary, every template parameter you introduce unnecessarily multiplies the number of kernel instances compiled into it. This is the opposite situation from function templates in header files, where templates are desirable because instantiation is deferred to the caller.
Example (detail/kernel.cuh):
namespace {

template <typename T, typename Lambda>
__global__ void kernel(T* ptr, Lambda lambda) {
  lambda(ptr);
}

}  // anonymous namespace
Usage (example.cu):
#include <detail/kernel.cuh>

namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, lambda_type_1<T>{});
  } else {
    kernel<<<...>>>(ptr, lambda_type_2<T>{});
  }
}

}  // namespace library
This approach inevitably creates two instances of the kernel in the precompiled binary.
Pattern: Convert templates to runtime arguments
When writing kernel function templates, always ask: “Can this template argument be converted to a runtime argument?” Whenever the answer is yes, refactor as follows:
Definition (detail/kernel.cuh):
enum class LambdaSelector {
  lambda_type_1,
  lambda_type_2
};

template <typename T>
struct lambda_type_1 {
  __device__ void operator()(T* val) {
    // do some op
  }
};

template <typename T>
struct lambda_type_2 {
  __device__ void operator()(T* val) {
    // do some other op
  }
};

namespace {

template <typename T>
__global__ void kernel(T* ptr, LambdaSelector lambda_selector) {
  if (lambda_selector == LambdaSelector::lambda_type_1) {
    lambda_type_1<T>{}(ptr);
  } else if (lambda_selector == LambdaSelector::lambda_type_2) {
    lambda_type_2<T>{}(ptr);
  }
}

}  // anonymous namespace
Usage (example.cu):
#include <detail/kernel.cuh>

namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_1);
  } else {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_2);
  }
}

}  // namespace library
Now only one kernel instance per value type is shipped, cutting this kernel's footprint in the binary roughly in half. In general, the impact of converting template arguments to runtime arguments scales with the number of instantiations removed: collapsing a cross product of N instantiations into one shrinks that kernel's contribution to the binary by a factor of N.
Note: This method enables faster compilation but may come at the cost of some runtime performance due to added kernel complexity and fewer compile-time optimizations.
Get started with cuML on PyPI
We're excited to bring cuML to PyPI. We hope the techniques shared here help other teams working with CUDA C++ achieve similar results and, when they build Python interfaces, share their work on PyPI.
For more tips as you build libraries on CUDA C++, check out the updated CUDA Programming Guide. To get started with CUDA, see An Even Easier Introduction to CUDA.