Streamline CUDA-Accelerated Python Install and Packaging Workflows with Wheel Variants

If you’ve ever installed an NVIDIA GPU-accelerated Python package, you’ve likely encountered a familiar dance: navigating to pytorch.org, jax.dev, rapids.ai, or a similar site to find the artifact built for your NVIDIA CUDA version. You then copy a custom pip, uv, or other installer command with a special index URL or special package name such as nvidia-<package>-cu{11-12}. This isn’t just an inconvenience; it represents a fundamental limitation in how Python packages handle hardware diversity in the modern computing landscape.

The current wheel format, designed for CPU-centric and relatively homogeneous computing environments, struggles with today's heterogeneous computing reality. To address this problem (and a few others), NVIDIA initiated the WheelNext open source initiative. The initiative aims to improve the user experience of the Python packaging ecosystem for scientific computing, AI, and high-performance computing (HPC) use cases. It represents a major commitment in the open source space to evolve and improve the Python ecosystem that so many depend on. Check out the WheelNext GitHub repo.

In collaboration with Meta, Astral, and Quansight, NVIDIA is today releasing experimental support in PyTorch 2.8.0 for a newly developed format called Wheel Variant. This new format makes it possible to describe Python artifacts to a very fine degree and to decide at install time which artifact best fits the platform. This post provides a feature preview that explains the proposed changes and how they work in the real world.

What are the technical challenges with CUDA compatibility? 

The Python wheel format uses tags to identify compatible platforms: Python version, ABI, and platform, such as cp313-cp313-linux_x86_64. While these tags work well for CPU-based packages, they lack the granularity needed for specialized builds targeting GPUs or specific CPU instruction sets (AVX512, ARMv9, and so on). For example, a single linux_x86_64 tag says nothing about any additional hardware required to run a GPU-enabled package. This granularity gap forces package maintainers into suboptimal distribution strategies.
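
For a concrete look at the granularity gap, the packaging library (the same tag logic pip relies on) can list the tags a standard installer considers on the current machine:

# List the platform tags a standard installer considers on this machine,
# using the packaging library (the same tag logic pip relies on).
from packaging.tags import sys_tags

for tag in list(sys_tags())[:3]:
    print(tag)  # for example: cp313-cp313-manylinux_2_28_x86_64

# None of these tags can express "requires CUDA 12.x" or "uses AVX512-BF16".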

Adding to this complexity is the often-misunderstood relationship between the different CUDA components:

  • Kernel Mode Driver (KMD): The low-level kernel driver (nvidia.ko on Linux) interfacing NVIDIA hardware with the OS kernel.
  • CUDA User Mode Driver (UMD): The user-space driver library (libcuda.so on Linux) that applications use to run code on NVIDIA GPUs.
  • CUDA Runtime: The high-level user-facing API (for example, cudaMemcpy) that most CUDA libraries and applications use (libcudart.so on Linux).
  • CUDA Toolkit: The complete development environment including compilers, libraries, and tools.

Each component has different compatibility rules that are central to the distribution problem.

What is the Wheel Variant format? 

The Wheel Variant format is a soon-to-be proposed Python packaging standard aiming to evolve Python packaging for the heterogeneous computing era. Wheel Variants extend the current wheel format to enable multiple wheels for the same package version, Python ABI, and platform, each optimized for specific hardware configurations. 

Try it:

# Linux
curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh sh

# Windows
powershell -c { $env:INSTALLER_DOWNLOAD_URL = 'https://wheelnext.astral.sh'; irm https://astral.sh/uv/install.ps1 | iex }

uv pip install torch

A collaborative solution by and for the scientific computing Python communities

Instead of targeting the lowest common denominator, package authors can now specialize Python artifacts for very specific hardware, enabling significant improvements in end-user experience and performance.

The Wheel Variant design proposes an elegantly simple yet powerful syntax to specify, identify, and describe each artifact with “variant properties” following a standardized format:

namespace :: feature :: value

A few examples of variant properties include: 

  • nvidia :: cuda_version_lower_bound :: 12.0 specifies CUDA User-Mode Driver >=12.0
  • nvidia :: sm_arch :: 100_real specifies a package built for NVIDIA GPU "real architecture 100" (as expressed by the CMake CUDA architectures flag)
  • x86_64 :: level :: v3 specifies x86-64-v3 CPU architecture support 
  • x86_64 :: avx512_bf16 :: 1 specifies use of X86_64 instruction: AVX512-BF16 
  • aarch64 :: version :: 8.1a specifies ARM architecture version 8.1a 

Each variant, or configuration, is then uniquely identified using either a custom label, manually provided at build time, or an automatically generated SHA-256 hash of the variant properties.

The label or hash is then incorporated in the wheel filename as follows:

torch-2.8.0-cp313-cp313-linux_x86_64-cu128.whl     # Custom label
torch-2.8.0-cp313-cp313-linux_x86_64-a7f3c2d9.whl  # Hash-based
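
For intuition, here is a minimal sketch of deriving such a hash-based identifier. The exact canonicalization rules are defined by variantlib and the draft proposal, so treat this as illustrative only:

# Illustrative only: a truncated SHA-256 over the sorted variant properties.
# The real canonicalization is specified by variantlib, not by this sketch.
import hashlib

properties = [
    "nvidia :: cuda_version_lower_bound :: 12.8",
    "nvidia :: sm_arch :: 100_real",
]
digest = hashlib.sha256("\n".join(sorted(properties)).encode()).hexdigest()
print(digest[:8])  # an 8-character identifier in the style of a7f3c2d9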

This design ensures the following properties:

  • Name conflicts are avoided: each variant of an otherwise identical Python platform wheel gets a unique identifier.
  • An optional human-readable label can highlight the intended purpose of the variant (for example, cu128 indicates built for CUDA 12.8 or above).
  • A variant filename never matches the pre-existing wheel filename regex, which guarantees that variant wheels do not confuse a non-variant-enabled Python package installer (see the demonstration below).
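
You can check that last property with the packaging library, which implements the standard wheel filename rules:

# Variant filenames deliberately fail standard wheel-filename parsing,
# so older installers skip them instead of installing the wrong artifact.
from packaging.utils import InvalidWheelFilename, parse_wheel_filename

try:
    parse_wheel_filename("torch-2.8.0-cp313-cp313-linux_x86_64-cu128.whl")
except InvalidWheelFilename as exc:
    print("ignored by non-variant installers:", exc)

# The plain (non-variant) filename still parses fine:
print(parse_wheel_filename("torch-2.8.0-cp313-cp313-linux_x86_64.whl"))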

How does the plugin architecture work?

The magic happens through provider plugins, which are purpose-specific modules that detect local software and hardware capabilities and configurations. They analyze the local environment and guide package selection.

When you run [uv] pip install torch, the installer queries installed plugins to understand your system’s capabilities.

The declared variant provider plugins might detect that you have (see the detection sketch after this list):

  • CUDA driver 12.9 installed 
  • An NVIDIA RTX 4090 GPU (compute capability 8.9)
  • Support for specific CPU instructions: (for example, AVX512-BF16)
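
As a rough illustration, here is the kind of probing a provider plugin can perform through the CUDA driver API. This is not the NVIDIA plugin's actual code, only the standard libcuda calls such a plugin can rely on:

# Probe the CUDA User-Mode Driver (libcuda) for the driver version and the
# compute capability of GPU 0. Sketch only; not the real provider plugin.
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")  # the CUDA UMD
cuda.cuInit(0)

version = ctypes.c_int()
cuda.cuDriverGetVersion(ctypes.byref(version))  # e.g. 12090 for CUDA 12.9
print(f"UMD version: {version.value // 1000}.{version.value % 1000 // 10}")

device = ctypes.c_int()
cuda.cuDeviceGet(ctypes.byref(device), 0)
major, minor = ctypes.c_int(), ctypes.c_int()
# 75 and 76 are CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR / _MINOR
cuda.cuDeviceGetAttribute(ctypes.byref(major), 75, device)
cuda.cuDeviceGetAttribute(ctypes.byref(minor), 76, device)
print(f"compute capability: {major.value}.{minor.value}")  # e.g. 8.9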

Based on this information, the installer automatically selects the optimal wheel variant: 

  • No more manual CUDA version selection 
  • No more downloading the incorrect “flavor” of PyTorch 
  • No more guessing; the best-fitting package is automatically installed

Figure 1. Wheel Variant installation workflow: the user runs the installer, plugins detect the platform, and the installer selects the best variant

Crucially, Wheel Variants maintain full backward compatibility. Older pip versions that don’t understand variants simply ignore them, ensuring existing infrastructure continues to work. The metadata lives in three places: 

  • pyproject.toml for build configuration
  • variant.json inside the wheel
  • *-variants.json on the package index for efficient variant discovery

This design allows gradual ecosystem adoption without breaking changes.
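
Because the in-wheel copy is just a file, it can be inspected with nothing more than the standard library. The location (variant.json) comes from the proposal, but the schema is still being finalized, so this sketch only locates and prints whatever is there:

# Peek at the variant metadata bundled inside a variant wheel.
# The schema is still in flux; this only locates and prints the file.
import json
import zipfile

with zipfile.ZipFile("torch-2.8.0-cp313-cp313-linux_x86_64-cu128.whl") as whl:
    path = next(n for n in whl.namelist() if n.endswith("variant.json"))
    print(json.dumps(json.loads(whl.read(path)), indent=2))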

Example NVIDIA GPU-specific implementation

The NVIDIA variant plugin implements a priority system, shaped by the most common pain points observed in GPU package installation, to handle the complexity of GPU environments:

  • Priority 1 (P1) – libcuda (User-Mode Driver: UMD) version detection: The most critical feature. The UMD version determines which CUDA runtime versions can be used; mismatches here are the most common source of installation failures today.
  • Priority 2 (P2) – Compute capability: Determines whether a wheel contains binary code compatible with the GPU architecture of the system the wheel is being installed on.

To provide an example, consider an NVIDIA GPU user with CUDA driver 12.8 who runs:

[uv] pip install torch

What’s happening behind the scenes?

  1. The NVIDIA plugin detects the driver version and compute capability.
  2. It queries available variants for PyTorch.
  3. It finds variants such as:
    • torch-2.8.0-...-00000000.whl – CPU-only build, the fallback if nothing else matches (called the null variant)
    • torch-2.8.0-...-cu126.whl – CTK 12.6, compatible with LibCUDA 12.0 and above
    • torch-2.8.0-...-cu128.whl – CTK 12.8, compatible with LibCUDA 12.8 and above 
    • torch-2.8.0-...-cu129.whl – CTK 12.9, compatible with LibCUDA 12.9 and above
  4. It selects the CUDA 12.8 variant as the best match, then downloads and installs it.

The system handles edge cases elegantly. If a CUDA environment is not detected, it can fall back to the null variant.
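
The following sketch condenses that resolution logic. It is illustrative pseudologic, not variantlib's real implementation; it only mirrors the P1 rule described above (prefer the highest CUDA lower bound the detected driver still satisfies, else fall back to the null variant):

# Illustrative pseudologic for P1-based selection; not the real variantlib code.
detected_umd = (12, 8)  # reported by the NVIDIA provider plugin

variants = {
    "00000000": None,  # null variant: no CUDA requirement at all
    "cu126": (12, 0),  # CTK 12.6 build, needs libcuda >= 12.0
    "cu128": (12, 8),  # CTK 12.8 build, needs libcuda >= 12.8
    "cu129": (12, 9),  # CTK 12.9 build, needs libcuda >= 12.9
}

def best_variant(detected, candidates):
    # Keep CUDA variants whose lower bound the detected driver satisfies,
    # then prefer the tightest (highest) bound; otherwise use the null variant.
    ok = {label: bound for label, bound in candidates.items()
          if bound is not None and bound <= detected}
    return max(ok, key=ok.get) if ok else "00000000"

print(best_variant(detected_umd, variants))  # -> cu128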

What are the ecosystem benefits for Python package end users?

The benefits for Python package users are immediate and substantial:

  • Zero-configuration installation: No need to visit pytorch.org (or other package selectors such as RAPIDS, JAX, and so on) to figure out the best-matching package. Just use [uv] pip install <package> and it works.
  • Optimal performance by default: You automatically get code compiled for your specific hardware, not a lowest-common-denominator build.
  • User control when needed: Power users can still override selection with --no-variant to force generic wheels or <package>#<label_name> to pick a specific variant.

What are the ecosystem benefits for package maintainers?

Wheel Variants solve long-standing distribution headaches for package maintainers, including:

  • Simplified release matrix: Instead of maintaining separate packages (torch-cpu, torch-cu126, torch-cu128, torch-cu129, and so on) or multiple indexes, maintainers publish variants of a single package.
  • Targeted optimization: Build variants optimized for specific architectures without worrying about bloating the download for everyone else.
  • Future-proof architecture: When NVIDIA releases a new GPU architecture, you can add a new variant without breaking existing users.
  • Reduced support burden: Fewer installation issues mean less time spent helping users navigate CUDA compatibility.

What are the ecosystem benefits for NVIDIA GPU users?

Wheel Variants address several strategic user experience issues, including:

  • Solving dependency chains: The system can express variant-specific dependencies. For example, torch can declare dependencies per variant while guaranteeing uniform metadata across all builds, a feature long desired by the maintainers of package installers such as uv and Poetry.
  • Ecosystem coherence: All NVIDIA libraries (cuDNN, cuBLAS, NCCL) can use consistent variant schemes, making the entire stack more predictable.
  • Innovation enablement: New NVIDIA GPU features can be exposed through new variants without waiting for ecosystem-wide coordination.

What are the broader applications of Wheel Variants beyond PyTorch and NVIDIA GPU computing?

The impact of Wheel Variants extends far beyond PyTorch and NVIDIA GPU computing. Several real-world applications are detailed below. 

Research computing environments

In academic settings, NVIDIA GPU clusters often have heterogeneous hardware. Some nodes use NVIDIA A100 GPUs, others use NVIDIA H100 GPUs, and newer nodes use NVIDIA GB200 GPUs. Today, researchers must carefully manage environment modules and installation scripts.

With Wheel Variants:

# On any node, regardless of NVIDIA GPU available
# Result: automatically gets an A100-, H100-, or GB200-optimized, compatible build
uv pip install torch transformers accelerate

The system's caching ensures that platform detection happens once, not on every install.

Wheel size optimization

Docker images for ML and AI workloads are notoriously large, often exceeding 10 GB. Much of this comes from bundled GPU libraries for all possible architectures.

With Wheel Variants, a future of sharded wheels becomes possible, if package maintainers decide to shard their packages:

# Build stage can target specific variants
FROM nvidia/cuda:12.9.1-base-ubuntu24.04

# Sharded wheels could lead to significantly smaller docker images
RUN uv pip install torch#sm120  # Select the variant labeled sm120

For organizations deploying hundreds of container instances, the bandwidth and storage savings multiply quickly.

Problem solving across the Python ecosystem

While the NVIDIA GPU use case has been highlighted in this post, Wheel Variants solve problems across the Python ecosystem, including:

  • SciPy can offer variants built against different BLAS implementations (OpenBLAS versus Intel MKL versus Apple Accelerate).
  • WebAssembly projects can provide variants with or without threading support (wasm32-wasi with and without pthreads).
  • MPI applications like mpi4py can target specific MPI implementations (OpenMPI versus MPICH versus Intel MPI).
  • Game development libraries can ship variants with different graphics backends (Vulkan versus DirectX versus Metal).

The variant system is extensible to any hardware or software capability that can be expressed through the variant semantic:

namespace :: feature :: value

How to build Wheel Variants 

This section covers the package maintainer user story and highlights the workflow envisioned for building and publishing Wheel Variants. There are many possible strategies, ranging from the repackaging CLI tool to direct integration within most of the major build frontends and backends.

Convert a Standard Wheel into a Wheel Variant 

To facilitate the transition, variantlib offers a CLI tool that converts an existing artifact into a Wheel Variant. This makes it possible to ship experimental Wheel Variant packages today without deploying an entirely new build process. For more details, see the wheelnext/variantlib GitHub repo.

variantlib make-variant \
   -f "torch-2.8.0-cp313-cp313-manylinux_2_28_x86_64.whl" \
   -o "output_dir/" \
   --pyproject_toml "torch/pyproject.toml" \
   --variant-label "cu129" \
   --property "nvidia :: cuda_version_lower_bound :: 12.9"

Flit and Flit-Core

Flit does not support building Python C extensions; however, it can be useful for packaging prebuilt artifacts. For more details, see the wheelnext/flit GitHub repo.

# Build a variant for x86-64-v3 architecture
flit build --format wheel \
   --variant-property "x86_64 :: level :: v3" \
   --variant-label "x8664_v3"

# Build a variant for ARM AArch64 8.1a architecture
flit build --format wheel \
   --variant-property "aarch64 :: version :: 8.1a" \
   --variant-label "arm_81a"

# Build a CUDA >=12.8 variant with custom label
flit build --format wheel \
   --variant-property "nvidia :: cuda_version_lower_bound :: 12.8" \
   --variant-label "cu128"

# Build a CUDA >=12.0,<13 variant with custom label
flit build --format wheel \
   --variant-property "nvidia :: cuda_version_lower_bound :: 12" \
   --variant-property "nvidia :: cuda_version_upper_bound :: 13" \
   --variant-label "cu12"

# Build a null variant (no variant match => fallback)
flit build --format wheel --null-variant

Hatch and Hatchling

Hatch and Hatchling also do not support building Python C extensions but can be useful for packaging prebuilt artifacts. For more details, see the wheelnext/hatch GitHub repo.

# Build a variant for x86-64-v3 architecture
hatch build --target wheel \
   --variant-property "x86_64 :: level :: v3" \
   --variant-label "x8664_v3"

# Build a variant for ARM AArch64 8.1a architecture
hatch build --target wheel \
   --variant-property "aarch64 :: version :: 8.1a" \
   --variant-label "arm_81a"

# Build a CUDA >=12.8 variant with custom label
hatch build --target wheel \
   --variant-property "nvidia :: cuda_version_lower_bound :: 12.8" \
   --variant-label "cu128"

# Build a CUDA >=12.0,<13 variant with custom label
hatch build --target wheel \
   --variant-property "nvidia :: cuda_version_lower_bound :: 12" \
   --variant-property "nvidia :: cuda_version_upper_bound :: 13" \
   --variant-label "cu12"

# Build a null variant (no variant match => fallback)
hatch build --target wheel --null-variant

Meson-Python

Package maintainers can start experimenting with variants using the modified meson-python build backend:

# Build a variant for x86-64-v3 architecture
python -m build -w -Cvariant="x86_64 :: level :: v3" -Cvariant-label=x8664_v3

# Build a variant for ARM AArch64 8.1a architecture
python -m build -w -Cvariant="aarch64 :: version :: 8.1a" -Cvariant-label=arm_81a

# Build a CUDA 12.8+ variant with custom label
python -m build -w -Cvariant="nvidia :: cuda_version_lower_bound :: 12.8" -Cvariant-label=cu128

# Build a null variant (no variant match => fallback)
python -m build -w -Cnull-variant

The build system passes variant information to the compilation process, enabling targeted optimization while maintaining a single source tree. For more details, see the wheelnext/meson-python GitHub repo.

Scikit-Build-Core

This work is in progress. Stay tuned for updates and more information.

What is the implementation road map for Wheel Variants?

The Wheel Variants initiative is moving from concept to reality through careful, collaborative development.

  • PyTorch 2.8.0: Includes experimental support for Wheel Variants. This is explicitly a testing release from the PyTorch team, in partnership with the open source WheelNext project, Quansight, Astral, and NVIDIA. The experiment will allow the team to gather real-world feedback before the Python Enhancement Proposal (PEP) is published.
  • PEP in draft status: Community review and refinement ensure that all stakeholders can contribute to shaping the standard.
  • Reference implementations exist: The variantlib library provides the core functionality, while prototype uv and pip implementations demonstrate the operating model and coexistence with existing tools and the broader ecosystem.
  • The NVIDIA plugin is under active development: It implements the priority system described in this post.

Looking ahead, the road map prioritizes ecosystem compatibility and gradual adoption.

  • Near term: Experimental phase with PyTorch 2.8, gathering feedback
  • Mid term: PEP finalization based on real-world experience
  • Long term: Broader tool support (installers, build backends and frontends, package indexes) and ecosystem-wide adoption

Importantly, this is a collaborative effort. Success depends on a coordinated effort from all corners of the scientific Python community recognizing the value and adopting the standard.

Conclusion

Wheel Variants represent more than a technical improvement. They’re a fundamental evolution in how Python packages handle our increasingly diverse computing landscape. For the NVIDIA ecosystem, they solve immediate pain points while establishing GPU computing as a fundamental aspect of Python packaging.

The collaboration between PyTorch, Astral, Quansight, NVIDIA, and the WheelNext project demonstrates the Python community at its best: identifying shared challenges and building solutions that benefit everyone. The PyTorch 2.8 experimental release marks the threshold of a new era in Python packaging, one where [uv] pip install <package> just works, optimized for your exact local hardware and software configuration.

We invite you to join this journey. Test the experimental releases, provide feedback, and help shape the future of Python packaging. Together, we can ensure that the next generation of Python developers never has to manually select a CUDA version again.

For more information and to get involved, visit wheelnext.dev and watch for PyTorch 2.8 experimental Wheel Variants support.

We encourage you to deepen your understanding of the Wheel Variant proposal by reading the related blog posts from our collaborators.

Get involved

The Wheel Variants initiative needs community participation to succeed. Here’s how different groups can contribute:

Developers and users:

  • Test the PyTorch 2.8 experimental support. Real-world feedback is invaluable.
  • Report issues and edge cases you encounter. The system must handle diverse environments.
  • Share your use cases that could benefit from variants.

Package maintainers:

  • Explore variants for your packages. The wheelnext.dev website, the Wheel Variant proposal, and the WheelNext GitHub repo provide comprehensive documentation.
  • Consider your hardware matrix. What variants would benefit your users?
  • Provide feedback on the API. The PEP is still in draft status. Your input shapes the final proposal.

Join us at the PyTorch Conference 2025 for the talk, Hardware-Aware Python Packages ~ PyTorch and WheelNext Grab the Wheel.

Acknowledgments

This work was only possible thanks to the work of many open source contributors and WheelNext community members. To name a few: Michał Górny (Quansight), Ralf Gommers (Quansight), Charlie Marsh (Astral), Konstantin Schütze (Astral), Zanie Blue (Astral), Andrey Talman (Meta), Eli Uriegas (Meta), Chris Gottbrath (Meta), alongside many NVIDIANs: Andy Terrel, Barry Warsaw, Emma Smith, Michael Sarahan, Vyas Ramasubramani, Bradley Dice, Robert Maynard, Hyunsu Cho, Ralf W. Grosse Kunstleve, Leo Fang, Keith Kraus, Piotr Bialecki, Frederic Bastien, Jeremy Tanner, David Edelsohn, Jay Gould, Scott Suchyta, and Ankit Patel.
