
Train an LLM on NVIDIA Blackwell with Unsloth—and Scale for Production

Fine-tuning and reinforcement learning (RL) for large language models (LLMs) require advanced expertise and complex workflows, making them out of reach for many. The open source Unsloth project changes that by streamlining the process, making it easier for individuals and small teams to explore LLM customization. When paired with the efficiency and throughput of the NVIDIA Blackwell GPUs, this combination helps democratize access to LLM development, opening the door for a wider community of practitioners to innovate.

This post explains how developers can train custom LLMs locally on NVIDIA RTX PRO 6000 Blackwell Series, GeForce RTX 50 Series, and NVIDIA DGX Spark using Unsloth. It also covers how these same workflows scale seamlessly into Blackwell-powered cloud instances, such as NVIDIA DGX Cloud and those from NVIDIA Cloud Partners, for production workloads.

What is Unsloth?

Unsloth is an open source framework that simplifies and accelerates LLM fine-tuning and RL. It uses custom Triton kernels and algorithms to deliver:

  • 2x faster training throughput
  • 70% less VRAM usage
  • No accuracy loss

It supports popular models such as Llama, gpt-oss, and DeepSeek, and is now optimized for NVIDIA Blackwell GPUs with NVFP4 precision.

With support from the NVIDIA DGX Cloud AI team, Unsloth extends from consumer GPUs, such as the GeForce RTX 50 Series, RTX PRO 6000 Blackwell Series, and NVIDIA GB10-based developer workstations (such as the NVIDIA DGX Spark), to enterprise-class NVIDIA HGX B200 and NVIDIA GB200 NVL72 systems. This makes fine-tuning accessible to everyone.

How does Unsloth perform on NVIDIA Blackwell? 

Unsloth's benchmarks on NVIDIA Blackwell show significant gains over other optimized setups, including Hugging Face with Flash Attention 2 (FA2). Specifically, Unsloth delivers:

  • 2x increase in training speed
  • 70% VRAM reduction (even for 70B+ parameter models)
  • 12x longer context windows

These results mean that you can now fine-tune models with as many as 40 billion parameters on a single Blackwell GPU.

Test setup: NVIDIA GeForce RTX 5090 GPU with 32 GB of VRAM, Alpaca dataset, batch size = 2, gradient accumulation = 4, rank = 32, QLoRA applied on all linear layers.
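
To make the test setup concrete, the following sketch works out two quantities implied by those settings: the effective batch size per optimizer step, and the number of trainable parameters that LoRA adds to one linear layer at rank 32. The single-GPU count and the 4096 x 4096 layer shape are illustrative assumptions, not values stated in the benchmark.

```python
# Settings taken from the test setup above.
per_device_batch_size = 2
gradient_accumulation_steps = 4
lora_rank = 32

# Assumption for illustration: training runs on a single GPU.
num_gpus = 1

# Number of samples that contribute to each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Hypothetical layer shape, chosen only to illustrate the arithmetic.
d_in, d_out = 4096, 4096
added = lora_params(d_in, d_out, lora_rank)
full = d_in * d_out

print(f"Effective batch size: {effective_batch_size}")
print(f"LoRA parameters for one layer: {added:,} ({added / full:.2%} of {full:,})")
```

Because only the low-rank factors are trained, the adapter is a small fraction of the layer it attaches to, which is one reason QLoRA-style fine-tuning fits in consumer VRAM.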

Model | VRAM | Unsloth speed | VRAM reduction | Longer context | Hugging Face + FA2
Llama 3.1 (8B) | 80 GB | 2x | >70% | 12x longer | 1x
Table 1. Performance benchmarks for Unsloth on a GeForce RTX 5090 GPU

VRAM | Unsloth context length | Hugging Face + FA2 context length
8 GB | 2,972 | OOM
12 GB | 21,848 | 932
16 GB | 40,724 | 2,551
24 GB | 78,475 | 5,789
32 GB | 122,181 | 9,711
Table 2. Detailed benchmarks for different context lengths for Unsloth on a GeForce RTX 5090 GPU
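
As a quick check on how the context-length figures relate to each other, the sketch below computes the ratio of the Unsloth context length to the Hugging Face + FA2 context length for each VRAM size in Table 2. The numbers are copied from the table; the variable and function names are illustrative.

```python
# Maximum context lengths (in tokens) copied from Table 2:
# VRAM in GB -> (Unsloth, Hugging Face + FA2).
# None marks an out-of-memory (OOM) result.
context_lengths = {
    8: (2_972, None),
    12: (21_848, 932),
    16: (40_724, 2_551),
    24: (78_475, 5_789),
    32: (122_181, 9_711),
}


def context_ratio(unsloth_len, baseline_len):
    """Return how many times longer the Unsloth context is, or None on OOM."""
    if baseline_len is None:
        return None
    return unsloth_len / baseline_len


ratios = {
    vram: context_ratio(unsloth_len, baseline_len)
    for vram, (unsloth_len, baseline_len) in context_lengths.items()
}

for vram, ratio in ratios.items():
    label = "n/a (baseline OOM)" if ratio is None else f"{ratio:.1f}x"
    print(f"{vram:>2} GB: {label}")
```

At 32 GB the ratio comes out to roughly 12.6x, in line with the 12x figure quoted above, and the ratio is larger at the smaller VRAM sizes.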

How to set up Unsloth on NVIDIA GPUs

Setting up Unsloth is straightforward, whether you prefer a quick pip install, an isolated virtual environment, or a containerized Docker deployment. Try the following examples on any Blackwell-generation GPU, including the GeForce RTX 50 Series.

pip install unsloth

Running a 20B model

The following example shows how you might load the gpt-oss-20b model with 4-bit quantization:

from unsloth import FastLanguageModel
import torch
max_seq_length = 1024
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)
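
To see why 4-bit loading matters on a 32 GB card, the following back-of-the-envelope sketch estimates the memory needed for model weights alone. It assumes a nominal 20 billion parameters and counts only the raw weight bits; quantization metadata, activations, LoRA adapters, optimizer state, and framework overhead are deliberately ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 20B-parameter model, as suggested by the model name.
nominal_params = 20e9

for bits in (16, 8, 4):
    gb = weight_memory_gb(nominal_params, bits)
    print(f"{bits:>2}-bit weights: ~{gb:.0f} GB")
```

Under these assumptions, 16-bit weights alone would exceed 32 GB, while 4-bit weights leave headroom for the rest of the training state.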

Docker deployment

Unsloth also offers a prebuilt Docker image, which is supported on NVIDIA Blackwell GPUs.

Note that the Docker container requires the NVIDIA Container Toolkit to be installed on your host system.
Before running the following command, replace the placeholder Jupyter password with your own and adjust the port and volume mappings as needed:

docker run -d -e JUPYTER_PASSWORD="mypassword" \
  -p 8888:8888 -p 2222:22 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth

Using an isolated environment

Run the following commands from the shell to install Unsloth in a Python virtual environment:

python -m venv unsloth
source unsloth/bin/activate
pip install unsloth

Note: Depending on your system, you may need to use pip3 / pip3.13 and python3 / python3.13.
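
After activating the environment, it can help to confirm that Python is actually running inside it and that the package is visible. The following is a minimal sketch using only the standard library; the helper names are illustrative and not part of Unsloth.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """True when the interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(package: str) -> bool:
    """True when a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
print(f"unsloth importable: {is_installed('unsloth')}")
```

If the second line prints False, the environment was likely not activated in the current shell, and pip may have installed Unsloth elsewhere.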

Handling issues with xFormers 

If you encounter issues with xFormers, build from source. 

First, uninstall any existing xFormers:

pip uninstall xformers -y

Next, clone and build:

pip install ninja
export TORCH_CUDA_ARCH_LIST="12.0"
git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
cd xformers && python setup.py install && cd ..
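
The TORCH_CUDA_ARCH_LIST value above should match the compute capability of the GPU you are building for. As a hedged sketch, the following uses PyTorch's torch.cuda.get_device_capability to report the value for the local GPU; the helper names are illustrative, and the script falls back gracefully when PyTorch or a CUDA device is unavailable.

```python
from typing import Optional, Tuple


def arch_list_entry(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an arch-list entry."""
    major, minor = capability
    return f"{major}.{minor}"


def detect_arch_list_entry(device_index: int = 0) -> Optional[str]:
    """Return the arch-list entry for a local GPU, or None if unavailable."""
    try:
        import torch  # Imported lazily so the helper works without PyTorch.
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return arch_list_entry(torch.cuda.get_device_capability(device_index))


if __name__ == "__main__":
    entry = detect_arch_list_entry()
    if entry is None:
        print("No CUDA device visible to PyTorch (or PyTorch is not installed).")
    else:
        print(f'export TORCH_CUDA_ARCH_LIST="{entry}"')
```

Different Blackwell products may report different compute capabilities, so it is worth checking the printed value before reusing the 12.0 setting on another system.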

Using uv

If you prefer to use uv, install Unsloth using the following command:

uv pip install unsloth

While Unsloth enables local experimentation with 20B and 40B models on a single Blackwell GPU, the same workflows are fully portable to NVIDIA DGX Cloud and NVIDIA Cloud Partners. This enables scaling to clusters of Blackwell GPUs for fine-tuning 70B+ models, reinforcement learning, and enterprise workloads without changing a line of code.

Get started transforming LLM training runs

From experimentation to production, NVIDIA DGX Cloud and NVIDIA Cloud Partners deliver the power to train and fine-tune at any scale—combining elastic compute, enterprise storage, and real-time monitoring in fully managed AI environments optimized for NVIDIA GPUs.

According to Unsloth Co-Founder Daniel Han, “AI shouldn’t be an exclusive club. The next great AI breakthrough could come from anywhere—students, individual researchers, or small startups. Unsloth is here to ensure they have the tools they need.”

Start locally on your NVIDIA GeForce RTX 50 Series GPU, NVIDIA RTX PRO 6000 Blackwell Series GPU, or NVIDIA DGX Spark system to fine-tune models with Unsloth. Then scale seamlessly with NVIDIA DGX Cloud or an NVIDIA Cloud Partner to harness clusters of Blackwell GPUs with enterprise-grade reliability and visibility—all without compromise. Check out the step-by-step guide to fine-tuning LLMs with NVIDIA Blackwell GPUs and Unsloth, and how to install the software on NVIDIA DGX Spark.
