
Cut Checkpoint Costs with About 30 Lines of Python and NVIDIA nvCOMP

Apr 09, 2026

By Wenqi Glantz, Eugene Zhidkov and Makan Taghavi


AI-Generated Summary


Synchronous checkpointing during large-scale LLM training leads to significant GPU idle costs, often exceeding $200,000 per month for 128 NVIDIA Blackwell GPUs on 405B models, with optimizer state (FP32) being the dominant component of checkpoint size.
Integrating NVIDIA nvCOMP enables GPU-accelerated, lossless compression (ZSTD and gANS), reducing checkpoint sizes by about 12-15% (1.14-1.18x) for dense and MoE models, reclaiming GPU idle time, and translating to modeled monthly savings of about $40,000 for a 405B run on 128 GPUs.
Compression throughput becomes crucial as storage speed increases: ZSTD (~16 GB/s) still helps on shared network filesystems (around 5 GB/s) but becomes the bottleneck at 25 GB/s, while gANS offers a slightly better ratio at roughly 30x the throughput, making it the better fit for high-speed GPUDirect Storage (15+ GB/s).

AI-generated content may summarize information incompletely. Verify important information.

Training LLMs requires periodic checkpoints. These full snapshots of model weights, optimizer states, and gradients are saved to storage so training can resume after interruptions. At scale, these checkpoints become massive (782 GB for a 70B model) and frequent (every 15-30 minutes), generating one of the largest line items in a training budget. Most AI teams chase GPU utilization, training throughput, and model quality. Almost none look at what checkpointing is costing them.

This is an expensive oversight. The synchronous checkpoint overhead of a 405B model on 128 NVIDIA Blackwell GPUs alone can cost $200,000 a month. By introducing a lossless compression step implemented with about 30 lines of Python, we can reduce that overhead by $40,000 every month with gANS. Mixture of experts (MoE) models save more in absolute terms because their checkpoints and clusters are larger. We’ll break down how we got to that calculation and how NVIDIA nvCOMP can improve checkpointing efficiency in this blog post.

Inside a single checkpoint

Hardware interruptions at 1000+ GPU scale aren’t rare. Meta reported 419 unexpected interruptions across 54 days of Llama 3 training on 16,384 NVIDIA H100 GPUs (~one every 3 hours). This is why most teams checkpoint every 15-30 minutes; it’s load-bearing infrastructure, not optional overhead.

Component	Size	Contents
Model weights (BF16)	130 GB	70B params × 2 bytes
Optimizer state (FP32 momentum + variance)	521 GB	70B params × (4 + 4) bytes
Gradients (BF16)	130 GB	Same shape as weights
Total per checkpoint	782 GB

Table 1. Checkpoint composition

Sizes use binary units (GiB): 70B × 2 bytes = 140 GB in decimal ≈ 130 GiB. We label these “GB” by common convention throughout this post.

This breakdown surprises people who see it for the first time. The optimizer state (AdamW’s first and second moment estimates, both stored in FP32) is 4x the size of the model weights. It’s the bulk of every checkpoint, and the reason checkpoints are far larger than a deployed model.
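The arithmetic behind Table 1 can be reproduced in a few lines. This is a minimal sketch, assuming 2 bytes per parameter for BF16 weights and gradients and 4 + 4 bytes per parameter for the two FP32 AdamW moments; the function name `checkpoint_sizes_gib` is ours and exists only for illustration.

```python
GIB = 2**30  # the post labels binary gibibytes as "GB"

def checkpoint_sizes_gib(n_params: float) -> dict:
    """Per-component checkpoint size in GiB (BF16 weights/gradients, FP32 AdamW state)."""
    bytes_per_param = {
        "weights_bf16": 2,
        "optimizer_fp32": 4 + 4,  # momentum + variance
        "gradients_bf16": 2,
    }
    sizes = {name: n_params * b / GIB for name, b in bytes_per_param.items()}
    sizes["total"] = sum(sizes.values())
    return sizes

sizes_70b = checkpoint_sizes_gib(70e9)
for name, gib in sizes_70b.items():
    print(f"{name:>15}: {gib:8.1f} GiB")
```

The same 12 bytes per parameter applied to 405B parameters gives roughly 4,500 GiB, which is why checkpoint size grows linearly with model size regardless of how small the deployed BF16 weights are.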

Checkpointing every 30 minutes, standard practice for fault tolerance, means 48 checkpoints per day. Over a month of continuous training:

782 GB × 48/day × 30 days = 1.13 PB written to storage per month

And here’s the part most teams miss. In this synchronous baseline, all 8 GPUs sit completely idle during every checkpoint write. Nothing overlaps with a checkpoint save: the training loop blocks until the last byte hits storage.

At $4.40/GPU/hour (representative of on-demand Blackwell GPU cloud pricing) and 5 GB/s shared storage throughput (typical for Lustre or GPFS over InfiniBand), we do the math to figure out the cost of idle GPUs during these waiting periods:

Write time per checkpoint: 782 GB / 5 GB/s = 156.4 seconds (~2.6 minutes)
Total wait time per month: 156.4s × 48/day × 30 days = 225,216 seconds = 62.6 hours
Cost of idle GPUs during these waiting periods: 62.6 hours × 8 GPUs × $4.40/GPU/hour = $2,203/month

The wait time for synchronous checkpoints adds up to over $2,200/month—before counting storage fees.

Scale that to a 64-GPU cluster, and the monthly cost jumps to over $17,500. At 128 GPUs, training a 405B model, idle costs exceed $200,000/month. The idle-GPU cost dominates storage fees by an order of magnitude.
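The idle-cost figures above follow from one formula: write time per checkpoint, times checkpoints per month, times the hourly price of every GPU that waits. The sketch below uses the post's stated inputs ($4.40/GPU/hour, 5 GB/s storage, 48 checkpoints per day, 30 days) and the 4,529 GB checkpoint size the post lists for a 405B model; the function name is illustrative.

```python
def idle_cost_per_month(ckpt_gb, storage_gbps, n_gpus,
                        usd_per_gpu_hour=4.40, ckpts_per_day=48, days=30):
    """Cost of GPUs blocked on synchronous checkpoint writes over one month."""
    write_s = ckpt_gb / storage_gbps                   # seconds per checkpoint
    idle_hours = write_s * ckpts_per_day * days / 3600
    return idle_hours * n_gpus * usd_per_gpu_hour

scenarios = [
    ("70B on 8 GPUs", 782, 8),
    ("70B on 64 GPUs", 782, 64),
    ("405B on 128 GPUs", 4529, 128),
]
for label, ckpt_gb, n_gpus in scenarios:
    cost = idle_cost_per_month(ckpt_gb, storage_gbps=5.0, n_gpus=n_gpus)
    print(f"{label:>18}: ${cost:,.0f}/month")
```

Because the cost is linear in both checkpoint size and GPU count, moving from 70B on 8 GPUs to 405B on 128 GPUs multiplies it by roughly 5.8 × 16.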

Asynchronous checkpointing eases part of the problem. However, framework support is still maturing, adoption is limited, and managing memory watermarks remains a challenge. Checkpoint compression is a complementary technique that is easy to adopt, and it has the added benefit of reducing cold-start time when state is restored, since cold start is serial in nature.

NVIDIA nvCOMP introduces GPU-accelerated compression

The core idea is simple. Compress the checkpoint while tensors are still on GPU, before writing a smaller payload to storage. The simple Python example below stages the compressed bytes through CPU for compatibility with `torch.save`; production GDS integrations can keep the compressed buffer on GPU and write it directly to NVMe.

NVIDIA nvCOMP is a GPU-accelerated lossless compression library that does exactly this. By providing a single library with support for both standard algorithms, like Zstandard (ZSTD), and highly optimized, GPU-specific formats, like gANS, it tackles data bottlenecks natively on the device. Developers can easily integrate high-throughput compression directly into Python workflows such as PyTorch-based checkpointing.

Measured checkpoint compression ratios

We fine-tuned two model architectures (dense transformer and mixture of experts) for 50 steps, saved full training checkpoints (weights + AdamW optimizer state + gradients), and compressed every component with nvCOMP on NVIDIA H200 and Blackwell GPUs. The compression ratios depend on data, not hardware—they’re identical across all GPUs. For 2-byte tensors (BF16/FP16), gANS must be invoked with `data_type='<f2'` so it entropy-codes at the correct symbol width; without this hint, gANS leaves ~15% of its achievable BF16 ratio on the table.

ZSTD is a widely adopted general-purpose compression algorithm developed by Meta that balances strong compression ratios with reasonable speed. It’s the same algorithm behind the `zstd` command-line tool on Linux and is used extensively in databases, file systems, and data pipelines. Asymmetric numeral systems (ANS) is a modern entropy coding technique that nvCOMP implements as a GPU-native codec (gANS) optimized for raw throughput. Both are lossless and exploit statistical patterns in the distribution of 1-byte or 2-byte symbols (entropy coding) rather than just matching repeated byte sequences.

ZSTD compresses at ~16 GB/s on Blackwell GPUs. With the `data_type='<f2'` hint applied, gANS exceeds ZSTD’s ratio on BF16 components and runs at roughly 30× the compression throughput (~530 GB/s on BF16).

Checkpoint Component	% of Ckpt	ZSTD	gANS	Why
BF16 model weights	17%	1.27-1.28×	1.46-1.48×	High-entropy trained floats
FP32 optimizer momentum	33%	1.07×	1.06× (byte mode)	gANS lacks native FP32 symbol mode
FP32 optimizer variance	33%	1.11×	1.10× (byte mode)	Non-negative, small values
BF16 gradients	17%	1.24-1.30×	1.42-1.45×	Architecture-dependent sparsity
Full checkpoint (dense)	100%	~1.14×	~1.18×	FP32 byte-mode dominates the weighted mean
Full checkpoint (MoE, estimated)	100%	~1.15×	~1.18×	MoE FP32 components OOM at scale; modeled from dense FP32 ratios

Table 2. Checkpoint compression ratios for dense and MoE models
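The full-checkpoint rows in Table 2 are not a simple average of the component ratios. The overall ratio is total bytes divided by total compressed bytes, so the large FP32 components pull it down. The sketch below makes that explicit. It assumes byte shares of 2:4:4:2 out of 12 bytes per parameter (matching the rounded 17%/33%/33%/17% column) and uses the midpoints of the reported ranges, which is our simplification rather than the authors' exact method.

```python
def full_checkpoint_ratio(components):
    """components: iterable of (share_of_bytes, compression_ratio) pairs."""
    total = sum(share for share, _ in components)
    compressed = sum(share / ratio for share, ratio in components)
    return total / compressed

# (share of checkpoint bytes, ratio) for weights, momentum, variance, gradients
zstd = [(2 / 12, 1.275), (4 / 12, 1.07), (4 / 12, 1.11), (2 / 12, 1.27)]
gans = [(2 / 12, 1.47), (4 / 12, 1.06), (4 / 12, 1.10), (2 / 12, 1.435)]

zstd_ratio = full_checkpoint_ratio(zstd)
gans_ratio = full_checkpoint_ratio(gans)
print(f"ZSTD full checkpoint: {zstd_ratio:.2f}x")
print(f"gANS full checkpoint: {gans_ratio:.2f}x")
```

Two thirds of the bytes compress at only about 1.06-1.11×, so even a 1.47× ratio on BF16 weights moves the overall figure only to the high 1.1× range.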

The ranges reflect the key finding that compression depends on model architecture, not hardware. Not all compression algorithms work on floating-point tensors. Byte-level codecs like LZ4 and Bitcomp look for repeated byte sequences—like finding duplicate words in a document. But trained neural network parameters look essentially random at the byte level, so these codecs find almost nothing to compress (~1.00× on dense checkpoints in our benchmarks).

ZSTD and ANS use entropy coding, which exploits statistical patterns in how frequently certain byte values occur, even when no exact sequences repeat. This is why they achieve 1.14-1.18x on full mixed-precision checkpoints, where byte-level codecs achieve nothing. Between the two, gANS (with the `data_type='<f2'` hint) delivers stronger ratios than ZSTD on every BF16 component, at ~30x the compression throughput. This trade-off matters when storage gets faster, as we’ll show.
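To see why entropy coding finds something to compress where sequence matching does not, it helps to look at the byte statistics of BF16-like values. The sketch below is illustrative only: it uses synthetic Gaussian values (standard deviation 0.02, an arbitrary choice) rather than real trained weights, truncates FP32 to its top 16 bits as a stand-in for BF16, and measures Shannon entropy per byte position. It does not call nvCOMP.

```python
import math
import random
import struct
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(0)
high_bytes, low_bytes = [], []
for _ in range(50_000):
    raw = struct.pack(">f", rng.gauss(0.0, 0.02))  # big-endian FP32
    high_bytes.append(raw[0])  # sign bit + upper 7 exponent bits
    low_bytes.append(raw[1])   # lowest exponent bit + 7 mantissa bits

h_high = entropy_bits(high_bytes)
h_low = entropy_bits(low_bytes)
print(f"high byte: {h_high:.2f} bits, low byte: {h_low:.2f} bits (8.00 = incompressible)")
```

The mantissa-heavy low byte is close to uniform, so it carries almost no redundancy, while the sign-and-exponent byte takes only a handful of values. An entropy coder can shorten the codes for those frequent values even though no byte sequence ever repeats.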

Dense transformers (Llama, GPT, Qwen): All parameters participate in every forward pass. ~0% exact zeros → ~1.18× gANS, ~1.14× ZSTD.
MoE models (Mixtral, DeepSeek, OLMoE): Only a subset of experts activate per token. ~3-4% exact zeros (measured on OLMoE-1B-7B) → ~1.18× gANS, ~1.15× ZSTD.

Our benchmarks use BF16 weights and FP32 optimizer state (AdamW), which is standard for most large-scale training today. Teams using FP8 training (e.g., with NVIDIA Transformer Engine) will see lower compression ratios, as reduced-precision data carries higher entropy and less statistical redundancy for lossless compression to exploit. At FP4 (NVFP4), quantization already removes most redundancy—lossless compression provides negligible additional benefit. The optimizer state, however, remains in FP32 regardless of weight precision, and that’s where the bulk of checkpoint size and compression savings come from.

The math: How nvCOMP saves money

Applying our measured gANS 1.18x ratio to a 70B dense checkpoint:

Without nvCOMP: Write 782 GB to disk at 5 GB/s → 156 seconds of GPU wait
With nvCOMP gANS (1.18x): Compress 782 GB at ~530 GB/s (~1.5s), write 663 GB at 5 GB/s (133s) → ~133 seconds of GPU wait

Why does the ~1.5-second compression time disappear? Because compression and storage writes can be pipelined: while one chunk writes to disk, the next chunk compresses on the GPU. As long as the codec compresses faster than storage can absorb the output, the compression step is fully hidden behind the write, and the GPU wait equals the write time alone. At 5 GB/s shared storage, gANS at ~530 GB/s is ~106× faster than the write, so compression overlaps completely. The wait drops from 156s to 133s: 15% smaller files, 23 seconds faster per checkpoint. Over a month: 23s × 48/day × 30 days = 33,120 fewer seconds of idle time, or 9+ hours reclaimed. MoE checkpoints compress at a near-identical ratio but are far larger, so they reclaim more in absolute terms.
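The overlap described above can be sketched with a single background writer thread. This is a CPU-only illustration of the pipelining pattern, not the authors' implementation: `zlib` stands in for the GPU codec, an in-memory buffer stands in for storage, and the names `pipelined_save` and `load_frames` plus the length-prefixed frame layout are our assumptions.

```python
import io
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

def _write_frame(sink, blob: bytes) -> None:
    # Length-prefixed frame so the reader knows where each chunk ends.
    sink.write(struct.pack("<Q", len(blob)))
    sink.write(blob)

def pipelined_save(data: bytes, sink, chunk_size: int = 1 << 20) -> None:
    """Compress chunk i+1 on this thread while chunk i is written in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for off in range(0, len(data), chunk_size):
            blob = zlib.compress(data[off:off + chunk_size])  # stand-in for the GPU codec
            if pending is not None:
                pending.result()  # wait for the previous write; re-raises its errors
            pending = pool.submit(_write_frame, sink, blob)
        if pending is not None:
            pending.result()

def load_frames(source) -> bytes:
    """Read length-prefixed frames back and decompress them in order."""
    chunks = []
    while header := source.read(8):
        (size,) = struct.unpack("<Q", header)
        chunks.append(zlib.decompress(source.read(size)))
    return b"".join(chunks)

payload = bytes(range(256)) * 20_000  # about 5 MB of sample data
buffer = io.BytesIO()
pipelined_save(payload, buffer)
buffer.seek(0)
restored = load_frames(buffer)
print(restored == payload)
```

Only one write is in flight at a time, which keeps frames in order and caps staging memory at two compressed chunks. When compression is faster than the write, the loop spends its time in `pending.result()`, which is the "wait equals write time" case from the text.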

When storage gets faster, codec throughput matters

Your storage speed depends on infrastructure: 2-10 GB/s for shared network filesystems (Lustre, GPFS, NFS), or 15-50+ GB/s for GPUDirect Storage (GDS) with local NVMe.

At faster storage, codec throughput determines whether compression helps or hurts:

Storage speed	No compression	ZSTD (1.14×, ~16 GB/s)	gANS (1.18×, ~530 GB/s)	Winner
5 GB/s	156s	137s (−12%)	133s (−15%)	gANS
15 GB/s	52s	49s (−6%)	44s (−15%)	gANS
25 GB/s	31s	49s (+58%) ⚠️	27s (−13%)	gANS

Table 3. Storage speed and codec throughput

To illustrate, we compare checkpoint write times for a 70B dense model (782 GB) on an NVIDIA Blackwell GPU across three storage speeds. Because compression and writing are pipelined (one chunk compresses while the previous chunk writes to disk), the total GPU wait time equals whichever stage is slower: pipelined wait = max(compress time, write time).

In Table 3, ZSTD’s ~16 GB/s compression rate on a Blackwell GPU becomes the bottleneck once storage is fast enough: at 25 GB/s, the 49-second compression step takes longer than the 31-second uncompressed write, so wait time increases. gANS at ~530 GB/s never hits this wall. With the `data_type='<f2'` hint applied, gANS is the right default whenever decompression can stay on the GPU; ZSTD remains useful when decompression must run on CPU or downstream consumers expect ZSTD-compatible files.
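The entries in Table 3 follow directly from the max(compress time, write time) rule. The sketch below recomputes them from the post's inputs (782 GB checkpoint, ZSTD at 1.14× and ~16 GB/s, gANS at 1.18× and ~530 GB/s); the function name is illustrative, and the model ignores any pipeline fill and drain time.

```python
def pipelined_wait_s(size_gb, storage_gbps, ratio=1.0, codec_gbps=None):
    """GPU wait for one checkpoint: the slower of the compress and write stages."""
    write_s = size_gb / ratio / storage_gbps
    if codec_gbps is None:  # no compression: the wait is the plain write time
        return write_s
    return max(size_gb / codec_gbps, write_s)

CKPT_GB = 782
for storage in (5, 15, 25):
    base = pipelined_wait_s(CKPT_GB, storage)
    zstd = pipelined_wait_s(CKPT_GB, storage, ratio=1.14, codec_gbps=16)
    gans = pipelined_wait_s(CKPT_GB, storage, ratio=1.18, codec_gbps=530)
    print(f"{storage:>2} GB/s  none {base:5.0f}s  "
          f"ZSTD {zstd:5.0f}s ({zstd / base - 1:+.0%})  "
          f"gANS {gans:5.0f}s ({gans / base - 1:+.0%})")
```

ZSTD's compress stage takes about 49 seconds no matter how fast storage is, so its wait time stops improving once the write stage drops below that, while gANS keeps tracking the write time.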

Modeled monthly savings combine the reduced GPU wait with storage savings at $0.14/GB/month, assuming 96 retained checkpoints and 5 GB/s shared storage:

Dense models (measured on Qwen2.5-3B):

Model	GPUs	Ckpt	ZSTD (1.14×)	gANS (1.18×)
Llama 3 8B	8× Blackwell GPU	89 GB	78 GB / ~$180/mo	75 GB / ~$220/mo
Llama 3 70B	8× Blackwell GPU	782 GB	686 GB / ~$1,500/mo	663 GB / ~$1,900/mo
Llama 3 70B	64× Blackwell GPU	782 GB	686 GB / ~$3,400/mo	663 GB / ~$4,200/mo
Llama 3 405B	128× Blackwell GPU	4,529 GB	3,974 GB / ~$33,000/mo	3,838 GB / ~$40,000/mo

Table 4. Modeled monthly savings with nvCOMP checkpoint compression for dense models
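The dollar figures in Table 4 can be approximated by adding two terms: the value of the GPU wait time removed, and the storage fees avoided on retained checkpoints. The sketch below is our reading of the stated assumptions ($4.40/GPU/hour, $0.14/GB/month, 96 retained checkpoints, 5 GB/s storage, 48 checkpoints per day for 30 days), not the authors' exact model, so results land within a few percent of the table rather than matching it digit for digit.

```python
def monthly_savings_usd(ckpt_gb, n_gpus, ratio, codec_gbps,
                        storage_gbps=5.0, usd_per_gpu_hour=4.40,
                        usd_per_gb_month=0.14, retained=96,
                        ckpts_per_day=48, days=30):
    """Modeled monthly savings: reclaimed GPU idle time plus avoided storage fees."""
    compressed_gb = ckpt_gb / ratio
    base_wait_s = ckpt_gb / storage_gbps
    # Pipelined wait is the slower of the compress and write stages.
    new_wait_s = max(ckpt_gb / codec_gbps, compressed_gb / storage_gbps)
    idle_hours_saved = (base_wait_s - new_wait_s) * ckpts_per_day * days / 3600
    idle_usd = idle_hours_saved * n_gpus * usd_per_gpu_hour
    storage_usd = (ckpt_gb - compressed_gb) * retained * usd_per_gb_month
    return idle_usd + storage_usd

rows = [("Llama 3 70B", 782, 8), ("Llama 3 70B", 782, 64), ("Llama 3 405B", 4529, 128)]
for model, ckpt_gb, n_gpus in rows:
    zstd = monthly_savings_usd(ckpt_gb, n_gpus, ratio=1.14, codec_gbps=16)
    gans = monthly_savings_usd(ckpt_gb, n_gpus, ratio=1.18, codec_gbps=530)
    print(f"{model:>13} x{n_gpus:<3}  ZSTD ${zstd:>8,.0f}/mo   gANS ${gans:>8,.0f}/mo")
```

On 8 GPUs the storage term dominates; at 128 GPUs the idle-time term is roughly three times larger than the storage term, which is why the savings grow faster than checkpoint size alone.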

MoE models (measured on OLMoE-1B-7B; FP32 optimizer extrapolated from dense):

Model	GPUs	Ckpt	ZSTD (~1.15×)	gANS (~1.18×)
Mixtral 8x22B (141B)	64× Blackwell GPU	1,575 GB	1,370 GB / ~$7,400/mo	1,335 GB / ~$8,600/mo
DeepSeek-V3 (671B)	256× Blackwell GPU	7,490 GB	6,513 GB / ~$101,000/mo	6,348 GB / ~$118,000/mo

Table 5. Modeled monthly savings with nvCOMP checkpoint compression for MoE models

The savings scale with model size (bigger checkpoints = more to compress) and GPU count (more GPUs idle during waits = higher cost per second of wait time). The second factor is what makes this particularly brutal at scale—256 idle Blackwell GPUs cost $1,126/hour. Every second you shave off a checkpoint write saves money.

The industry is moving toward MoE architectures (like DeepSeek-V3, Mixtral, and Grok), which produce larger checkpoints. Compression isn’t a nice-to-have anymore, and adding a few lines of Python code can save costs.

Integration: ~30 Lines of Python

The integration effort is minimal. Install the CUDA-versioned nvCOMP and CuPy wheels:

# CUDA 13.x (Blackwell):
pip install nvidia-nvcomp-cu13 cupy-cuda13x

# CUDA 12.x (Hopper):
pip install nvidia-nvcomp-cu12 cupy-cuda12x

Then add a small wrapper around your checkpoint state:

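import torch, cupy as cp
from nvidia import nvcomp

# ALG selects the nvCOMP codec; TAG marks dict entries that hold a compressed tensor.
ALG, TAG = "ANS", "__nvcomp__"

def _codec(item_size):
    # For 2-byte tensors (BF16/FP16), pass the '<f2' hint so gANS codes 16-bit symbols.
    opts = {"data_type": "<f2"} if ALG == "ANS" and item_size == 2 else {}
    return nvcomp.Codec(algorithm=ALG, **opts)

def _walk(x, leaf):
    # Recursively apply `leaf` to every non-container value in nested dicts, lists, and tuples.
    if isinstance(x, dict):
        return {k: _walk(v, leaf) for k, v in x.items()}
    if isinstance(x, (list, tuple)):
        return type(x)(_walk(v, leaf) for v in x)
    return leaf(x)

def _pack(t):
    # Leave CPU tensors and non-tensor values (ints, strings, ...) unchanged.
    if not (isinstance(t, torch.Tensor) and t.is_cuda):
        return t
    item_size, y = t.element_size(), t.contiguous()
    # Reinterpret the flattened tensor as raw 16-bit or 8-bit integers for the codec.
    raw = y.reshape(-1).view(torch.int16 if item_size == 2 else torch.uint8)
    comp = _codec(item_size).encode(nvcomp.as_array(cp.from_dlpack(raw)))
    # Stage the compressed bytes through CPU so torch.save can serialize them,
    # and keep the metadata needed to rebuild the tensor on load.
    return {
        TAG: True,
        "data": bytes(comp.cpu()),
        "shape": list(y.shape),
        "dtype": str(y.dtype),
        "item_size": item_size,
    }

def _unpack(x):
    # Pass through anything that is not a compressed-tensor record.
    if not (isinstance(x, dict) and x.get(TAG)):
        return x
    dec = _codec(x["item_size"]).decode(x["data"])
    # Restore the original dtype and shape from the saved metadata.
    dtype = getattr(torch, x["dtype"].replace("torch.", ""))
    return torch.from_dlpack(cp.asarray(dec)).view(dtype).reshape(x["shape"])

def save_compressed_checkpoint(state, path):
    torch.save(_walk(state, _pack), path)

def load_compressed_checkpoint(path):
    state = torch.load(path, map_location="cpu", weights_only=True)
    return _walk(state, _unpack)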

Use it with the same checkpoint state you already save:

state = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "step": step,
    # Include the same gradients, RNG/scaler state, and distributed metadata
    # your normal checkpoint path requires.
}
save_compressed_checkpoint(state, "ckpt.pt")
state = load_compressed_checkpoint("ckpt.pt")

Swap `torch.save(state, path)` for `save_compressed_checkpoint(state, path)`, and `torch.load(path)` for `load_compressed_checkpoint(path)`.

That’s it. No changes to your model code or optimizer configuration; wrap the same checkpoint state your training loop already writes.

If you’re using a training framework with custom checkpoint hooks (DeepSpeed, Megatron), the same pattern applies. Walk the state dict, compress GPU tensors, and serialize.

For teams using NVIDIA GPUDirect Storage (GDS), there’s a lower-overhead production path. nvCOMP can write compressed output into GPU buffers that GDS transfers to NVMe, avoiding the CPU staging used by the simple `torch.save` example.

NVIDIA Blackwell Decompression Engine

NVIDIA Blackwell GPUs include a dedicated Blackwell Decompression Engine (DE) that decompresses LZ4, Snappy, and Deflate at up to 280 GB/s with zero SM overhead. However, the byte-level codecs achieve ~1.00x on floating-point tensors. For checkpoint compression, the entropy-based codecs (ANS and ZSTD) on SMs deliver the real savings. During restores, GPUs are idle and waiting for data before resuming training. SM availability isn’t a constraint, and ANS on SMs delivers comparable throughput (~265 GB/s on BF16 components) while producing significantly smaller files.

Get started

Checkpoint compression is one of the highest-ROI optimizations you can add to your training pipeline and one of the easiest:

Install nvCOMP and CuPy: pip install nvidia-nvcomp-cu13 cupy-cuda13x (Blackwell) or pip install nvidia-nvcomp-cu12 cupy-cuda12x (Hopper)
Try it now: Drop the save_compressed_checkpoint / load_compressed_checkpoint functions from this post into your training loop — no model or optimizer changes needed; wrap the same checkpoint state your training loop already writes.
Explore the code: Samples and benchmarks on GitHub
Read the docs: nvCOMP documentation
Go deeper: For GDS production paths, compressed GPU buffers can be transferred directly to NVMe, avoiding the CPU staging used by the simple example. For details, see:
- https://docs.nvidia.com/cuda/nvcomp/
- https://docs.nvidia.com/gpudirect-storage/api-reference-guide/

Your checkpoints are the largest files in your training pipeline. Compress them.


About the Authors

About Wenqi Glantz
Wenqi Glantz is a tech engagement lead on the NVIDIA Super AI Startups team. Wenqi serves as the technical liaison for AI labs adopting the NVIDIA full acceleration stack spanning GPU architectures, CUDA-X libraries, NeMo frameworks, training, and inference optimization. With over two decades of experience in software engineering, enterprise architecture, and Generative AI, Wenqi brings deep hands-on expertise to the intersection of high-performance infrastructure and AI.


About Eugene Zhidkov
Eugene is a compute devtech manager for emerging workloads at NVIDIA. His team brings speed-of-light acceleration to large CSP and enterprise workloads with advances in GPU compression and data processing. Before NVIDIA, Eugene helped Apple evolve their GPU HW and SW to strong-scale from phones to Ultra chips, focusing on professional applications and ML, co-designing new Metal features together with top-tier application ISVs. Eugene holds an MSc in Computer Science and Applied Math.


About Makan Taghavi
Makan Taghavi is a senior product manager at NVIDIA focusing on image processing and data compression technologies, working on nvCOMP, DALI, nvImageCodec, nvTIFF, NPP and nvJPEG libraries. Prior to joining NVIDIA, Makan led the Low Power AI and Audio/Voice AI software stack at Qualcomm, driving over 100 design wins and helping establish Qualcomm's Low Power AI software platform as one of the leaders in AI on the Edge.
