Training LLMs requires periodic checkpoints. These full snapshots of model weights, optimizer states, and gradients are saved to storage so training can resume after interruptions. At scale, these checkpoints become massive (782 GB for a 70B model) and frequent (every 15-30 minutes), generating one of the largest line items in a training budget. Most AI teams chase GPU utilization, training throughput, and model quality. Almost none look at what checkpointing is costing them.
This is an expensive oversight. The synchronous checkpoint overhead of a 405B model on 128 NVIDIA Blackwell GPUs alone can cost $200,000 a month. By introducing a lossless compression step implemented with about 30 lines of Python, we can reduce that overhead by $40,000 every month with gANS. Mixture of experts (MoE) models save more in absolute terms because their checkpoints and clusters are larger. We’ll break down how we got to that calculation and how NVIDIA nvCOMP can improve checkpointing efficiency in this blog post.
Inside a single checkpoint
Hardware interruptions at 1000+ GPU scale aren’t rare. Meta reported 419 unexpected interruptions across 54 days of Llama 3 training on 16,384 NVIDIA H100 GPUs (~one every 3 hours). This is why most teams checkpoint every 15-30 minutes; it’s load-bearing infrastructure, not optional overhead.

| Component | Size | Contents |
| --- | --- | --- |
| Model weights (BF16) | 130 GB | 70B params × 2 bytes |
| Optimizer state (FP32 momentum + variance) | 521 GB | 70B params × (4 + 4) bytes |
| Gradients (BF16) | 130 GB | Same shape as weights |
| Total per checkpoint | 782 GB | |
Sizes use binary units (GiB): 70B × 2 bytes = 140 GB in decimal ≈ 130 GiB. We label these “GB” by common convention throughout this post.
This breakdown surprises people who see it for the first time. The optimizer state (AdamW’s first and second moment estimates, both stored in FP32) is 4× the size of the model weights. It makes up the bulk of every checkpoint, which is why checkpoints are far larger than a deployed model.
Checkpointing every 30 minutes, standard practice for fault tolerance, means 48 checkpoints per day. Over a month of continuous training:
782 GB × 48/day × 30 days = 1.13 PB written to storage per month
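The size table and the monthly volume above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name and its parameters are illustrative, and the byte counts per parameter are the ones stated in the table (2 for BF16 weights, 4 + 4 for the FP32 AdamW moments, 2 for BF16 gradients). Note that the optimizer row comes out at about 521.5 GiB, which the table truncates to 521.

```python
GIB = 2**30  # the post labels binary gibibytes as "GB"


def checkpoint_sizes_gib(n_params, weight_bytes=2, optim_bytes=8, grad_bytes=2):
    """Per-component checkpoint size in GiB for a mixed-precision AdamW run."""
    sizes = {
        "weights": n_params * weight_bytes / GIB,
        "optimizer": n_params * optim_bytes / GIB,  # FP32 momentum + FP32 variance
        "gradients": n_params * grad_bytes / GIB,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


sizes = checkpoint_sizes_gib(70e9)
for name, gib in sizes.items():
    print(f"{name:>10}: {gib:7.1f} GiB")

# One checkpoint every 30 minutes, for 30 days, using the post's rounded 782 "GB".
checkpoints_per_day = 24 * 60 // 30
monthly_pb = 782 * checkpoints_per_day * 30 / 1e6
print(f"{checkpoints_per_day} checkpoints/day, {monthly_pb:.2f} PB written per month")
```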
And here’s the part most teams miss. In this synchronous baseline, nothing overlaps with a checkpoint save: the training loop blocks during each write, and all 8 GPUs sit idle until the last byte hits storage.
At $4.40/GPU/hour (representative of on-demand Blackwell GPU cloud pricing) and 5 GB/s shared storage throughput (typical for Lustre or GPFS over InfiniBand), we do the math to figure out the cost of idle GPUs during these waiting periods:
- Write time per checkpoint: 782 GB / 5 GB/s = 156.4 seconds (~2.6 minutes)
- Total wait time per month: 156.4s × 48/day × 30 days = 225,216 seconds = 62.6 hours
- Cost of idle GPUs during these waiting periods: 62.6 hours × 8 GPUs × $4.40/GPU/hour = $2,203/month
The wait time for synchronous checkpoints adds up to over $2,200/month—before counting storage fees.
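Scale that to a 64-GPU cluster, and the monthly cost jumps to over $17,500. At 128 GPUs, training a 405B model, idle costs exceed $200,000/month. At that scale, idle-GPU cost is roughly three times the bill for retained checkpoints under the storage assumptions used later in this post ($0.14/GB/month, 96 retained checkpoints).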
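The idle-cost arithmetic above generalizes to any checkpoint size and cluster size. The sketch below is illustrative only: `monthly_idle_cost` is a hypothetical helper, its defaults are the assumptions stated in this section (5 GB/s storage, $4.40/GPU/hour, 48 checkpoints per day), and the 4,529 GB figure for the 405B model is taken from the savings table later in the post. Because it does not round the intermediate hours, the 8-GPU result lands a dollar or two away from the hand calculation.

```python
def monthly_idle_cost(ckpt_gb, n_gpus, storage_gbps=5.0, usd_per_gpu_hour=4.40,
                      checkpoints_per_day=48, days=30):
    """Cost of GPUs blocked on synchronous checkpoint writes for one month."""
    write_s = ckpt_gb / storage_gbps                       # seconds per checkpoint
    idle_hours = write_s * checkpoints_per_day * days / 3600
    return idle_hours * n_gpus * usd_per_gpu_hour


scenarios = [
    ("70B on 8 GPUs", 782, 8),
    ("70B on 64 GPUs", 782, 64),
    ("405B on 128 GPUs", 4529, 128),
]
for label, ckpt_gb, n_gpus in scenarios:
    print(f"{label:>17}: ${monthly_idle_cost(ckpt_gb, n_gpus):,.0f}/month")
```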
Asynchronous checkpointing eases part of the problem. However, framework support is still maturing and not yet widely adopted, and managing memory watermarks remains a challenge. Checkpoint compression is a complementary technique that is easy to adopt, with the added benefit of reducing cold-start time on restore, since a cold start is serial in nature.
NVIDIA nvCOMP introduces GPU-accelerated compression
The core idea is simple. Compress the checkpoint while tensors are still on GPU, before writing a smaller payload to storage. The simple Python example below stages the compressed bytes through CPU for compatibility with `torch.save`; production GDS integrations can keep the compressed buffer on GPU and write it directly to NVMe.
NVIDIA nvCOMP is a GPU-accelerated lossless compression library that does exactly this. By providing a single library with support for both standard algorithms, like Zstandard (ZSTD), and highly optimized, GPU-specific formats, like gANS, it tackles data bottlenecks natively on the device. Developers can easily integrate high-throughput compression directly into Python workflows such as PyTorch-based checkpointing.
Measured checkpoint compression ratios
We fine-tuned two model architectures (dense transformer and mixture of experts) for 50 steps, saved full training checkpoints (weights + AdamW optimizer state + gradients), and compressed every component with nvCOMP on NVIDIA H200 and Blackwell GPUs. The compression ratios depend on data, not hardware—they’re identical across all GPUs. For 2-byte tensors (BF16/FP16), gANS must be invoked with `data_type='<f2'` so it entropy-codes at the correct symbol width; without this hint, gANS leaves ~15% of its achievable BF16 ratio on the table.
ZSTD is a widely adopted general-purpose compression algorithm developed by Meta that balances strong compression ratios with reasonable speed. It’s the same algorithm behind the `zstd` command-line tool on Linux and is used extensively in databases, file systems, and data pipelines. Asymmetric numeral systems (ANS) is a modern entropy coding technique that nvCOMP implements as a GPU-native codec (gANS) optimized for raw throughput. Both are lossless and exploit statistical patterns in the distribution of 1-byte or 2-byte symbols (entropy coding) rather than just matching repeated byte sequences.
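To see why symbol statistics matter for floating-point data, it helps to measure the entropy of each byte of a BF16 value separately. The sketch below is an illustration under loud assumptions: it uses synthetic Gaussian values (standard deviation 0.02) as a stand-in for trained weights, and it forms BF16 by truncating FP32 to its top two bytes. It does not use nvCOMP and says nothing about the ratios any real codec achieves.

```python
import math
import random
import struct
from collections import Counter


def byte_entropy(values):
    """Shannon entropy, in bits per byte, of a sequence of byte values."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


random.seed(0)
high_bytes, low_bytes = [], []
for _ in range(50_000):
    w = random.gauss(0.0, 0.02)     # synthetic stand-in for a trained weight
    b = struct.pack(">f", w)        # big-endian FP32; BF16 keeps the first two bytes
    high_bytes.append(b[0])         # sign bit + top 7 exponent bits
    low_bytes.append(b[1])          # last exponent bit + 7 mantissa bits

h_high = byte_entropy(high_bytes)
h_low = byte_entropy(low_bytes)
print(f"high byte: {h_high:.2f} bits, low byte: {h_low:.2f} bits (max 8.00)")
```

The exponent-carrying byte takes only a handful of distinct values, so an entropy coder can spend far fewer than 8 bits on it, while the mantissa byte is close to incompressible. No byte sequence needs to repeat for this gain to exist.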
ZSTD compresses at ~16 GB/s on Blackwell GPUs. With the `data_type='<f2'` hint applied, gANS exceeds ZSTD’s ratio on BF16 components and runs at roughly 10× the compression throughput (~190-215 GB/s on BF16).

| Checkpoint Component | % of Ckpt | ZSTD | gANS | Why |
| --- | --- | --- | --- | --- |
| BF16 model weights | 17% | 1.27-1.28× | 1.46-1.48× | High-entropy trained floats |
| FP32 optimizer momentum | 33% | 1.07× | 1.06× (byte mode) | gANS lacks native FP32 symbol mode |
| FP32 optimizer variance | 33% | 1.11× | 1.10× (byte mode) | Non-negative, small values |
| BF16 gradients | 17% | 1.24-1.30× | 1.42-1.45× | Architecture-dependent sparsity |
| Full checkpoint (dense) | 100% | ~1.14× | ~1.18× | FP32 byte-mode dominates the weighted mean |
| Full checkpoint (MoE, estimated) | 100% | ~1.15× | ~1.18× | MoE FP32 components OOM at scale; modeled from dense FP32 ratios |
The ranges reflect the key finding that compression depends on model architecture, not hardware. Not all compression algorithms work on floating-point tensors. Byte-level codecs like LZ4 and Bitcomp look for repeated byte sequences—like finding duplicate words in a document. But trained neural network parameters look essentially random at the byte level, so these codecs find almost nothing to compress (~1.00× on dense checkpoints in our benchmarks).
ZSTD and ANS use entropy coding, which exploits statistical patterns in how frequently certain byte values occur, even when no exact sequences repeat. This is why they achieve 1.14-1.18× on full mixed-precision checkpoints, where byte-level codecs achieve nothing. Between the two, gANS (with the `data_type='<f2'` hint) delivers stronger ratios than ZSTD on every BF16 component, at ~10× the compression throughput. This throughput gap matters when storage gets faster, as we’ll show.
- Dense transformers (Llama, GPT, Qwen): All parameters participate in every forward pass. ~0% exact zeros → ~1.18× ANS, ~1.14× ZSTD.
- MoE models (Mixtral, DeepSeek, OLMoE): Only a subset of experts activate per token. ~3-4% exact zeros (measured on OLMoE-1B-7B) → ~1.18× ANS, ~1.15× ZSTD.
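The full-checkpoint ratios follow from the per-component ratios as a size-weighted harmonic mean: total original bytes divided by total compressed bytes. The sketch below shows that calculation for the dense case. It assumes the midpoint of each range in the table and weights components by bytes per parameter (2, 4, 4, 2); both choices are assumptions made here for illustration.

```python
def full_checkpoint_ratio(components):
    """components: iterable of (bytes_per_param, compression_ratio) pairs."""
    original = sum(size for size, _ in components)
    compressed = sum(size / ratio for size, ratio in components)
    return original / compressed


# (bytes per parameter, ratio): weights, momentum, variance, gradients.
# Ratios are midpoints of the measured ranges for dense checkpoints.
gans_dense = [(2, 1.47), (4, 1.06), (4, 1.10), (2, 1.435)]
zstd_dense = [(2, 1.275), (4, 1.07), (4, 1.11), (2, 1.27)]

gans_ratio = full_checkpoint_ratio(gans_dense)
zstd_ratio = full_checkpoint_ratio(zstd_dense)
print(f"gANS: {gans_ratio:.2f}x, ZSTD: {zstd_ratio:.2f}x")
```

Because two thirds of the bytes are FP32 optimizer state compressing at only 1.06-1.11×, the blended ratio sits much closer to those values than to the BF16 ratios.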
Our benchmarks use BF16 weights and FP32 optimizer state (AdamW), which is standard for most large-scale training today. Teams using FP8 training (e.g., with NVIDIA Transformer Engine) will see lower compression ratios, as reduced-precision data carries higher entropy and less statistical redundancy for lossless compression to exploit. At FP4 (NVFP4), quantization already removes most redundancy—lossless compression provides negligible additional benefit. The optimizer state, however, remains in FP32 regardless of weight precision, and that’s where the bulk of checkpoint size and compression savings come from.
The math: How nvCOMP saves money
Applying our measured gANS 1.18x ratio to a 70B dense checkpoint:
- Without nvCOMP: Write 782 GB to disk at 5 GB/s → 156 seconds of GPU wait
- With nvCOMP gANS (1.18x): Compress 782 GB at ~200 GB/s (~4s), write 663 GB at 5 GB/s (133s) → ~133 seconds of GPU wait
Why does the 4-second compression time disappear? Because compression and storage writes can be pipelined: while one chunk writes to disk, the next chunk compresses on the GPU. As long as the codec compresses faster than storage can absorb the output, the compression step is fully hidden behind the write, and the GPU wait equals the write time alone. At 5 GB/s shared storage, gANS at 200 GB/s is 40× faster than the write, so compression overlaps completely. The wait drops from 156s to 133s: 15% smaller files, 23 seconds faster per checkpoint. Over a month: 23s × 48/day × 30 days = 33,120 fewer seconds of idle time, or 9+ hours reclaimed. MoE checkpoints compress at a near-identical ratio but are far larger, so they reclaim more in absolute terms.
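The overlap described above is a standard producer/consumer pipeline. The sketch below illustrates the structure on the CPU only, and everything in it is a stand-in: `zlib` replaces the GPU codec, an in-memory `BytesIO` replaces storage, and the length-prefixed framing and function names are invented for this example. It is not how nvCOMP or GDS implement the pipeline.

```python
import io
import queue
import threading
import zlib


def pipelined_write(chunks, sink, depth=2):
    """Compress chunk i+1 on this thread while a writer thread stores chunk i."""
    pending = queue.Queue(maxsize=depth)  # bounded: caps compressed data in flight

    def writer():
        while True:
            blob = pending.get()
            if blob is None:              # sentinel: no more chunks
                return
            sink.write(len(blob).to_bytes(8, "little"))  # length prefix
            sink.write(blob)

    thread = threading.Thread(target=writer)
    thread.start()
    for chunk in chunks:
        pending.put(zlib.compress(chunk, 1))  # blocks only if the writer falls behind
    pending.put(None)
    thread.join()


def read_back(source):
    """Inverse of pipelined_write: read length-prefixed blobs and decompress."""
    out = []
    while header := source.read(8):
        size = int.from_bytes(header, "little")
        out.append(zlib.decompress(source.read(size)))
    return out


chunks = [bytes([i]) * 1_000_000 for i in range(8)]
sink = io.BytesIO()
pipelined_write(chunks, sink)
sink.seek(0)
restored = read_back(sink)
print("lossless round trip:", restored == chunks)
```

The bounded queue is the design choice that matters: when the compressor is faster than the sink, it simply waits on the queue, so total time tracks the slower stage instead of the sum of both.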
When storage gets faster, codec throughput matters
Your storage speed depends on infrastructure: 2-10 GB/s for shared network filesystems (Lustre, GPFS, NFS), or 15-50+ GB/s for GPUDirect Storage (GDS) with local NVMe.
At faster storage, codec throughput determines whether compression helps or hurts:

| Storage speed | No compression | ZSTD (1.14×, ~16 GB/s) | gANS (1.18×, ~200 GB/s) | Winner |
| --- | --- | --- | --- | --- |
| 5 GB/s | 156s | 137s (−12%) | 133s (−15%) | gANS |
| 15 GB/s | 52s | 49s (−6%) | 44s (−15%) | gANS |
| 25 GB/s | 31s | 49s (+58%) ⚠️ | 27s (−13%) | gANS |
To illustrate, we compare checkpoint write times for a 70B dense model (782 GB) on an NVIDIA Blackwell GPU across three storage speeds. Because compression and writing are pipelined (one chunk compresses while the previous chunk writes to disk), the total GPU wait time equals whichever stage is slower: pipelined wait = max(compress time, write time).
In Table 3, once storage outpaces ZSTD’s ~16 GB/s compression throughput on Blackwell GPUs, ZSTD’s compression step becomes the bottleneck and wait time increases. gANS at ~200 GB/s never hits this wall. With the `data_type='<f2'` hint applied, gANS is the right default whenever decompression can stay on the GPU; ZSTD remains useful when decompression must run on CPU or downstream consumers expect ZSTD-compatible files.
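The pipelined-wait rule can be written as a small model that regenerates the comparison above. This is a sketch of the stated formula only; `pipelined_wait_s` is a hypothetical helper, and the ratios and codec throughputs are the rounded figures from this post (1.14× at ~16 GB/s for ZSTD, 1.18× at ~200 GB/s for gANS).

```python
def pipelined_wait_s(ckpt_gb, storage_gbps, ratio=1.0, codec_gbps=None):
    """GPU wait per checkpoint: the slower of the compress and write stages."""
    write_s = ckpt_gb / ratio / storage_gbps
    if codec_gbps is None:                 # no compression: write time only
        return write_s
    compress_s = ckpt_gb / codec_gbps
    return max(compress_s, write_s)


CKPT_GB = 782
rows = []
for storage in (5, 15, 25):
    none = pipelined_wait_s(CKPT_GB, storage)
    zstd = pipelined_wait_s(CKPT_GB, storage, ratio=1.14, codec_gbps=16)
    gans = pipelined_wait_s(CKPT_GB, storage, ratio=1.18, codec_gbps=200)
    rows.append((storage, none, zstd, gans))
    print(f"{storage:>2} GB/s: none {none:6.1f}s  ZSTD {zstd:6.1f}s  gANS {gans:6.1f}s")
```

In this model ZSTD stops helping once storage throughput exceeds the codec's own 16 GB/s, because the compress stage then takes longer than an uncompressed write would.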
Modeled monthly savings combine the reduction in GPU wait cost with storage savings, assuming storage at $0.14/GB/month, 96 retained checkpoints, and 5 GB/s shared storage:

Dense models (measured on Qwen2.5-3B):
| Model | GPUs | Ckpt | ZSTD (1.14×) | gANS (1.18×) |
| --- | --- | --- | --- | --- |
| Llama 3 8B | 8× Blackwell GPU | 89 GB | 78 GB / ~$180/mo | 75 GB / ~$220/mo |
| Llama 3 70B | 8× Blackwell GPU | 782 GB | 686 GB / ~$1,500/mo | 663 GB / ~$1,900/mo |
| Llama 3 70B | 64× Blackwell GPU | 782 GB | 686 GB / ~$3,400/mo | 663 GB / ~$4,200/mo |
| Llama 3 405B | 128× Blackwell GPU | 4,529 GB | 3,974 GB / ~$33,000/mo | 3,838 GB / ~$40,000/mo |
MoE models (measured on OLMoE-1B-7B; FP32 optimizer extrapolated from dense):
| Model | GPUs | Ckpt | ZSTD (1.15×) | gANS (1.18×) |
| --- | --- | --- | --- | --- |
| Mixtral 8x22B (141B) | 64× Blackwell GPU | 1,575 GB | 1,370 GB / ~$7,400/mo | 1,335 GB / ~$8,600/mo |
| DeepSeek-V3 (671B) | 256× Blackwell GPU | 7,490 GB | 6,513 GB / ~$101,000/mo | 6,348 GB / ~$118,000/mo |
The savings scale with model size (bigger checkpoints = more to compress) and GPU count (more GPUs idle during waits = higher cost per second of wait time). The second factor is what makes this particularly brutal at scale—256 idle Blackwell GPUs cost $1,126/hour. Every second you shave off a checkpoint write saves money.
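The savings figures in the tables above appear consistent with a simple two-term model: idle-GPU cost avoided by shorter writes, plus storage cost avoided on the retained checkpoints. The sketch below is a reconstruction under that assumption, not the exact script behind the tables; `monthly_savings_usd` and its parameter names are hypothetical, and it assumes writes are storage-bound (true for both codecs at 5 GB/s).

```python
def monthly_savings_usd(ckpt_gb, n_gpus, ratio, storage_gbps=5.0,
                        usd_per_gpu_hour=4.40, checkpoints_per_month=48 * 30,
                        retained=96, usd_per_gb_month=0.14):
    """Estimated monthly savings from compressing checkpoints by `ratio`."""
    saved_gb = ckpt_gb * (1 - 1 / ratio)                   # bytes no longer written
    idle_hours_saved = saved_gb / storage_gbps * checkpoints_per_month / 3600
    gpu_savings = idle_hours_saved * n_gpus * usd_per_gpu_hour
    storage_savings = saved_gb * retained * usd_per_gb_month
    return gpu_savings + storage_savings


cases = [
    ("Llama 3 70B, 8 GPUs", 782, 8, 1.18),
    ("Llama 3 405B, 128 GPUs", 4529, 128, 1.18),
    ("DeepSeek-V3, 256 GPUs", 7490, 256, 1.18),
]
for label, ckpt_gb, n_gpus, ratio in cases:
    print(f"{label:>24}: ~${monthly_savings_usd(ckpt_gb, n_gpus, ratio):,.0f}/month")
```

Under these assumptions the storage term dominates on small clusters and the idle-GPU term dominates on large ones, which is why the same 782 GB checkpoint saves more on 64 GPUs than on 8.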
The industry is moving toward MoE architectures (like DeepSeek-V3, Mixtral, and Grok), which produce larger checkpoints. Compression isn’t a nice-to-have anymore, and adding a few lines of Python code can save costs.
Integration: ~30 Lines of Python
The integration effort is minimal. Install the CUDA-versioned nvCOMP and CuPy wheels:
```shell
# CUDA 13.x (Blackwell):
pip install nvidia-nvcomp-cu13 cupy-cuda13x
# CUDA 12.x (Hopper):
pip install nvidia-nvcomp-cu12 cupy-cuda12x
```
Then add a small wrapper around your checkpoint state:
```python
import torch, cupy as cp
from nvidia import nvcomp

ALG, TAG = "ANS", "__nvcomp__"


def _codec(item_size):
    # 2-byte tensors (BF16/FP16) need the '<f2' hint so gANS codes 16-bit symbols.
    opts = {"data_type": "<f2"} if ALG == "ANS" and item_size == 2 else {}
    return nvcomp.Codec(algorithm=ALG, **opts)


def _walk(x, leaf):
    # Recurse through containers, but pass packed records to `leaf` whole;
    # otherwise _unpack would never see the tagged dict on load.
    if isinstance(x, dict) and TAG not in x:
        return {k: _walk(v, leaf) for k, v in x.items()}
    if isinstance(x, (list, tuple)):
        return type(x)(_walk(v, leaf) for v in x)
    return leaf(x)


def _pack(t):
    # Leave non-tensors and CPU tensors untouched.
    if not (isinstance(t, torch.Tensor) and t.is_cuda):
        return t
    item_size, y = t.element_size(), t.contiguous()
    # Reinterpret as raw integers so any float dtype can cross DLPack.
    raw = y.reshape(-1).view(torch.int16 if item_size == 2 else torch.uint8)
    comp = _codec(item_size).encode(nvcomp.as_array(cp.from_dlpack(raw)))
    return {
        TAG: True,
        "data": bytes(comp.cpu()),  # CPU staging for torch.save
        "shape": list(y.shape),
        "dtype": str(y.dtype),
        "item_size": item_size,
    }


def _unpack(x):
    if not (isinstance(x, dict) and x.get(TAG)):
        return x
    dec = _codec(x["item_size"]).decode(x["data"])
    dtype = getattr(torch, x["dtype"].replace("torch.", ""))
    return torch.from_dlpack(cp.asarray(dec)).view(dtype).reshape(x["shape"])


def save_compressed_checkpoint(state, path):
    torch.save(_walk(state, _pack), path)


def load_compressed_checkpoint(path):
    state = torch.load(path, map_location="cpu", weights_only=True)
    return _walk(state, _unpack)
```
Use it with the same checkpoint state you already save:
```python
state = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "step": step,
    # Include the same gradients, RNG/scaler state, and distributed metadata
    # your normal checkpoint path requires.
}
save_compressed_checkpoint(state, "ckpt.pt")
state = load_compressed_checkpoint("ckpt.pt")
```
Swap `torch.save(state, path)` for `save_compressed_checkpoint(state, path)`, and `torch.load(path)` for `load_compressed_checkpoint(path)`.
That’s it. No changes to your model code or optimizer configuration; wrap the same checkpoint state your training loop already writes.
If you’re using a training framework with custom checkpoint hooks (DeepSpeed, Megatron), the same pattern applies. Walk the state dict, compress GPU tensors, and serialize.
For teams using NVIDIA GPUDirect Storage (GDS), there’s a lower-overhead production path. nvCOMP can write compressed output into GPU buffers that GDS transfers to NVMe, avoiding the CPU staging used by the simple `torch.save` example.
NVIDIA Blackwell Decompression Engine
NVIDIA Blackwell GPUs include a dedicated Blackwell Decompression Engine (DE) that decompresses LZ4, Snappy, and Deflate at up to 280 GB/s with zero SM overhead. However, the byte-level codecs achieve ~1.00x on floating-point tensors. For checkpoint compression, the entropy-based codecs (ANS and ZSTD) on SMs deliver the real savings. During restores, GPUs are idle and waiting for data before resuming training. SM availability isn’t a constraint, and ANS on SMs delivers comparable throughput (~290 GB/s on BF16 components) while producing significantly smaller files.
Get started
Checkpoint compression is one of the highest-ROI optimizations you can add to your training pipeline and one of the easiest:
- Install nvCOMP: `pip install nvidia-nvcomp-cu13` (Blackwell) or `pip install nvidia-nvcomp-cu12` (Hopper)
- Try it now: Drop the `save_compressed_checkpoint`/`load_compressed_checkpoint` functions from this post into your training loop. No model or optimizer changes are needed; wrap the same checkpoint state your training loop already writes.
- Explore the code: Samples and benchmarks on GitHub
- Read the docs: nvCOMP documentation
- Go deeper: For GDS production paths, compressed GPU buffers can be transferred directly to NVMe, avoiding the CPU staging used by the simple example. For details, see:
  - https://docs.nvidia.com/cuda/nvcomp/
  - https://docs.nvidia.com/gpudirect-storage/api-reference-guide/
Your checkpoints are the largest files in your training pipeline. Compress them.