Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments.
This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model in FP8 format with the post-training quantization (PTQ) method. For a general introduction to model quantization, see Model Quantization: Concepts, Methods, and Why It Matters.
What is NVIDIA Model Optimizer?
The NVIDIA Model Optimizer (ModelOpt) library incorporates state-of-the-art model optimization techniques to compress and accelerate AI models. These techniques include quantization, distillation, pruning, speculative decoding, and sparsity. ModelOpt accepts Hugging Face, PyTorch, or ONNX format models as input and provides Python APIs for users to easily combine different optimization techniques to produce optimized checkpoints.
ModelOpt supports highly performant quantization formats such as FP4, FP8, INT8, and INT4, and advanced algorithms including SmoothQuant, AWQ, SVDQuant, and Double Quantization. It supports both PTQ and quantization-aware training (QAT).
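As a quick illustration, many of these formats ship as predefined configurations in the mtq namespace, so a quantization run can start from a ready-made recipe. The config names below reflect recent ModelOpt releases and may vary by version:

import modelopt.torch.quantization as mtq

# Predefined recipes (names may differ across ModelOpt versions)
cfg = mtq.FP8_DEFAULT_CFG         # FP8 weights and activations
# cfg = mtq.INT8_SMOOTHQUANT_CFG  # INT8 with SmoothQuant calibration
# cfg = mtq.INT4_AWQ_CFG          # 4-bit weight-only with AWQ

# All recipes funnel through the same entry point:
# q_model = mtq.quantize(model, cfg, forward_loop=calibrate)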
What is CLIP?
CLIP (Contrastive Language-Image Pretraining), introduced by OpenAI in 2021, is a foundation vision language model (VLM) that learns a shared embedding space for images and text through contrastive learning on large image-text pairs. Its ability to produce semantically aligned representations has made it a core building block across modern multimodal systems.
The CLIP text encoder is widely reused as a conditioning module for text-to-image (Stable Diffusion, for example) and text-to-video (AnimateDiff, for example) synthesis. Its vision encoder serves as the visual backbone in multimodal LLMs, such as LLaVA, and open-vocabulary perception models, such as OWL-ViT. Successors such as OpenCLIP and SigLIP scale the data and refine the objective but preserve the dual-encoder contrastive paradigm.
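To make the dual-encoder design concrete, here is a minimal zero-shot matching example with the Hugging Face transformers API (the model ID and image path are illustrative):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("cat.jpg")  # any local image
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# Image-text similarity scores become probabilities over the candidate captions
probs = outputs.logits_per_image.softmax(dim=-1)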
Quantization recipe
The following quantization recipe serves as a step-by-step guide for running CLIP model quantization with ModelOpt and understanding how the process works.
First, prepare the corresponding models and datasets as shown below:
- Base CLIP model: CLIP-ViT-L-14-laion2B-s32B-b82K
- Calibration dataset for quantization: 10K subset from MS-COCO
- Model accuracy evaluation: three tasks from the CLIP_benchmark
  - cifar100 (zero-shot classification)
  - imagenet1k (zero-shot classification)
  - mscoco_captions (zero-shot retrieval)
How to run PTQ with ModelOpt
The following code sample shows how to run PTQ for the CLIP model in FP8 using ModelOpt:
import torch
from torch.utils.data import DataLoader, Subset
from transformers import CLIPModel, CLIPTokenizer, CLIPImageProcessor
from transformers.models.clip.modeling_clip import CLIPAttention

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.plugins.diffusion.diffusers import _QuantAttention

# FP8 (E4M3) per-tensor static quantization
FP8_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*[qkv]_bmm_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*bmm2_output_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "default": {"enable": False},
    },
    "algorithm": "max",
}

mto.enable_huggingface_checkpointing()

# Register a quantized replacement so SDPA-based attention is quantizable (explained below)
mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)

# args.model_ckpt: path or Hugging Face ID of the base CLIP checkpoint
model = CLIPModel.from_pretrained(args.model_ckpt, attn_implementation="sdpa").half().eval().cuda()
tokenizer = CLIPTokenizer.from_pretrained(args.model_ckpt)
processor = CLIPImageProcessor.from_pretrained(args.model_ckpt)

# CLIP_COCO_dataset: user-defined dataset pairing MS-COCO captions with images (sketched below)
calib_set = Subset(CLIP_COCO_dataset(ANN, IMG_DIR, tokenizer, processor), range(8192))
loader = DataLoader(calib_set, batch_size=512, num_workers=4)

# Calibration: 8,192 MS-COCO image-text pairs
def calibrate(m):
    for img, txt in loader:
        m.get_text_features(input_ids=txt.cuda())
        m.get_image_features(pixel_values=img.cuda())

q_model = mtq.quantize(model, FP8_CFG, forward_loop=calibrate)

# Save quantized ModelOpt checkpoint
q_model.save_pretrained(ckpt_path)
mtq.print_quant_summary(q_model)
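The CLIP_COCO_dataset class above is not part of any library; it stands in for a user-defined dataset that pairs MS-COCO captions with their images. A minimal sketch, assuming the standard COCO captions annotation JSON and the usual 12-digit zero-padded image file names:

import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class CLIP_COCO_dataset(Dataset):
    """Yields (pixel_values, input_ids) pairs for CLIP calibration."""
    def __init__(self, ann_file, img_dir, tokenizer, processor):
        anns = json.load(open(ann_file))["annotations"]
        self.samples = [(a["image_id"], a["caption"]) for a in anns]
        self.img_dir, self.tokenizer, self.processor = Path(img_dir), tokenizer, processor

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_id, caption = self.samples[idx]
        image = Image.open(self.img_dir / f"{image_id:012d}.jpg").convert("RGB")
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values[0]
        # Pad to a fixed length so samples batch cleanly in the DataLoader
        input_ids = self.tokenizer(caption, padding="max_length", truncation=True,
                                   return_tensors="pt").input_ids[0]
        return pixel_values, input_ids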
FP8_CFG is just one recipe: W8A8 (FP8 on both weights and activations), per-tensor, static quantization, calibrated with the simple AbsMax algorithm. ModelOpt supports many more dimensions of choice: per-channel or block-wise granularity, dynamic activation quantization, advanced calibration algorithms such as AWQ and GPTQ, and more.
For the detailed configuration schema, see the ModelOpt quantization guide. The hyperparameters in the quantization configuration can always be fine-tuned as needed, and finding the optimal values usually requires some iteration.
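For example, switching the weight quantizers to per-channel granularity is a one-line change to the configuration. The variant below is illustrative, not a recommendation:

PER_CHANNEL_FP8_CFG = {
    "quant_cfg": {
        # axis=0 computes one scaling factor per output channel instead of per tensor
        "*weight_quantizer": {"num_bits": (4, 3), "axis": 0},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None},
        "default": {"enable": False},
    },
    "algorithm": "max",
}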
After mtq.quantize returns, CLIP’s Linear layers all carry weight and activation quantizers, but the attention blocks are still untouched. This is because multi-head attention dispatches to torch.nn.functional.scaled_dot_product_attention (SDPA), a functional API that the ModelOpt module walker cannot intercept on its own.
To bring attention into the quantization scope, register a quantized replacement for CLIPAttention:
mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)
Each CLIPAttention instance is now upgraded to _QuantAttention from the ModelOpt diffusers plugin. Inside its forward pass, _QuantAttention transparently intercepts the SDPA call and inserts four quantizers around the fused kernel:
- q_bmm_quantizer, k_bmm_quantizer, and v_bmm_quantizer wrap the projected Q/K/V tensors before they enter the kernel
- bmm2_output_quantizer wraps the kernel output (softmax @ V) before it flows into out_proj
This ensures proper quantization throughout the attention mechanism.
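You can verify the registration took effect by listing the quantizers that now live inside the attention modules. A quick inspection sketch, assuming TensorQuantizer is importable from the ModelOpt quantization nn module as in recent releases:

from modelopt.torch.quantization.nn import TensorQuantizer

for name, module in q_model.named_modules():
    # The four attention quantizers all carry "bmm" in their names
    if isinstance(module, TensorQuantizer) and "bmm" in name:
        print(name, module)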
To restore some accuracy, it is often advisable to disable some of the quantizers using mtq.disable_quantizer. This function takes a filter function as input; the filter function receives a module name and decides whether its quantizer should be disabled. Using regex or string matching, you can select the layers to disable. In the following example, the quantizers are disabled in the patch_embedding layer of the CLIP model.
import re

def filter_func(name):
    # Match any quantizer under the patch_embedding layer
    pattern = re.compile(r".*(patch_embedding).*")
    return pattern.match(name) is not None

mtq.disable_quantizer(q_model, filter_func)
CLIP benchmark evaluation
The saved ModelOpt checkpoint can be restored into any downstream evaluation script. For details, see Restoring ModelOpt Models. The quantized CLIP checkpoint was evaluated on three benchmarks: zero-shot classification (CIFAR-100, ImageNet-1k) and zero-shot retrieval (MS-COCO Captions). The FP16 CLIP model serves as the baseline.
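Because Hugging Face checkpointing was enabled before saving, restoring is a plain from_pretrained call. A minimal sketch of the pattern described in Restoring ModelOpt Models:

import modelopt.torch.opt as mto
from transformers import CLIPModel

mto.enable_huggingface_checkpointing()
# from_pretrained now restores the quantizer states saved with the checkpoint
q_model = CLIPModel.from_pretrained(ckpt_path).cuda().eval()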

Based on the evaluation results, the CLIP-FP8 quantized model demonstrates quality comparable to the CLIP-FP16 baseline. Notably, when quantizers are disabled in the patch embedding layer, the impact of quantization on model quality becomes negligible.
Inside the ModelOpt PTQ flow
It’s important to understand that this stage involves “fake quantization”: the actual data type of the model hasn’t changed. Instead, the inserted quantizers act as observers that simulate the effects of quantization while keeping the model in its original floating-point format.
The fake quantization process works in two key ways:
- Statistics collection: The quantizers collect tensor statistics (minimum and maximum values, for example) as data passes through them. These statistics are used to calculate optimal quantization parameters such as scaling factors.
- Quantization simulation: The quantizers perform a quantize-then-dequantize (QDQ) operation on tensors flowing through the network. This only simulates low-precision computation; real speedups and memory savings come from exporting the model to deployment frameworks such as NVIDIA TensorRT. (A conceptual sketch of the QDQ round-trip follows this list.)
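Conceptually, each quantizer applies something like the following round-trip. This is a simplified illustration of FP8 E4M3 fake quantization, not ModelOpt’s actual implementation; it assumes PyTorch 2.1 or later for the float8_e4m3fn dtype, and 448 is the E4M3 maximum representable value:

import torch

def fake_quantize_e4m3(x: torch.Tensor, amax: torch.Tensor) -> torch.Tensor:
    scale = 448.0 / amax                               # map the calibrated amax to the FP8 range
    x_clamped = x.clamp(-amax, amax)                   # saturate values outside the range
    x_q = (x_clamped * scale).to(torch.float8_e4m3fn)  # quantize: round onto the FP8 grid
    return x_q.to(x.dtype) / scale                     # dequantize back to the original dtype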
This simulation is crucial because it enables you to evaluate the model’s accuracy before committing to actual quantization. The quantizers apply the same rounding and precision limitations that would occur in the deployed quantized model with downstream inference frameworks, so you can:
- Measure accuracy impacts before deployment
- Experiment with different quantization configurations
- Identify problematic layers that might need special handling
In general, the ModelOpt PTQ flow follows six stages:
- Prepare: Set quantization config to insert quantizer modules around the model’s weights and/or activations.
- Calibrate: Forward a small batch of representative data through the model so each quantizer can collect statistics (for example, activation amax) and derive its scaling factor.
- Fake quantization: Quantizers now apply a Q → DQ round-trip in floating point, faithfully simulating the precision loss of the target format while the model still runs in FP16/BF16.
- Evaluate: Measure accuracy on a held-out evaluation set and compare against the unquantized baseline.
- Iterate: If the gap is unacceptable, adjust the quantization configuration (granularity, algorithm, quantized layers), disable quantization for sensitive layers, and recalibrate.
- Export and deploy: Once the accuracy is acceptable, the fake-quantized weights are compressed into their true low-precision form and exported as a checkpoint for downstream engines. In this case, the PyTorch checkpoint is exported to ONNX and inference runs with TensorRT, where the speedups and memory savings are realized. (A minimal export sketch follows this list.)
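As an illustration of the last stage, the fake-quantized CLIP image encoder can be traced to ONNX with torch.onnx.export. The wrapper class and file names below are illustrative, and real exports typically need additional preparation and flags; see the ModelOpt deployment documentation for the supported flow:

import torch

class CLIPImageEncoder(torch.nn.Module):
    # Thin wrapper so the exported graph has a single tensor output
    def __init__(self, clip_model):
        super().__init__()
        self.clip = clip_model
    def forward(self, pixel_values):
        return self.clip.get_image_features(pixel_values=pixel_values)

dummy = torch.randn(1, 3, 224, 224, dtype=torch.float16, device="cuda")
torch.onnx.export(
    CLIPImageEncoder(q_model).eval(), (dummy,), "clip_image_encoder_fp8.onnx",
    input_names=["pixel_values"], output_names=["image_embeds"], opset_version=17,
)

The resulting ONNX graph carries the quantization scaling information that TensorRT can use to build true FP8 kernels.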

QAT recovers quantization-induced quality loss by fine-tuning the model weights with frozen quantizer states. It is more compute-intensive than PTQ but can recover more of the quantized model’s quality. For more details, see the ModelOpt examples.
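As a sketch of what QAT looks like in practice: the fake quantizers inserted by mtq.quantize are differentiable, so an ordinary fine-tuning loop applies. The train_loader below is an assumed task dataloader yielding (image, text) batches like the calibration loader:

import torch

optimizer = torch.optim.AdamW(q_model.parameters(), lr=1e-5)
q_model.train()
for img, txt in train_loader:  # assumed fine-tuning dataloader
    out = q_model(input_ids=txt.cuda(), pixel_values=img.cuda(), return_loss=True)
    out.loss.backward()        # CLIP's built-in contrastive loss
    optimizer.step()
    optimizer.zero_grad()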
Get started with NVIDIA Model Optimizer
This post introduced NVIDIA Model Optimizer and demonstrated a typical post-training quantization workflow by quantizing the CLIP model to FP8 with a practical code example. The results across three evaluation datasets show that FP8 quantization can preserve model quality while enabling a more efficient deployment path.
Ready to start using ModelOpt with your own models? Follow this workflow: prepare the model and calibration data, set the quantization configuration, calibrate, validate the quantized model against task-specific quality metrics, then save and restore ModelOpt checkpoints.
To explore additional workflows and adapt ModelOpt for your own use cases, see the ModelOpt documentation.