As AI models get larger and architectures more complex, researchers and engineers are continuously finding new techniques to optimize the performance and overall cost of bringing AI systems to production.
Model optimization is a category of techniques focused on improving inference efficiency. These techniques represent some of the best “bang for buck” opportunities to reduce cost, improve user experience, and scale, ranging from fast and effective approaches like model quantization to powerful multistep workflows like pruning and distillation.
This post covers the top five model optimization techniques enabled through NVIDIA Model Optimizer and how each contributes to improving the performance, total cost of ownership (TCO), and scalability of deployments on NVIDIA GPUs.
These techniques are the most powerful and scalable levers currently available in Model Optimizer that teams can apply immediately to reduce cost per token, improve throughput, and accelerate inference at scale.

1. Post-training quantization
Post-training quantization (PTQ) is the fastest path to model optimization. You take an existing model (FP16/BF16/FP8) and compress it to a lower-precision format (FP8, NVFP4, INT8, INT4) using a calibration dataset, without touching the original training loop. This is where most teams should begin: PTQ is easy to apply with Model Optimizer and delivers immediate latency and throughput wins, even on massive foundation models.
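As a rough illustration, here is a minimal PTQ sketch using Model Optimizer's `mtq` API on a Hugging Face checkpoint. The model name, calibration prompts, and the FP8 config choice are placeholders, and config names can vary between modelopt releases.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

# Placeholder checkpoint and calibration prompts; in practice use a few hundred
# representative samples from your target domain.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
calib_texts = ["Example prompt for calibration.", "Another representative prompt."]

def forward_loop(m):
    # Run calibration data through the model so the quantizers can collect
    # activation statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# FP8 shown here; NVFP4/INT8/INT4 configs follow the same pattern.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```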

| Pros | Cons |
| --- | --- |
| Fastest time to value. Achievable with a small calibration dataset. Memory, latency, and throughput gains stack with other optimizations. Highly customizable quantization recipes (NVFP4 KV cache, for example). | May require a follow-up technique (QAT/QAD) if accuracy drops below the SLA quality floor. |
To learn more, see Optimizing LLMs for Performance and Accuracy with Post-Training Quantization.
2. Quantization-aware training
Quantization-aware training (QAT) injects a short, targeted fine-tuning phase in which the model is tuned to account for low-precision error. It simulates quantization noise in the forward pass while computing gradients in higher precision. QAT is the recommended next step when additional accuracy is required beyond what PTQ delivers.
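Conceptually, the QAT flow with Model Optimizer is ordinary fine-tuning applied to a model that has already been converted with `mtq.quantize`, as in the PTQ sketch above. The loop below shows that idea with placeholder data, optimizer settings, and step budget.

```python
import torch

# Placeholder fine-tuning loop: `model` has already been converted with
# mtq.quantize (see the PTQ sketch), and `train_dataloader` yields tokenized
# batches with input_ids/attention_mask.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for step, batch in enumerate(train_dataloader):
    # The forward pass sees fake-quantized weights and activations,
    # while gradients are computed in higher precision.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= 1000:  # QAT is typically a short fine-tune, not a full retrain
        break
```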

| Pros | Cons |
| --- | --- |
| Recovers all or most of the accuracy loss at low precision. Fully compatible with NVFP4, especially for FP4 stability. | Requires training budget and data. Takes longer to implement than PTQ alone. |
To learn more, see How Quantization-Aware Training Enables Low-Precision Accuracy Recovery.
3. Quantization-aware distillation
Quantization-aware distillation (QAD) goes one level beyond QAT. With this technique, the student model learns to account for quantization errors while simultaneously being aligned to the full-precision teacher through a distillation loss. QAD builds on QAT by adding a teacher signal borrowed from knowledge distillation, enabling you to extract the maximum quality possible while running at ultra-low precision at inference time. QAD is an effective option for downstream tasks that notoriously suffer significant performance degradation after quantization.
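Model Optimizer ships its own distillation helpers, but the core of QAD can be sketched in plain PyTorch: the quantized student minimizes its task loss plus a KL term against a frozen full-precision teacher. Everything below (models, data, hyperparameters) is a placeholder, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

# Placeholders: `student_model` is an mtq-quantized copy of the model,
# `teacher_model` is the frozen full-precision original, and `train_dataloader`
# yields tokenized batches.
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-5)
temperature, alpha = 2.0, 0.5  # illustrative hyperparameters

for batch in train_dataloader:
    with torch.no_grad():
        teacher_logits = teacher_model(**batch).logits
    student_out = student_model(**batch, labels=batch["input_ids"])
    # The KL term aligns the quantized student's distribution with the teacher's.
    kd_loss = F.kl_div(
        F.log_softmax(student_out.logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    loss = alpha * student_out.loss + (1 - alpha) * kd_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```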

| Pros | Cons |
| --- | --- |
| Highest accuracy recovery. Ideal for multistage post-training pipelines, with easy setup and robust convergence. | Additional training cycles after pretraining. Larger memory footprint. Slightly more complex pipeline to implement today. |
To learn more, see How Quantization-Aware Training Enables Low-Precision Accuracy Recovery.
4. Speculative decoding
The decode step in inference is a well-known sequential bottleneck: each new token depends on the ones before it. Speculative decoding tackles this directly by using a smaller, faster draft model (such as EAGLE-3) to propose multiple tokens ahead, then verifying them in parallel with the target model. This collapses several sequential decode steps into a single verification pass and dramatically reduces the number of target-model forward passes at long sequence lengths, without touching model weights.
Speculative decoding is recommended when you want immediate generation speedups without retraining or quantization, and it stacks cleanly with the other optimizations in this list to compound throughput and latency gains.
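To make the mechanics concrete, here is a deliberately simplified, greedy sketch of the draft-and-verify loop (batch size 1, no KV cache, no sampling); production EAGLE-style decoding in serving frameworks is considerably more sophisticated. `draft_model` and `target_model` are placeholder causal LMs that share a tokenizer.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    """One greedy draft-and-verify step; returns the extended sequence."""
    # 1) The draft model proposes k tokens autoregressively (cheap per step).
    draft_ids = input_ids
    for _ in range(k):
        next_tok = draft_model(draft_ids).logits[:, -1:].argmax(-1)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2) The target model scores the whole proposed block in one forward pass.
    n = input_ids.shape[1]
    target_preds = target_model(draft_ids).logits[:, n - 1 :].argmax(-1)  # (1, k+1)

    # 3) Accept the longest prefix where draft and target agree (batch size 1).
    proposed = draft_ids[:, n:]                                           # (1, k)
    matches = (proposed == target_preds[:, :k]).long().cumprod(-1).sum().item()
    accepted = proposed[:, :matches]

    # The target's own prediction at the first mismatch (or after the block)
    # is appended, so every verification pass yields at least one new token.
    bonus = target_preds[:, matches : matches + 1]
    return torch.cat([input_ids, accepted, bonus], dim=-1)
```

In practice, a sketch like this would be called in a loop until an end-of-sequence token appears, with k tuned so the draft's acceptance rate stays high.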

| Pros | Cons |
| --- | --- |
| Radically reduces decode latency. Stacks perfectly with PTQ/QAT/QAD and NVFP4. | Requires tuning (acceptance rate is everything). A second model or head is required, depending on the variant. |
To learn more, see An Introduction to Speculative Decoding for Reducing Latency in AI Inference.
5. Pruning plus knowledge distillation
Pruning is a structural optimization path. This technique removes weights, layers, and/or attention heads to make the model smaller. Distillation then teaches the new, smaller model to behave like the larger teacher. This multistep strategy delivers lasting gains because the model's baseline compute and memory footprint are permanently lowered.
Pruning plus knowledge distillation is worth reaching for when the other techniques in this list cannot deliver the memory or compute savings an application requires. It is also a good fit when teams are open to making more aggressive changes to an existing model to adapt it for specialized downstream use cases.
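For intuition only, the sketch below does naive depth pruning on a Llama-style Hugging Face checkpoint by dropping every other transformer block, then hands off to a distillation loop like the QAD sketch above. Model Optimizer's own pruning and distillation APIs automate and refine both steps; the checkpoint name and pruning pattern here are arbitrary placeholders.

```python
import copy
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; assumes a Llama-style model whose transformer blocks
# live under model.model.layers.
teacher_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)
student_model = copy.deepcopy(teacher_model)

# Step 1: structural (depth) pruning -- keep every other transformer block,
# roughly halving depth, compute, and memory.
kept = nn.ModuleList(
    block for i, block in enumerate(student_model.model.layers) if i % 2 == 0
)
student_model.model.layers = kept
student_model.config.num_hidden_layers = len(kept)

# Step 2: knowledge distillation -- train student_model against teacher_model
# with a teacher-student KL loss (as in the QAD sketch above) so the smaller
# model recovers the teacher's behavior.
```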

| Pros | Cons |
| --- | --- |
| Reduces parameter count for permanent, structural cost savings. Enables smaller models that still behave like large models. | Aggressive pruning without distillation can cause accuracy cliffs. Requires more pipeline work than PTQ alone. |
To learn more, see Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer.
Get started with AI model optimization
Optimization techniques come in all shapes and sizes. This post highlights the top five model optimization techniques enabled through Model Optimizer:
- PTQ, QAT, QAD, and pruning plus distillation make your model intrinsically cheaper, smaller, and more memory efficient to operate.
- Speculative decoding makes generation intrinsically faster by collapsing sequential latency.
To get started and learn more, explore the deep-dive posts associated with each technique for technical explainers, performance insights, and Jupyter Notebook walkthroughs.