Large language models (LLMs) have set a high bar in natural language processing (NLP) tasks such as coding, reasoning, and math. However, their deployment remains resource-intensive, motivating a growing interest in small language models (SLMs) that offer strong performance at a fraction of the cost.
NVIDIA researchers and engineers have demonstrated a method that combines structured weight pruning with knowledge distillation, a powerful strategy for compressing large models into smaller, efficient variants without significant loss in quality. For more details, see Compact Language Models via Pruning and Knowledge Distillation.
This post explains model pruning and knowledge distillation, how they work, and how you can easily apply them to your own models to achieve optimal performance using NVIDIA TensorRT Model Optimizer.
What is model pruning?
Pruning is a model optimization technique that exploits the over-parameterization common in neural networks, which arises from training models with enough capacity to learn complex features and converge smoothly. Pruning systematically identifies and removes unimportant parameters such as weights, neurons, or even entire layers from a trained model.
This process can often eliminate large amounts of a model’s weights with minimal impact on accuracy, directly translating to a more compact model with accelerated inference speeds and lower computational cost. Similar to how an arborist trims a tree to improve its health and growth, model pruning makes a model smaller and more efficient.
Depth pruning and width pruning are the two main approaches.
Depth pruning removes entire layers from the neural network, reducing the overall depth and complexity (Figure 1).

Width pruning eliminates internal structures such as individual neurons, attention heads, or embedding channels, slimming down the model’s width (Figure 2).
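To make the distinction concrete, here is a minimal PyTorch sketch (a toy model for illustration, not the TensorRT Model Optimizer API) contrasting the two approaches: depth pruning drops whole blocks, while width pruning slices weight matrices to shrink an internal dimension.

import torch
import torch.nn as nn

# Toy "transformer" stand-in: a stack of identical feed-forward blocks.
hidden, ffn, n_layers = 64, 256, 8
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
     for _ in range(n_layers)]
)

# Depth pruning: remove entire blocks (here, keep the first 6 of 8 layers).
depth_pruned = nn.ModuleList([blocks[i] for i in range(6)])

# Width pruning: shrink the FFN intermediate size from 256 to 192
# by slicing the up- and down-projection weights.
def shrink_ffn(block: nn.Sequential, new_ffn: int) -> nn.Sequential:
    up, act, down = block[0], block[1], block[2]
    new_up, new_down = nn.Linear(hidden, new_ffn), nn.Linear(new_ffn, hidden)
    with torch.no_grad():
        new_up.weight.copy_(up.weight[:new_ffn])         # keep first new_ffn rows
        new_up.bias.copy_(up.bias[:new_ffn])
        new_down.weight.copy_(down.weight[:, :new_ffn])  # and the matching columns
        new_down.bias.copy_(down.bias)
    return nn.Sequential(new_up, act, new_down)

width_pruned = nn.ModuleList([shrink_ffn(b, 192) for b in blocks])

Here the kept layers and rows are chosen by position purely for brevity; in practice they are selected by importance, as discussed next.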

The core idea is to identify and remove parts of the LLM that contribute the least to its overall performance. Different methods are used to assess the importance of different components, such as:
- Magnitude pruning: Removes weights with small absolute values by setting them to zero.
- Activation-based pruning: Uses a calibration dataset to estimate the importance of different parts of the model based on their activations (see the sketch after this list).
- Structural pruning: Removes entire structures, like layers or attention heads.
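As a rough illustration of activation-based importance scoring (a simplified sketch under toy assumptions, not the Minitron or Model Optimizer algorithm), you can pass a small calibration batch through a layer, score each neuron by its average activation magnitude, and keep only the top-scoring neurons:

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 256)        # hypothetical FFN up-projection
calib = torch.randn(1024, 64)     # small calibration set (e.g., 1,024 samples)

# Score each of the 256 intermediate neurons by its mean absolute activation.
with torch.no_grad():
    activations = torch.relu(layer(calib))        # shape (1024, 256)
    importance = activations.abs().mean(dim=0)    # shape (256,)

# Keep the 192 most important neurons and copy over their weights.
keep = importance.topk(k=192).indices.sort().values
pruned = nn.Linear(64, 192)
with torch.no_grad():
    pruned.weight.copy_(layer.weight[keep])
    pruned.bias.copy_(layer.bias[keep])

Any downstream layer consuming these activations would need its corresponding input columns sliced in the same way.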
Research shows that width pruning typically achieves better accuracy than depth pruning, though depth pruning often reduces inference latency more at the same number of parameters. The choice between depth pruning, width pruning, or a combination of both should depend on the desired balance between accuracy and latency. For more information, see LLM Pruning and Distillation in Practice: The Minitron Approach.
What is knowledge distillation?
Knowledge distillation is a model compression technique that transfers knowledge from a larger “teacher” model to a smaller and more efficient “student” model (Figure 3). The goal is to create a compact model that retains the high performance of the larger model, making it suitable for deployment at a lower resource cost.

Knowledge distillation trains a compact student model to emulate a larger teacher, not by relying solely on hard labels, but by learning from the teacher’s guidance. This transfers rich, generalizable behavior so the student approaches the teacher’s accuracy while running far more efficiently.
Two common distillation styles, response-based and feature-based, differ in how each passes knowledge from teacher to student.
What is response-based knowledge distillation?
Response-based knowledge distillation transfers a teacher model’s knowledge to a student by training the student to match the teacher’s soft output probabilities rather than only hard labels. These soft targets convey inter-class similarities, for example that “cat” is closer to “tiger” than to “car,” and the student is optimized to align with them using KL divergence.
The approach is simple to implement, requires no access to the teacher’s internal features, and is highly effective for classification tasks. In practice, it’s common to combine the distillation loss with standard cross-entropy on ground-truth labels and tune the loss weights to balance stability and fidelity, yielding compact models that preserve much of the teacher’s accuracy.
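A minimal sketch of this combined loss in PyTorch (generic logits and labels, not tied to any particular framework or to the Model Optimizer implementation) might look like the following:

import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=2.0, alpha=0.5):
    """Weighted sum of KL against the teacher's soft targets and hard-label CE."""
    # Soften both distributions with the same temperature.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence against the teacher's soft targets
    # (scaled by T^2 to keep gradient magnitudes comparable).
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Standard cross-entropy on ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce

# Example: a batch of 4 examples over a 10-class output.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = response_distillation_loss(student_logits, teacher_logits, labels)

The temperature controls how much of the teacher's inter-class structure is exposed, and alpha trades off imitating the teacher against fitting the hard labels.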

What is feature-based knowledge distillation?
Feature-based knowledge distillation transfers a teacher’s intermediate representations (hidden activations or feature maps) to guide a student toward learning similar internal structure, not just similar outputs. During training, selected teacher and student layers are paired and aligned; projection layers are often used when their dimensions differ.
This deeper, layer-level supervision provides richer signals than response-based KD and has proven effective across vision (CNN feature maps, for example) and NLP (Transformer hidden states and attentions, for example). Because it relies on internal activations, this technique requires access to the teacher’s intermediate layers and careful layer selection and weighting alongside the standard task loss to balance stability and accuracy.
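A minimal sketch of this pattern (hypothetical layer pairing and hidden sizes, chosen only for illustration) projects the student’s hidden state to the teacher’s width and penalizes the mismatch alongside the task loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

student_dim, teacher_dim = 512, 1024
projector = nn.Linear(student_dim, teacher_dim)   # bridges the width mismatch

def feature_distillation_loss(student_hidden, teacher_hidden,
                              student_logits, labels, beta=1.0):
    """Task cross-entropy plus MSE between paired hidden states."""
    # Align the student's representation with the teacher's feature space.
    projected = projector(student_hidden)
    feature_loss = F.mse_loss(projected, teacher_hidden)

    task_loss = F.cross_entropy(student_logits, labels)
    return task_loss + beta * feature_loss

# Example with a batch of 4 pooled hidden states and a 10-class task head.
student_hidden = torch.randn(4, student_dim, requires_grad=True)
teacher_hidden = torch.randn(4, teacher_dim)      # teacher features are fixed
student_logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))
loss = feature_distillation_loss(student_hidden, teacher_hidden,
                                 student_logits, labels)

In practice, several layer pairs are often supervised at once, each with its own projection and loss weight.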

Pruning and distillation form a powerful pipeline for model compression, enabling the creation of SLMs that are well-suited for deployment in production environments and edge applications. TensorRT Model Optimizer streamlines applying these techniques at scale, turning state-of-the-art LLMs into deployable, cost-effective solutions.
How to prune a model using TensorRT Model Optimizer
This section walks you through how to build a pipeline using TensorRT Model Optimizer. It includes dataset preparation, fine-tuning a teacher model on the WikiText dataset, and applying pruning and distillation techniques to produce a 6B-parameter model from Qwen3-8B. For more information, see the Qwen3-8B Pruning and Distillation with NeMo 2.0 Framework notebook.
Prior to pruning and distillation, it is necessary to convert Hugging Face models to the NVIDIA NeMo checkpoint format and preprocess the dataset. For detailed instructions, refer to the model conversion and data preparation step.
Here, we will demonstrate how to prune using both the depth pruning and width pruning approaches. The scripts provided can be run inside the NVIDIA NeMo framework container nvcr.io/nvidia/nemo:25.09.
How to depth prune the model to create a student
The initial approach trims the Qwen3 8B model from 36 to 24 layers (about 6B parameters) by automatically selecting the best 24 layers to keep, using a small calibration dataset of 1,024 samples.
The script for this process is provided below, showing how to prune using a two-GPU pipeline parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
--devices 2 \
--pp_size 2 \
--restore_path Qwen3-8B-nemo \
--legacy_ckpt \
--save_path Qwen3-8B-nemo-depth-pruned \
--seq_length 4096 \
--num_train_samples 1024 \
--mbs 4 \
--data_paths wikitext-data/wikitext-train_text_document \
--target_num_layers 24
How to width prune the model to create a student
The second, alternative approach to model size reduction involves width pruning. This is achieved by shrinking key architectural components: the MLP intermediate dimension (ffn_hidden_size) is reduced from 12,288 to 9,216, and the embedding dimension (hidden_size) from 4,096 to 3,584, also resulting in a 6B-parameter model.
Further reductions in the number of attention heads (num_attention_heads) and GQA query groups (num_query_groups) can be made as needed. The layer count (num_layers) may also be adjusted to achieve the desired model size.
The script for this process is provided below, showing how to prune using a two-GPU pipeline parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
--devices 2 \
--pp_size 2 \
--restore_path Qwen3-8B-nemo \
--legacy_ckpt \
--save_path Qwen3-8B-nemo-width-pruned \
--seq_length 4096 \
--num_train_samples 1024 \
--mbs 4 \
--data_paths wikitext-data/wikitext-train_text_document \
--target_ffn_hidden_size 9216 \
--target_hidden_size 3584
By trimming redundant or low-importance weights, pruning not only shrinks the model’s memory footprint but can also speed up inference. However, this process is typically followed by fine-tuning or retraining to recover any accuracy lost during the pruning phase and to ensure the pruned model maintains high performance on target tasks. This is where distillation comes in.
How to use TensorRT Model Optimizer for distillation
This example distills the Qwen3 depth- and width-pruned models using knowledge distillation with Model Optimizer and the NeMo 2.0 Framework.
When distilling knowledge from the teacher model to the depth-pruned model, the path of the student model will be Qwen3-8B-nemo-depth-pruned. This path corresponds to the output of the depth-pruning step, as detailed in the NeMo distillation notebook.
The script for this process is provided below, showing how to distill using a single-node, eight-GPU tensor parallel setup. In practice, we recommend multinode training to shorten training time.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
--name Qwen3-8B-nemo-depth-pruned-distill \
--devices 8 \
--num_nodes 1 \
--tp_size 8 \
--model_path Qwen3-8B-nemo-depth-pruned \
--teacher_path Qwen3-8B-nemo \
--legacy_ckpt \
--max_steps 40 \
--warmup_steps 1 \
--gbs 768 \
--mbs 8 \
--lr 1e-4 \
--min_lr 1e-5 \
--seq_length 4096 \
--log_dir . \
--log_interval 5 \
--val_check_interval 5 \
--limit_val_batches 2 \
--data_paths wikitext-data/wikitext-train_text_document
While distilling knowledge from the teacher to the width-pruned model, the student_model_path would be Qwen3-8B-nemo-width-pruned, as produced by the width-pruning step in the NeMo pruning notebook. Further details can be found in the NeMo distillation notebook.
The script for this process is provided below, showing how to distill using a single-node, eight-GPU tensor parallel setup. In practice, we recommend multinode training to shorten training time.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
--name Qwen3-8B-nemo-width-pruned-distill \
--devices 8 \
--num_nodes 1 \
--tp_size 8 \
--model_path Qwen3-8B-nemo-width-pruned \
--teacher_path Qwen3-8B-nemo \
--legacy_ckpt \
--max_steps 40 \
--warmup_steps 1 \
--gbs 768 \
--mbs 8 \
--lr 1e-4 \
--min_lr 1e-5 \
--seq_length 4096 \
--log_dir . \
--log_interval 5 \
--val_check_interval 5 \
--limit_val_batches 2 \
--data_paths wikitext-data/wikitext-train_text_document
For more comprehensive information, see the NeMo Framework distillation documentation. These resources will help you easily enable and integrate distillation into your workflow.
How do pruning and distillation impact model performance?
Experimental results for pruning and distillation from Qwen3 8B using Model Optimizer show that the Qwen3 depth-pruned 6B model is 30% faster than the Qwen3 4B model, and it also performs better on the MMLU (Massive Multitask Language Understanding) benchmark. Depth pruning was applied to reduce the model from 36 to 24 layers, resulting in a 6B model, using one NVIDIA H100 80 GB HBM3 GPU.
The pruned model is distilled from Qwen3-8B using the OptimalScale/ClimbMix data, processed from the nvidia/ClimbMix pretraining dataset. The experiment uses 25% of the data, approximately 90B tokens. Distillation takes 8 hours on 96 nodes, each with eight NVIDIA H100 GPUs (roughly 6K GPU hours).

The 6B pruned model demonstrates a significant performance advance over its 4B counterpart, achieving a 30% increase in speed that makes it considerably more efficient across a range of computational tasks. For the throughput comparison, all models were quantized to FP8 precision using Model Optimizer and run with TensorRT-LLM.
Beyond its speed advantage, the 6B pruned model also exhibits superior accuracy, as evidenced by its higher score on the MMLU benchmark. With a score of 72.5, it surpasses the 4B model’s score of 70.0, indicating a better understanding and capability across a broad range of language-related tasks.
This dual improvement in both speed and accuracy positions the 6B pruned model as a more robust and effective solution for applications requiring both rapid processing and high-quality results.
Because the pruned models were distilled on a pretraining dataset, the resulting model is a base variant. Accordingly, all models were compared only on base-model benchmarks such as MMLU. Using these models for reasoning tasks in practice would also require post-training.
Get started with pruning and knowledge distillation
Pruning and knowledge distillation are highly cost-effective methods to progressively shrink LLMs while matching or exceeding baseline accuracy across domains, and they’re typically more data-efficient than either synthetic-data fine-tuning or full pretraining.
Ready to get started? Check out the Qwen3 8B Pruning and Distillation with NeMo 2.0 Framework notebook. Visit the NVIDIA/TensorRT-Model-Optimizer GitHub repo to learn more about pruning and distillation. For more information about model optimization techniques using TensorRT Model Optimizer, see related posts on post-training quantization, quantization-aware training, and speculative decoding.