
A Fine-tuning–Free Approach for Rapidly Recovering LLM Compression Errors with EoRA

Model compression techniques have been extensively explored to reduce the computational resource demands of serving large language models (LLMs) or other large neural networks.

However, most existing methods either incur significant accuracy degradation compared to uncompressed models or have long training times. Also, their adaptability is often constrained by a limited range of hardware-supported compression formats (for example, 2:4 sparsity, 3/4-bit quantization), making it difficult to address various user requirements for accuracy and efficiency. 

NVIDIA Research Taiwan, Learning & Perception Research Group, AI Accelerator & VLSI Research Group, and NeMo Group reframe model compression as customized compensation. They developed Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation (EoRA), which introduces residual low-rank paths to compensate for compression errors caused by various compression techniques under diverse user needs, such as tasks or compression ratios.

As a fine-tuning–free optimization method, EoRA requires no gradient computation and can be completed within a few minutes using minimal calibration data. It can also serve as a good starting point for fine-tuning, and remains robust to quantization to further reduce overhead.

EoRA effectively compensates compressed LLMs on language generation, commonsense reasoning, and math tasks. It consistently outperforms previous SVD-based approaches, especially for aggressively compressed (pruned, quantized, or both) models. For example, we saw 4.53%, 3.48%, and 11.83% improvements on ARC-Challenge, MathQA, and GSM8K when compensating 2:4-pruned Llama3-8B. Moreover, the EoRA module remains resilient under 3/4-bit quantization with minimal accuracy drop, underscoring its practicality in compensating for compression errors.

Figure 1. Overview of a proposed model compensation framework, EoRA

How does EoRA work?

Compared with standard model compression techniques and algorithms, model compensation introduces residual low-rank paths to compensate for compression errors \Delta W, resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats. 
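As a concrete illustration of this residual path, the following minimal NumPy sketch (function and variable names are illustrative assumptions, not any library API) applies a compressed weight plus a low-rank compensation term in a linear layer:

import numpy as np

def compensated_forward(x, w_compressed, B, A):
    # x: (n_tokens, d_in) input activations
    # w_compressed: (d_out, d_in) pruned or quantized weight
    # B: (d_out, r), A: (r, d_in) residual low-rank path, with B @ A approximating the error
    return x @ w_compressed.T + (x @ A.T) @ B.T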

To derive the low-rank residual paths that can represent compression errors \Delta W, one naive method is to directly derive a closed-form solution by using singular value decomposition (SVD). However, naively applying SVD fails to account for the varying importance of individual model weights, resulting in suboptimal utilization of the low-rank representation capacity. 
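For reference, the naive baseline simply truncates an SVD of the raw error and keeps the top singular directions, with no notion of weight importance (a minimal sketch under the same assumed shapes):

import numpy as np

def svd_compensate(delta_w, rank):
    # Plain truncated SVD of the compression error, ignoring activation statistics
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]  # B, A with B @ A approximating delta_w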

To address this problem, EoRA projects the compression error \Delta W into the eigenspace of the corresponding layer’s input activations X, ensuring a direct relationship between the error approximation loss and the overall layer-wise model compression loss. 

More specifically, we first perform the eigendecomposition on the input activations X from the calibration set to derive the eigenspace projection matrix Q^{'} from its eigenvectors and eigenvalues. We then project the compression error \Delta W into the eigenspace with the projection matrix Q^{'} to obtain the projected error \Delta W^{'} = \Delta WQ^{'}.

SVD is then applied to \Delta W^{'}, yielding the low-rank factors B^{'} and A^{'} in the eigenspace. In this way, the approximation ensures that error columns associated with larger eigenvalues are approximated more accurately than those with smaller eigenvalues, facilitating a more effective allocation of the limited low-rank expressive power.

Finally, we project A^{'} back using the inverse projection matrix {Q^{'}}^{-1} to obtain A = A^{'}{Q^{'}}^{-1} and use B^{'} and A to approximate compression error \Delta W in the original space since the following is true: 

B^{'}A = B^{'}A^{'}{Q^{'}}^{-1} \approx \Delta W^{'}{Q^{'}}^{-1} = \Delta W Q^{'}{Q^{'}}^{-1} = \Delta W
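The following NumPy sketch walks through these steps end to end. It is a simplified reading of the method: the eigendecomposition is taken over the activation Gram matrix XX^T, and the names, shapes, and numerical guards are assumptions for illustration rather than the authors' reference implementation.

import numpy as np

def eora_compensate(delta_w, x, rank):
    # delta_w: (d_out, d_in) compression error W - W_compressed
    # x: (d_in, n_tokens) calibration activations
    # Eigendecomposition of the activation Gram matrix X X^T
    eigvals, eigvecs = np.linalg.eigh(x @ x.T)
    eigvals = np.clip(eigvals, 1e-8, None)  # guard against tiny negative values
    q_prime = eigvecs * np.sqrt(eigvals)    # eigenspace projection matrix Q'
    q_prime_inv = np.linalg.inv(q_prime)

    # Project the compression error into the eigenspace: delta_w_prime = delta_w @ Q'
    delta_w_prime = delta_w @ q_prime

    # Truncated SVD in the eigenspace favors error directions tied to large eigenvalues
    u, s, vt = np.linalg.svd(delta_w_prime, full_matrices=False)
    b = u[:, :rank] * s[:rank]              # B' (d_out, rank)
    a_prime = vt[:rank, :]                  # A' (rank, d_in)

    # Project A' back to the original space: B'A = B'A'Q'^{-1}, approximating delta_w
    a = a_prime @ q_prime_inv
    return b, a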

Figure 2 shows the whole process.

Figure 2. Proposed EoRA, which projects the compression error \Delta W into the eigenspace of input activations X and performs low-rank approximation on the projected error \Delta W^{'}

The overall fine-tuning–free optimization in EoRA can be done in minutes using only a small amount of calibration data, without any gradient computation or time-consuming training.

EoRA can also provide a better initialization for fine-tuning, further enhancing the accuracy of compressed models and offering a trade-off between accuracy and training time.

EoRA is also robust to quantization, which can further reduce the additional cost of residual low-rank compensation paths.

For more information about the detailed algorithm and mathematical analyses, see EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation.

Performance

EoRA is compatible with various compression techniques, including pruning, quantization, and their combination. It consistently outperforms previous SVD-based methods on various tasks, including language generation, commonsense reasoning, and math.

EoRA can also provide better initialization for fine-tuning, and is robust to quantization and rank numbers.

Compression error compensation

The scores in Tables 1-3 show that EoRA significantly and consistently outperforms the SVD-based baseline method, ZeroQuant-V2, on models compressed by various techniques, including pruning (Table 1), quantization (Table 2), and their combination (Table 3).

EoRA also works on different transformer backbones, such as Llama2 and Llama3. For more information, see EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation.

| Pruning method | Sparsity | Compensation method | Wikitext2 (↓) | ARC-C (↑) | MathQA (↑) | GSM8K (↑) |
|---|---|---|---|---|---|---|
| Uncompressed | N/A | N/A | 6.13 | 50.4 | 40.1 | 36.2 |
| SparseGPT | 2:4 | None | 12.32 | 30.1 | 26.4 | 2.1 |
| SparseGPT | 2:4 | ZeroQuant-V2 | 11.31 | 32.0 | 26.5 | 3.0 |
| SparseGPT | 2:4 | EoRA | 11.07 (-0.24) | 34.6 (+2.6) | 29.9 (+3.4) | 13.9 (+10.9) |
| Wanda | 2:4 | None | 21.4 | 27.0 | 25.1 | 0.8 |
| Wanda | 2:4 | ZeroQuant-V2 | 17.2 | 30.5 | 26.2 | 1.3 |
| Wanda | 2:4 | EoRA | 14.0 (-3.2) | 34.8 (+4.3) | 30.0 (+3.8) | 11.5 (+10.2) |
Table 1. Perplexity and commonsense/math reasoning results for pruned Llama3-8B
| Quantization method | Bit-width | Compensation method | Wikitext2 (↓) | ARC-C (↑) | MathQA (↑) | GSM8K (↑) |
|---|---|---|---|---|---|---|
| Uncompressed | N/A | N/A | 6.13 | 50.4 | 40.1 | 36.2 |
| GPTQ | 3-bit | None | 15.64 | 20.9 | 22.4 | 0.4 |
| GPTQ | 3-bit | ZeroQuant-V2 | 10.24 | 30.0 | 26.4 | 3.8 |
| GPTQ | 3-bit | EoRA | 10.06 (-0.18) | 31.7 (+1.7) | 29.1 (+2.7) | 11.9 (+8.1) |
Table 2. Perplexity and commonsense/math reasoning results for quantized Llama3-8B
| Compression method | Bit-width | Sparsity | Compensation method | Wikitext2 (↓) | ARC-C (↑) | MathQA (↑) | GSM8K (↑) |
|---|---|---|---|---|---|---|---|
| Uncompressed | N/A | N/A | N/A | 6.13 | 50.4 | 40.1 | 36.2 |
| GPTQ + SparseGPT | 4-bit | 2:4 | None | 86.15 | 18.3 | 19.9 | 0.0 |
| GPTQ + SparseGPT | 4-bit | 2:4 | ZeroQuant-V2 | 12.84 | 29.4 | 26.9 | 1.6 |
| GPTQ + SparseGPT | 4-bit | 2:4 | EoRA | 12.60 (-0.24) | 31.2 (+1.8) | 29.6 (+2.7) | 10.2 (+8.6) |
Table 3. Perplexity and commonsense/math reasoning results for aggressively compressed Llama3-8B

Fine-tuning with EoRA

You can fine-tune EoRA to further recover the accuracy of compressed models, with more significant improvements than baseline methods.

| Compression method | Config | Initialization | ARC-C (↑) | MathQA (↑) |
|---|---|---|---|---|
| Uncompressed | N/A | w/o fine-tuning | 50.4 | 40.1 |
| Uncompressed | N/A | Standard | 56.4 | 53.6 |
| GPTQ | 3-bit | w/o fine-tuning | 20.9 | 22.4 |
| GPTQ | 3-bit | QLoRA | 30.3 | 34.1 |
| GPTQ | 3-bit | LoftQ | 44.7 | 48.2 |
| GPTQ | 3-bit | EoRA | 47.4 (+2.7) | 53.9 (+5.7) |
| SparseGPT | 2:4 | w/o fine-tuning | 30.1 | 26.4 |
| SparseGPT | 2:4 | QLoRA | 41.3 | 45.4 |
| SparseGPT | 2:4 | LoftQ | 43.7 | 48.8 |
| SparseGPT | 2:4 | EoRA | 48.5 (+4.8) | 54.7 (+5.9) |
Table 4. Fine-tuning results for compressed Llama3-8B

Table 4 shows various compression settings using different initializations of the low-rank matrices. The scores show the improvement over the baselines, QLoRA and LoftQ. The fine-tuned compressed models are competitive with the uncompressed full-precision model and even surpass the accuracy of the fine-tuned full-precision model on MathQA.

Compensation with different ranks

EoRA consistently outperforms the SVD-based baseline method, ZeroQuant-V2, across different ranks, with the improvement becoming slightly more pronounced at higher ranks. The results show that EoRA is robust across different rank settings, offering a more flexible option on top of existing compression configurations to effectively balance the trade-off between inference overhead and model accuracy.

Figure 3. Rank vs. accuracy on three datasets 

Quantization of EoRA

The EoRA module can also be quantized to further reduce the additional cost of the residual low-rank compensation paths.

Figure 4 shows that EoRA is robust to quantization: when EoRA is quantized, the accuracy drop from full-precision EoRA is insignificant, while the model size is significantly reduced. For example, when a 512-rank EoRA is quantized from 16 bits to 4 bits on 2:4-pruned Llama3-8B, the accuracy drop is only 0.43% on ARC-C, while the total model size is reduced by 16.5%.

Generally, we recommend that you quantize EoRA to 4 bits, as this significantly reduces inference latency and model size without causing any noticeable drop in accuracy.
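As a rough illustration of that recommendation, the sketch below fake-quantizes the EoRA matrices B and A with generic per-row symmetric 4-bit round-to-nearest. This is an assumed, simplified recipe for illustration, not necessarily the exact scheme used in the paper or in GPTQModel.

import numpy as np

def fake_quantize_4bit(w):
    # Per-row symmetric round-to-nearest quantization to the int4 range [-8, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale  # dequantized values, used to evaluate the accuracy impact

# b, a are the EoRA low-rank matrices; storing them in 4 bits shrinks the
# compensation path roughly 4x compared to 16-bit
# b_q, a_q = fake_quantize_4bit(b), fake_quantize_4bit(a)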

Figure 4. Quantizing EoRA of rank {128, 512} to 4/3-bit on compensating three types of compressed Llama3-8B models (2:4 pruned, 4-bit quantized, and 3-bit quantized).

Open-source impact

EoRA has been seamlessly integrated into the open-source library GPTQModel, the default LLM compression and quantization toolkit with accelerated inference support for both CPU and GPU through Hugging Face, vLLM, and SGLang.

This integration enables you to enhance the accuracy of your quantized models with EoRA as simply as toggling the feature on. Anyone quantizing models through Hugging Face, vLLM, or SGLang can use our EoRA work to improve overall model performance.

The following Python code example runs Quantization + EoRA Accuracy Recovery:

from gptqmodel import BACKEND, GPTQModel
from gptqmodel.adapter.adapter import Lora

eora = Lora(
    # for eora generation, path is adapter save path; for load, it is loading path
    path='GPTQModel/examples/eora/Llama-3.2-3B-4bits-eora_rank64_32',
    rank=32,
)

model = GPTQModel.load(
    model_id_or_path='USER_FOLDER/Llama-3.2-3B_4bits_128group_size',
    adapter=eora,
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)
print(f"Result: {result}")

Table 5 shows that EoRA substantially improves the accuracy of 3/4-bit quantized models on MMLU. The experiments are zero-shot, as the calibration dataset (C4) has no overlap with the testing dataset (MMLU). The accuracy boost percentage is the ratio of the quantized + EoRA model's score to the quantized model's score. For more information, see the /ModelCloud/GPTQModel GitHub repo.

| Bit-width | EoRA calibration set | EoRA rank | MMLU (↑) | MMLU accuracy boost |
|---|---|---|---|---|
| 16-bit (full precision) | N/A | N/A | 54.2 | N/A |
| 4-bit | N/A | N/A | 24.2 | N/A |
| 4-bit | C4 | 32 | 52.5 | 217% |
| 3-bit | N/A | N/A | 22.9 | N/A |
| 3-bit | C4 | 32 | 39.1 | 171% |
Table 5. MMLU results for quantized Llama 3.2-3B

This technique has also been adopted to significantly boost the accuracy of 2-bit quantized Qwen3 and Qwen2.5 models. For more information, see Boost 2-Bit LLM Accuracy with EoRA.

Summary

EoRA presents a scalable, versatile solution for model compensation, with potential applications across various domains where efficient deployment of large models or neural networks is crucial. 

The key strength of EoRA lies in its training-free nature, enabling rapid optimization using only a small calibration dataset, and its robustness to quantization, making it an effective tool for deploying large models with varying capacity requirements. EoRA provides a solid initialization for fine-tuning, further reducing accuracy degradation and, in some cases, surpassing the performance of uncompressed models. 

EoRA demonstrates significant improvements in language generation, commonsense reasoning, and mathematical reasoning tasks, outperforming traditional low-rank approximation techniques such as SVD. We hope EoRA can help NVIDIA efficiently and effectively boost the performance of compressed large models and benefit diverse applications in NVIDIA Metropolis, NVIDIA NeMo, NVIDIA NIM, NVIDIA TensorRT, computer vision, generative AI, robotics, and more.

For more information, see the following resources:
