Full fine-tuning (FT) is commonly employed to tailor general pretrained models for specific downstream tasks. To reduce the training cost, parameter-efficient fine-tuning (PEFT) methods have been introduced to fine-tune pretrained models with a minimal number of parameters. Among these, Low-Rank Adaptation (LoRA) and its variants have gained considerable popularity because they avoid additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning.
NVIDIA Research Taiwan and the NVIDIA Learning and Perception Research Group developed Weight-Decomposed Low-Rank Adaptation (DoRA), which could be the default replacement for LoRA. DoRA improves both the learning capacity and stability of LoRA, without introducing any additional inference overhead.
DoRA consistently outperforms LoRA across a wide variety of large language model (LLM) and vision language model (VLM) tasks, such as common-sense reasoning (+3.7/+1.0 on Llama 7B/13B, +2.9 on Llama 2 7B, and +4.4 on Llama 3 8B), Multi-Turn (MT) Benchmark (+0.4/+0.3 for Llama/Llama 2 7B), image/video-text understanding (+0.9/+1.9 on VL-BART), and visual instruction tuning (+0.6 on LLaVA 7B). DoRA has also been demonstrated on other tasks, including compression-aware LLMs and text-to-image generation. This work has been accepted to ICML 2024 as an oral paper (1.5% acceptance rate).
How does DoRA work?
DoRA begins by decomposing the pretrained weight into its magnitude and directional components and then fine-tunes both. Given the substantial size of the directional component in terms of parameters, DoRA exploits LoRA for directional adaptation to enable efficient fine-tuning, as illustrated in Figure 2. Finally, DoRA can be merged with the pretrained weight before inference, thereby avoiding the introduction of additional latency.
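To make the decomposition concrete, the following PyTorch sketch shows how a single linear layer could be adapted in the DoRA style: the pretrained weight is kept frozen, a learnable magnitude vector is initialized from its column-wise norms, and a low-rank LoRA pair updates the direction. The class name, rank, and initialization details here are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Minimal sketch of a DoRA-style linear layer (illustrative, not the official code)."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8):
        super().__init__()
        d_out, d_in = base_linear.weight.shape
        # Frozen pretrained weight W0 (and bias, if any).
        self.weight = nn.Parameter(base_linear.weight.detach().clone(), requires_grad=False)
        self.bias = None if base_linear.bias is None else nn.Parameter(
            base_linear.bias.detach().clone(), requires_grad=False)
        # Trainable magnitude vector m, initialized to the column-wise norm of W0.
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=0, keepdim=True))
        # LoRA factors B (d_out x r) and A (r x d_in) for the directional update.
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

    def merged_weight(self) -> torch.Tensor:
        # Directional component V + dV = W0 + B @ A, normalized column-wise,
        # then rescaled by the learned magnitude vector m.
        direction = self.weight + self.lora_B @ self.lora_A
        return self.magnitude * direction / direction.norm(p=2, dim=0, keepdim=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.merged_weight(), self.bias)
```

Because the magnitude and direction collapse back into a single dense matrix, the merged weight can be computed once after training and written into the model, which is why no extra latency is incurred at inference.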
How does DoRA affect model training?
To investigate how DoRA affects model training, the magnitude and directional differences (∆D, ∆M) between the DoRA weight W’ and the pretrained weight W0 are visualized in Figure 3, alongside those of FT and LoRA. In the figure, different markers represent matrices at different training steps and different colors represent the matrices of each layer. The regression lines for (∆D, ∆M) show that DoRA and FT are both characterized by a distinct negative slope, whereas LoRA exhibits a clear positive correlation.
In other words, DoRA can make substantial directional adjustments with relatively minimal changes in magnitude, or the reverse, and its learning pattern is closer to that of FT. This signifies its superior learning capacity over LoRA. For more qualitative and mathematical analyses, see DoRA: Weight-Decomposed Low-Rank Adaptation.
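As a rough illustration of how such an analysis can be reproduced, the sketch below computes the mean magnitude change and the mean directional (cosine) distance between a fine-tuned weight matrix and its pretrained counterpart. The definitions follow the paper's column-wise decomposition, but the function name and the reduction to scalar means are our own choices.

```python
import torch

def magnitude_direction_delta(w_tuned: torch.Tensor, w_pretrained: torch.Tensor):
    """Sketch of the (∆M, ∆D) metrics between a tuned weight and the pretrained weight."""
    # ∆M: mean absolute change of the column-wise magnitudes (L2 norm of each column).
    m_tuned = w_tuned.norm(p=2, dim=0)
    m_pre = w_pretrained.norm(p=2, dim=0)
    delta_m = (m_tuned - m_pre).abs().mean()

    # ∆D: mean cosine distance between the corresponding weight columns.
    cos = torch.nn.functional.cosine_similarity(w_tuned, w_pretrained, dim=0)
    delta_d = (1.0 - cos).mean()
    return delta_m.item(), delta_d.item()
```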
Performance
DoRA outperforms LoRA across a wide variety of models, including LLMs, VLMs, compressed LLMs, and diffusion models.
Large language models
DoRA significantly outperforms LoRA in overall commonsense reasoning accuracy, as shown in Table 1. Moreover, DoRA provides better conversation and instruction-following capabilities than LoRA, as demonstrated by the MT Benchmark scores in Table 2.
| Model | # Params (%) | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-3.5 | – | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
| Llama-LoRA | 0.83 | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| Llama-DoRA (Ours) | 0.84 | 69.7 | 83.4 | 78.6 | 87.2 | 81.0 | 81.9 | 66.2 | 79.2 | 78.4 |
| Llama 2-LoRA | 0.83 | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| Llama 2-DoRA (Ours) | 0.84 | 72.0 | 83.1 | 79.9 | 89.1 | 83.0 | 84.5 | 71.0 | 81.2 | 80.5 |
| Llama 3-LoRA | 0.83 | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| Llama 3-DoRA (Ours) | 0.84 | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |

Table 1. Commonsense reasoning results for LoRA and DoRA on Llama, Llama 2, and Llama 3
| Model | # Params (%) | Score |
| --- | --- | --- |
| Llama-LoRA | 2.31 | 5.1 |
| Llama-DoRA (Ours) | 2.33 | 5.5 |
| Llama-VeRA | 0.02 | 4.3 |
| Llama-DVoRA (Ours) | 0.04 | 5.0 |
| Llama 2-LoRA | 2.31 | 5.7 |
| Llama 2-DoRA (Ours) | 2.33 | 6.0 |
| Llama 2-VeRA | 0.02 | 5.5 |
| Llama 2-DVoRA (Ours) | 0.04 | 6.0 |

Table 2. MT Benchmark scores for LoRA, DoRA, VeRA, and DVoRA on Llama and Llama 2
Vision language models
Beyond pure natural language processing (NLP), DoRA also outperforms LoRA on image-text understanding (Table 3), video-text understanding (Table 4), and visual instruction tuning (Table 5).
| Model | # Params (%) | VQAv2 | GQA | NLVR2 | COCO Cap. | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| VLBART-LoRA | 5.93 | 65.2 | 53.6 | 71.9 | 115.3 | 76.5 |
| VLBART-DoRA (Ours) | 5.96 | 65.8 | 54.7 | 73.1 | 115.9 | 77.4 |

Table 3. Image-text understanding results for LoRA and DoRA on VL-BART
| Model | # Params (%) | TVQA | How2QA | TVC | YC2C | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| VLBART-LoRA | 5.17 | 75.5 | 72.9 | 44.6 | 140.9 | 83.5 |
| VLBART-DoRA (Ours) | 5.19 | 76.3 | 74.1 | 45.8 | 145.4 | 85.4 |

Table 4. Video-text understanding results for LoRA and DoRA on VL-BART
| Model | # Params (%) | VQAv2 | GQA | VizWiz | SQA | VQAT | POPE | MMBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-LoRA | 4.61 | 79.1 | 62.9 | 47.8 | 68.4 | 58.2 | 86.4 | 66.1 | 66.9 |
| LLaVA-DoRA (Ours) | 4.63 | 78.6 | 62.9 | 52.2 | 69.9 | 57.0 | 87.2 | 66.1 | 67.6 |

Table 5. Visual instruction tuning results for LoRA and DoRA on LLaVA 7B
Compression-aware LLMs
To further decrease the memory demands of PEFT, QLoRA proposes quantizing the pretrained model to 4-bit and fine-tuning LoRA on top of the frozen low-bit backbone. Given that DoRA narrows the gap between LoRA and FT, it is natural to explore whether DoRA can also improve accuracy over LoRA within the QLoRA framework.
Recently, our team collaborated with several researchers at Answer.AI on their QDoRA project, which substitutes the LoRA component in QLoRA with DoRA. The results show that QDoRA outperforms both FT and QLoRA on Llama 2 and Llama 3 (Figure 4).
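For readers who want to experiment with a similar setup, the hedged sketch below pairs a 4-bit quantized backbone with a DoRA adapter using Hugging Face Transformers and PEFT. This is not the Answer.AI QDoRA code; the model ID, target modules, and hyperparameters are placeholders, and DoRA support on quantized layers depends on your PEFT and bitsandbytes versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization config for the frozen backbone.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)

# use_dora=True switches the PEFT adapter from plain LoRA to DoRA.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder selection
    use_dora=True,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```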
Text-to-image generation
DoRA can also be applied to DreamBooth for text-to-image personalization with the advanced training scripts developed by Hugging Face. Test results on the challenging 3d_icon and lego_set datasets show that DoRA achieves significantly better personalization than LoRA under the same training configuration (Figure 5).
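As one hedged example of using the resulting adapter, a DoRA checkpoint saved in the PEFT/LoRA format can typically be loaded into a diffusers pipeline through the same API as a LoRA checkpoint. The checkpoint path and prompt below are placeholders, and DoRA loading requires a sufficiently recent diffusers/PEFT install.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the base text-to-image model.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# DoRA adapters in the PEFT/LoRA format load through the same call as LoRA.
pipe.load_lora_weights("path/to/dora-adapter")  # placeholder path

# Generate an image with the personalized adapter applied.
image = pipe("a 3d icon of an astronaut", num_inference_steps=30).images[0]
image.save("dora_sample.png")
```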
Summary
DoRA is a generally efficient and effective fine-tuning technique that will soon be supported by various NVIDIA services, platforms, and frameworks. It is compatible with LoRA and its variants and exhibits a learning behavior closer to FT. DoRA consistently outperforms LoRA across various fine-tuning tasks and model architectures. Moreover, DoRA can be considered a costless replacement for LoRA, as its decomposed magnitude and direction components can be merged back into the pretrained weight after training, ensuring that there is no extra inference overhead. We hope DoRA can help NVIDIA effectively adapt various foundation models to diverse applications in NVIDIA Metropolis, NVIDIA NeMo, NVIDIA NIM, NVIDIA TensorRT, audiovisual, robotics, generative AI, and more.
Check out these resources to learn more: