Introducing DoRA, a High-Performing Alternative to LoRA for Fine-Tuning

Full fine-tuning (FT) is commonly employed to tailor general pretrained models for specific downstream tasks. To reduce the training cost, parameter-efficient fine-tuning (PEFT) methods have been introduced to fine-tune pretrained models with a minimal number of parameters. Among these, Low-Rank Adaptation (LoRA) and its variants have gained considerable popularity because they avoid additional inference costs. However, an accuracy gap often remains between these methods and full fine-tuning.

NVIDIA Research Taiwan and the NVIDIA Learning and Perception Research Group developed Weight-Decomposed Low-Rank Adaptation (DoRA), which could be the default replacement for LoRA. DoRA improves both the learning capacity and stability of LoRA, without introducing any additional inference overhead. 

DoRA consistently outperforms LoRA across a wide variety of large language model (LLM) and vision language model (VLM) tasks, such as commonsense reasoning (+3.7/+1.0 on Llama 7B/13B, +2.9 on Llama 2 7B, and +4.4 on Llama 3 8B), MT-Bench (+0.4/+0.3 for Llama/Llama 2 7B), image/video-text understanding (+0.9/+1.9 on VL-BART), and visual instruction tuning (+0.7 on LLaVA 7B). DoRA has also been demonstrated in other tasks, including compression-aware LLM fine-tuning and text-to-image generation. This work has been accepted to ICML 2024 as an oral paper (1.5% acceptance rate).

Diagram showing that DoRA consistently outperforms LoRA on various tasks (LLM, VLM, LVLM) and backbones (Llama 2 and 3).
Figure 1. Comparison of DoRA and LoRA on various tasks and backbones

How does DoRA work?

DoRA first decomposes the pretrained weight into its magnitude and directional components, then fine-tunes both. Because the directional component accounts for most of the parameters, DoRA applies LoRA to the directional update to keep fine-tuning efficient, as illustrated in Figure 2. After training, the DoRA components can be merged back into the pretrained weight before inference, so no additional latency is introduced.

Diagram of proposed DoRA, which decomposes the pretrained weight into magnitude and direction components for fine-tuning, especially with LoRA to efficiently update the direction component.
Figure 2. An overview of DoRA
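
The decomposition and merge described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: it assumes the recomposed weight is W' = m * (W0 + BA) / ||W0 + BA||_c with a column-wise norm, and the toy sizes, variable names, and the hand-made "training" perturbation are all hypothetical.

```python
import numpy as np


def column_norms(w):
    """L2 norm of every column of w, returned with shape (1, k)."""
    return np.linalg.norm(w, axis=0, keepdims=True)


def dora_weight(w0, magnitude, lora_b, lora_a):
    """Recompose W' = m * (W0 + B @ A) / ||W0 + B @ A||_c (assumed form)."""
    direction = w0 + lora_b @ lora_a          # low-rank directional update
    return magnitude * direction / column_norms(direction)


rng = np.random.default_rng(0)
d, k, r = 6, 4, 2                             # hypothetical toy sizes

w0 = rng.normal(size=(d, k))                  # frozen pretrained weight
magnitude = column_norms(w0)                  # trainable, initialized from W0
lora_b = np.zeros((d, r))                     # trainable, zero-initialized
lora_a = rng.normal(size=(r, k))              # trainable

# Before any update, the recomposed weight equals the pretrained weight.
w_init = dora_weight(w0, magnitude, lora_b, lora_a)

# Stand-in for training: perturb the trainable parts by hand.
magnitude_t = magnitude * 1.1
lora_b_t = rng.normal(size=(d, r)) * 0.1

# Merge once after training; inference then uses a single dense matrix.
w_merged = dora_weight(w0, magnitude_t, lora_b_t, lora_a)

x = rng.normal(size=(k,))
y_merged = w_merged @ x

# The same output from the unmerged (decomposed) form, for comparison.
direction_t = w0 + lora_b_t @ lora_a
x_scaled = (magnitude_t / column_norms(direction_t)).ravel() * x
y_unmerged = w0 @ x_scaled + lora_b_t @ (lora_a @ x_scaled)
```

Because `w_merged` has the same shape as `w0`, it can replace the pretrained weight directly, which is why the merged model runs at the same cost as the original.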

How does DoRA affect model training? 

To investigate how DoRA affects model training, Figure 3 visualizes the magnitude and directional differences (∆D, ∆M) between the DoRA weight W' and the pretrained weight W0, alongside the same differences for FT and LoRA. The regression lines over (∆D, ∆M) show a distinct negative slope for both DoRA and FT, whereas LoRA shows a clear positive correlation. Different markers represent matrices from different training steps, and different colors represent the matrices of each layer.

Figure shows magnitude and direction updates of FT, LoRA, and DoRA of the query matrices across different layers and intermediate steps. DoRA and FT show a distinct negative slope while LoRA shows a clear positive correlation, indicating that DoRA has a learning capacity closely resembling FT.
Figure 3. Magnitude and direction updates of FT, LoRA, and DoRA
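
As a rough illustration of what the two axes measure, the sketch below computes a (∆D, ∆M) pair for one weight matrix. It assumes ∆M is the mean absolute change in column norms and ∆D is the mean of one minus the per-column cosine similarity; the function names and test matrices are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def decompose(w):
    """Split w into column magnitudes (k,) and unit-norm column directions (d, k)."""
    magnitude = np.linalg.norm(w, axis=0)
    return magnitude, w / magnitude


def magnitude_direction_change(w0, w_tuned):
    """Return (delta_d, delta_m) between a pretrained and a tuned weight."""
    m0, dir0 = decompose(w0)
    m1, dir1 = decompose(w_tuned)
    delta_m = float(np.mean(np.abs(m1 - m0)))
    cosine = np.sum(dir0 * dir1, axis=0)      # per-column cosine similarity
    delta_d = float(np.mean(1.0 - cosine))
    return delta_d, delta_m


rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 5))

# Pure rescaling: magnitudes change, directions do not.
scaled = 1.5 * w0

# Pure rotation: an orthogonal matrix preserves column norms.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
rotated = q @ w0

dd_scaled, dm_scaled = magnitude_direction_change(w0, scaled)
dd_rotated, dm_rotated = magnitude_direction_change(w0, rotated)
```

The two synthetic updates sit at opposite extremes: rescaling gives a magnitude change with essentially no directional change, and rotation gives the reverse.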

DoRA can make substantial directional adjustments with relatively small changes in magnitude, or the reverse, and its learning pattern is closer to that of FT. This indicates a learning capacity superior to that of LoRA. For more qualitative and mathematical analyses, see DoRA: Weight-Decomposed Low-Rank Adaptation.

Performance

DoRA outperforms LoRA across a wide variety of models, including LLMs, VLMs, compressed LLMs, and diffusion models.

Large language models 

DoRA significantly outperforms LoRA in overall commonsense reasoning ability, as shown in Table 1. Moreover, DoRA can provide better conversation and instruction-following capabilities than LoRA, as demonstrated by the MT-Bench results in Table 2.

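| Model | # Params (%) | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT-3.5 | - | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
| Llama-LoRA | 0.83 | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| Llama-DoRA (Ours) | 0.84 | 69.7 | 83.4 | 78.6 | 87.2 | 81.0 | 81.9 | 66.2 | 79.2 | 78.4 |
| Llama 2-LoRA | 0.83 | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| Llama 2-DoRA (Ours) | 0.84 | 72.0 | 83.1 | 79.9 | 89.1 | 83.0 | 84.5 | 71.0 | 81.2 | 80.5 |
| Llama 3-LoRA | 0.83 | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| Llama 3-DoRA (Ours) | 0.84 | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |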
Table 1. Comparison of LoRA and DoRA on the commonsense reasoning benchmark
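
The slightly higher parameter percentage for DoRA in Table 1 is consistent with DoRA adding one trainable magnitude value per column of each adapted weight on top of the LoRA factors. The arithmetic below is a sketch under that assumption; the layer size and rank are hypothetical and are not meant to reproduce the table's exact percentages.

```python
def lora_trainable_params(d, k, r):
    """LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


def dora_trainable_params(d, k, r):
    """LoRA factors plus one magnitude scalar per column (assumed 1 x k)."""
    return lora_trainable_params(d, k, r) + k


d, k, r = 4096, 4096, 32                     # hypothetical layer size and rank

lora_count = lora_trainable_params(d, k, r)  # 262144
dora_count = dora_trainable_params(d, k, r)  # 266240
extra_fraction = (dora_count - lora_count) / lora_count  # 0.015625
```

For this configuration the magnitude vector adds about 1.6% to the LoRA parameter count, a small overhead relative to the size of the low-rank factors.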

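| Model | # Params (%) | Score |
|---|---|---|
| Llama-LoRA | 2.31 | 5.1 |
| Llama-DoRA (Ours) | 2.33 | 5.5 |
| Llama-VeRA | 0.02 | 4.3 |
| Llama-DVoRA (Ours) | 0.04 | 5.0 |
| Llama 2-LoRA | 2.31 | 5.7 |
| Llama 2-DoRA (Ours) | 2.33 | 6.0 |
| Llama 2-VeRA | 0.02 | 5.5 |
| Llama 2-DVoRA (Ours) | 0.04 | 6.0 |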
Table 2. Comparison of LoRA and DoRA on MT-Bench (scored by GPT-4). DVoRA is obtained by integrating DoRA with VeRA

Vision language models 

In addition to pure natural language processing (NLP), DoRA also outperforms LoRA in terms of image-text understanding (Table 3), video-text understanding (Table 4), and visual instruction tuning (Table 5) abilities.

| Model | # Params (%) | VQAv2 | GQA | NLVR2 | COCO Cap. | Avg. |
|---|---|---|---|---|---|---|
| VLBART-LoRA | 5.93 | 65.2 | 53.6 | 71.9 | 115.3 | 76.5 |
| VLBART-DoRA (Ours) | 5.96 | 65.8 | 54.7 | 73.1 | 115.9 | 77.4 |
Table 3. Comparison of LoRA and DoRA on image-text understanding tasks

| Model | # Params (%) | TVQA | How2QA | TVC | YC2C | Avg. |
|---|---|---|---|---|---|---|
| VLBART-LoRA | 5.17 | 75.5 | 72.9 | 44.6 | 140.9 | 83.5 |
| VLBART-DoRA (Ours) | 5.19 | 76.3 | 74.1 | 45.8 | 145.4 | 85.4 |
Table 4. Comparison of LoRA and DoRA on video-text understanding tasks

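| Model | # Params (%) | VQAv2 | GQA | Vis-Wiz | SQA | VQA^T | POPE | MMBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-LoRA | 4.61 | 79.1 | 62.9 | 47.8 | 68.4 | 58.2 | 86.4 | 66.1 | 66.9 |
| LLaVA-DoRA (Ours) | 4.63 | 78.6 | 62.9 | 52.2 | 69.9 | 57.0 | 87.2 | 66.1 | 67.6 |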
Table 5. Comparison of LoRA and DoRA on visual instruction tuning tasks

Compression-aware LLMs

To further reduce the memory demands of PEFT, QLoRA proposes quantizing the pretrained model to 4 bits and fine-tuning LoRA on top of the frozen low-bit backbone. Because DoRA narrows the gap between LoRA and FT, it is natural to ask whether DoRA can also improve on LoRA's accuracy within the QLoRA framework.

Recently, our team collaborated with several researchers at Answer.AI on their QDoRA project, which replaces the LoRA component in QLoRA with DoRA. The results show that QDoRA outperforms both FT and QLoRA on Llama 2 and Llama 3 (Figure 4).

Graph showing that QDoRA significantly outperforms QLoRA on the Math Problem Benchmark, Orca-Math, with either Llama2 or Llama3 backbone. QDoRA+Llama2 has comparable results with QLoRA+Llama3. Moreover, QDoRA outperforms FT, which requires much larger memory.
Figure 4. Accuracy comparison of QDoRA and other methods on the Orca-Math dataset, using 100K training samples
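
The idea of placing DoRA on a frozen low-bit backbone can be sketched as follows. This is a simplified illustration and not the Answer.AI implementation: it swaps in a plain symmetric 4-bit absmax quantizer for the NF4 data type used by QLoRA, and it assumes the magnitude is initialized from the dequantized weight. All names and sizes are hypothetical.

```python
import numpy as np


def quantize_absmax_4bit(w):
    """Simplified symmetric 4-bit quantization, one scale per column.

    A stand-in for the NF4 format used by QLoRA; this is not NF4.
    """
    scale = np.max(np.abs(w), axis=0, keepdims=True) / 7.0
    codes = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return codes, scale


def dequantize(codes, scale):
    return codes.astype(np.float32) * scale


rng = np.random.default_rng(0)
d, k, r = 8, 6, 2                                   # hypothetical toy sizes

w0 = rng.normal(size=(d, k)).astype(np.float32)
codes, scale = quantize_absmax_4bit(w0)             # frozen low-bit backbone
w_q = dequantize(codes, scale)

# Trainable DoRA parts, kept in higher precision (assumed initialization).
magnitude = np.linalg.norm(w_q, axis=0, keepdims=True)
lora_b = np.zeros((d, r), dtype=np.float32)
lora_a = rng.normal(size=(r, k)).astype(np.float32)


def qdora_forward(x):
    """Dequantize the base weight, apply the DoRA update, then project x."""
    direction = dequantize(codes, scale) + lora_b @ lora_a
    norm = np.linalg.norm(direction, axis=0, keepdims=True)
    return (magnitude * direction / norm) @ x


x = rng.normal(size=(k,)).astype(np.float32)
y = qdora_forward(x)
```

Only `magnitude`, `lora_b`, and `lora_a` would receive gradients in this setup; the 4-bit codes and their scales stay fixed.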

Text-to-image generation

DoRA can also be applied to DreamBooth for text-to-image personalization using the advanced training scripts developed by Hugging Face. Results on the challenging 3d_icon and lego_set datasets show that DoRA obtains significantly better personalization results than LoRA under the same training configurations (Figure 5).

Two sets of images showing that on the challenging 3d_icon and lego_set datasets, DoRA can obtain significantly better personalization results than LoRA under the same DreamBooth training configurations.
Figure 5. Personalization results using DreamBooth plus DoRA on the challenging 3D Icon (top) and Lego (bottom) datasets

Summary

DoRA is a generally efficient and effective training technique and will be supported soon by various NVIDIA services, platforms, and frameworks. DoRA is a fine-tuning method that is compatible with LoRA and its variants and exhibits a closer resemblance to FT learning behavior. DoRA consistently outperforms LoRA across various fine-tuning tasks and model architectures. Moreover, DoRA can be considered a cost-free replacement for LoRA at inference time, as its decomposed magnitude and direction components can be merged back into the pretrained weight after training, so there is no extra inference overhead. We hope DoRA can help NVIDIA effectively adapt various foundation models to diverse applications in NVIDIA Metropolis, NVIDIA NeMo, NVIDIA NIM, NVIDIA TensorRT, audiovisual, robotics, generative AI, and more.
