The latest state-of-the-art foundation large language models (LLMs) have billions of parameters and are pretrained on trillions of tokens of input text. They often achieve striking results on a wide variety of use cases without any need for customization. Despite this, studies have shown that the best accuracy on downstream tasks can be achieved by adapting LLMs with high-quality, domain-specific datasets.
In many cases, smaller customized models can match or even outperform larger generic LLMs while offering significantly lower deployment costs. However, customizing models for specific downstream tasks can bring significant challenges, during both creation and deployment.
Full fine-tuning (that is, updating all parameters of the model) for the largest LLMs can be difficult due to the amount of computational infrastructure required to learn across the whole model. Infrastructure costs are also increased at deployment time, where users are required to either host multiple large models in memory or tolerate increased latency as entire models are swapped in and out. Low-rank adaptation (LoRA) is a technique for mitigating both of these issues.
This post provides a brief overview of LoRA and explains the two ways to deploy LoRA fine-tuned models. We also discuss our approach to heterogeneous deployment of a swarm of LoRA adapters over a single base model, which enables mixed-batch inference requests.
Low-rank adaptation
In the past few years, LoRA has emerged as a popular technique that tunes a very small number of additional parameters, as compared to full fine-tuning. These additional parameters, called the LoRA adapter, represent the low-rank decomposition of the changes in the dense layers of the network. LoRA operates on the observation that LLMs are overparameterized, and that newly learned information during fine-tuning has a low “intrinsic rank.” In other words, the effective changes in the model parameters are confined to a lower-dimensional subspace of the entire, very high-dimensional parameter space. With LoRA, it’s possible to reduce the number of trainable parameters by 10,000x.
Figure 1 depicts the core idea behind LoRA:
- The weights of the pretrained model (W) are frozen during customization
- Instead of updating W, two smaller trainable matrices A and B are injected, which learn task-specific information. The matrix multiplication B*A forms a matrix with the same dimensions as W, thus it can be added to W (= W + BA).
The rank (r) of the A and B matrices is a small value, such as 8 or 16. Cumulatively, they have far fewer trainable parameters than W, which makes customization computationally and memory efficient. The rank is typically configurable at training time.
There is a tradeoff between rank size and computational efficiency. A larger rank enables better expressivity, so the model can capture more patterns relevant to the downstream task. Very high rank values (such as 64) approach the learning capacity of full supervised fine-tuning, that is, updating all the parameters in the model. On the downside, larger ranks are also more expensive to train and serve, in terms of both memory and compute. In practice, LoRA fine-tuning with a rank as small as 8 is already very effective, and is a good starting point for a variety of downstream tasks.
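To make the idea concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. The class name, initialization choices, and scaling are illustrative only, not the NeMo or PEFT APIs.

```python
import torch
import torch.nn as nn

# Minimal sketch of a LoRA-augmented linear layer (illustrative only).
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # W (and bias) stay frozen
        # Trainable low-rank factors: the update B @ A has the same shape as W.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init => W + BA == W at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the scaled low-rank update.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: wrap one projection of a transformer block.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
out = layer(torch.randn(2, 4096))                     # only lora_A and lora_B receive gradients
```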
Deploying a LoRA-tuned model
A LoRA fine-tuned model can be deployed in one of two ways.
Option 1: Merging the LoRA adapter
The additional LoRA weights can be merged with the pretrained model to create a purpose-built variant that is structurally equivalent to its predecessor. This avoids the additional inference latency of managing the adapter separately. Merging weights is the simpler approach, but it is less flexible: the whole model becomes “bespoke” and can serve only the one task it was fine-tuned for. This makes it difficult to batch together inputs for different tasks for deployment efficiency. Merging is recommended only if you plan to serve a single task per deployment.
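For illustration, merging an adapter trained with the Hugging Face PEFT library might look like the following sketch; the base model name and adapter path are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model and attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")

# Fold the scaled low-rank update (BA) into the frozen weights. The result is
# structurally identical to the base model, but serves only this one task.
merged = model.merge_and_unload()
merged.save_pretrained("llama3-8b-merged-lora")
```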
Option 2: Dynamically loading the LoRA adapter
LoRA adapters (A and B in Figure 1) are kept separate from the base model (W). At inference time, the runtime dynamically loads the adapter weights that correspond to each incoming request. This enables flexibility in serving and batching inputs from various tasks concurrently, making the best use of the available compute without maintaining separate custom models.
Some use cases require several, or even hundreds or thousands, of LoRA adapters over the same base model. For these, dynamic LoRA adapter selection is a better path (a conceptual sketch follows the examples below). Examples include:
- Enterprises serving personalized models for their customers, for example to provide recommendations or to adapt to specific personas and preferences.
- A/B testing to compare various LoRA fine-tunes for the same use case.
- Enterprises serving multiple downstream use cases based on the same base foundation model. For example, IT service teams deploying a multi-LoRA setup for bug summarization, ticket routing and classification, implementing chatbots and knowledge retrieval over specific document corpuses, root cause analysis, and more.
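The following is a conceptual sketch of dynamic adapter selection, not NIM internals: each request names a LoRA adapter, and the serving process loads that adapter's low-rank matrices on first use while the base model weights stay shared. The adapter names and file paths here are hypothetical.

```python
import torch

# Hypothetical adapter store mapping adapter names to saved low-rank weights.
ADAPTER_STORE = {
    "bug-summarization-lora": "store/bug_summarization.pt",
    "ticket-routing-lora": "store/ticket_routing.pt",
}
_cache: dict[str, dict[str, torch.Tensor]] = {}

def get_adapter(name: str) -> dict[str, torch.Tensor]:
    """Return the low-rank matrices for a named adapter, loading them on first use."""
    if name not in _cache:
        _cache[name] = torch.load(ADAPTER_STORE[name])  # e.g. {"lora_A": ..., "lora_B": ...}
    return _cache[name]

def handle(request: dict) -> dict[str, torch.Tensor]:
    # A request carries a prompt plus the adapter it wants, e.g.
    # {"prompt": "Summarize this bug report ...", "adapter": "bug-summarization-lora"}
    return get_adapter(request["adapter"])
```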
NVIDIA NIM offers optimized inference microservices that support such dynamic loading of LoRA adapters and allow sending mixed-batch requests. The following sections take a deeper look at our approach.
Heterogeneous, multiple LoRA deployment with NVIDIA NIM
With NIM, each inference microservice is associated with a single foundation model. This model can have any number of “customizations” in the form of low-rank adapters associated with it.
- Adapters, trained using either the NVIDIA NeMo framework or the Hugging Face PEFT library, are placed into an adapter store and given a unique name.
- When making a request to the NIM, clients can specify that they want a particular customization by including the LoRA model name (see the example after this list).
- When NIM receives a request for some customized model, it will pull the associated adapter from the adapter store into a multi-tier cache. Some adapters are resident in GPU memory and some in host memory, depending on how recently they were used.
- During execution, NIM will run specialized GPU kernels that let data flow through both the foundation model and multiple different low-rank adapters simultaneously. This enables it to respond to requests for multiple different custom models at the same time.
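For example, a client might select a customization by name when calling the NIM OpenAI-compatible completions endpoint, as in the sketch below. The URL and the adapter name are placeholders.

```python
import requests

# The adapter is requested through the "model" field of the
# OpenAI-compatible API; "llama3-8b-ticket-routing" is a placeholder name.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "llama3-8b-ticket-routing",   # LoRA adapter name from the adapter store
        "prompt": "Classify the following IT ticket: ...",
        "max_tokens": 128,
    },
)
print(response.json())
```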
Handling a mixed batch of requests
The requests in one batch might use different LoRA adapters to support different tasks. Therefore, one traditional General Matrix Multiplication (GEMM) can’t be used to compute all the requests together. Computing them one-by-one sequentially would lead to significant additional overhead. To solve this problem, we used NVIDIA CUTLASS to implement a batched GEMM to fuse batched, heterogeneous request processing into a single kernel. This improves GPU utilization and performance.
Each adapter has two matrix components, A (shaped d-by-r) and B (shaped r-by-d), as seen in Figure 1. We found that the GPU utilization of the batched GEMM is not sufficiently high for the first matrix of each adapter, because it has a very large input dimension and a small output dimension. Since d is typically much larger than the LoRA rank r, we applied the splitK method to split this GEMM into tiles across more streaming multiprocessors (SMs), improving GPU utilization, followed by an additional reduction kernel that combines the partial results of the splitK batched GEMM.
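The following PyTorch sketch shows, at a conceptual level, what the fused kernel computes for a mixed batch: the base GEMM is shared by every request, while the per-request adapter matrices are gathered and applied with batched matrix multiplies. The dimensions and adapter count are arbitrary, and this is not the CUTLASS implementation itself.

```python
import torch

d, r, batch, num_adapters = 4096, 8, 4, 3            # arbitrary sizes for illustration

W = torch.randn(d, d)                                # shared, frozen base weight
lora_down = torch.randn(num_adapters, r, d) * 0.01   # per-adapter down-projection (d -> r)
lora_up = torch.zeros(num_adapters, d, r)            # per-adapter up-projection (r -> d)

x = torch.randn(batch, d)                            # one token per request, for simplicity
adapter_ids = torch.tensor([0, 2, 0, 1])             # which adapter each request uses

base_out = x @ W.T                                                # one shared GEMM for the whole batch
down_sel, up_sel = lora_down[adapter_ids], lora_up[adapter_ids]   # gather per-request adapters
mid = torch.bmm(down_sel, x.unsqueeze(-1))                        # "first" GEMM: large input d, small output r
lora_out = torch.bmm(up_sel, mid).squeeze(-1)                     # "second" GEMM back up to d
y = base_out + lora_out                                           # (batch, d) mixed-batch result
```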
Best practices for performance benchmarking
Evaluating the latency and throughput performance of such a multi-LoRA deployment is nontrivial. In this section, we discuss several major considerations generally worth looking at when benchmarking the performance of an LLM LoRA inference framework.
- Base model: Both small and large models can be used as base models for LoRA fine-tuning and inference, such as Llama 3 8B and Llama 3 70B. Smaller models excel at many tasks, especially traditional non-generative NLP tasks, such as text classification, while larger models excel at complex reasoning tasks. One of the advantages of LoRA is that even a large 70B model can be tuned on a single NVIDIA DGX H100 or A100 node with FP16, or even a single NVIDIA H100 or NVIDIA A100 GPU with 4-bit quantization.
- Adapters: In practice, from the end user’s point of view, it’s desirable to have the flexibility to experiment and select the rank that yields the best accuracy. System operators, on the other hand, may want to enforce a fixed rank uniformly, since uniform LoRA ranks enable better batching and hence better performance. Popular choices for LoRA rank are 8, 16, 32, and 64.
- Test parameters: Several other test parameters to be considered for benchmarking include:
  - Output length control: The `ignore_eos` parameter tells the inference framework to continue generating text until it reaches the `max_token_length` limit. This ensures the use case OSL (output sequence length) specification is met. This parameter is increasingly supported by LLM inference frameworks and significantly simplifies benchmarking setup. Notably, with `ignore_eos` you don’t have to train on “real” tasks for performance profiling purposes.
  - System load: Concurrency (number of concurrent users) is commonly used to drive load into the system. This should reflect real use cases, while also taking into account the max “batch size” that the system can effectively serve concurrently. For an 8B model on one GPU, consider up to 250 concurrent users for a realistic server load.
  - Task type: Both generative and non-generative tasks should be considered. These differ in the ISL (input sequence length) and OSL. ISL in the [200, 2000] token range and OSL in the [1, 2000] token range reflect a wide range of LLM applications, from text classification and summarization to translation and code generation.
- Tooling: The benchmarking tool should support calling the LoRA models. GenAI-Perf is an LLM benchmarking tool designed with LoRA support. Adapters can be called uniformly at random, in a round-robin fashion, or following a distribution that reflects real usage patterns (for example, 20% of adapters accounting for 80% of requests).
- Metrics: In the LLM domain, the main metrics are latency, measured as TTFT (time to first token) and ITL (inter-token latency), and throughput, measured as TPS (total system tokens per second). Other supplementary metrics include total requests per second and end-to-end request latency.
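As a simple illustration, the sketch below derives mean TTFT, mean ITL, and system TPS from per-request token timestamps collected during a benchmark run; the timestamp values are made up purely for illustration.

```python
import statistics

requests = [
    # (request start time, [timestamp of each generated token]), in seconds
    (0.00, [0.25, 0.28, 0.31, 0.34]),
    (0.05, [0.40, 0.43, 0.46]),
]

ttfts = [tokens[0] - start for start, tokens in requests]                      # time to first token
itls = [b - a for _, tokens in requests for a, b in zip(tokens, tokens[1:])]   # inter-token latency
total_tokens = sum(len(tokens) for _, tokens in requests)
wall_clock = max(t[-1] for _, t in requests) - min(start for start, _ in requests)
tps = total_tokens / wall_clock                                                # total system tokens/s

print(f"mean TTFT: {statistics.mean(ttfts):.3f} s")
print(f"mean ITL:  {statistics.mean(itls):.3f} s")
print(f"TPS:       {tps:.1f} tokens/s")
```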
Compared to serving a base model (or merged LoRA model), the addition of dynamic LoRAs (a single LoRA, multiple LoRAs of the same rank, or multiple LoRAs of different ranks) induces progressively higher cost, in both latency and throughput. Ideally, this cost should be reasonable in exchange for the improved accuracy and flexibility that dynamic LoRAs provide.
In the coming weeks and months, we’ll have more to share on the performance characteristics of NIM when serving LoRA.
What’s next
There are exciting new enhancements to LoRA in research that aim to improve the efficiency or accuracy of fine-tuned models. Our future direction includes incorporating these into NIM.
Tied-LoRA
Tied-LoRA is a novel technique from NVIDIA Research that increases the parameter efficiency of LoRA. In LoRA, task-specific low-rank matrices are added that approximate the weight updates for each layer of the LLM. In Tied-LoRA, these low-rank matrices are shared (“tied”) between the various layers, further reducing the number of trainable parameters. Additionally, this technique allows selective training or freezing of the different components of LoRA (the low-rank matrices and scaling vectors), enabling the user to experiment with trade-offs between performance and parameter efficiency.
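The sketch below illustrates the weight-tying idea at a high level, assuming one shared pair of low-rank matrices plus small per-layer scaling vectors; the exact parameterization, and which pieces are trained or frozen, follow the Tied-LoRA paper and may differ from this simplification.

```python
import torch
import torch.nn as nn

class TiedLoRA(nn.Module):
    def __init__(self, d: int, rank: int, num_layers: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d) * 0.01)    # shared across layers
        self.B = nn.Parameter(torch.zeros(d, rank))           # shared across layers
        self.u = nn.Parameter(torch.ones(num_layers, rank))   # per-layer scaling (rank-sized)
        self.v = nn.Parameter(torch.ones(num_layers, d))      # per-layer scaling (d-sized)

    def delta(self, x: torch.Tensor, layer: int) -> torch.Tensor:
        # The per-layer update reuses the shared projections, scaled by this
        # layer's vectors, so only the tiny u and v differ between layers.
        down = (x @ self.A.T) * self.u[layer]     # (batch, rank)
        return (down @ self.B.T) * self.v[layer]  # (batch, d)

tied = TiedLoRA(d=4096, rank=8, num_layers=32)
update = tied.delta(torch.randn(2, 4096), layer=5)
```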
Support for this method with NVIDIA NIM is planned for future releases.
DoRA
DoRA, another technique developed by NVIDIA Research, bridges the performance gap between fully fine-tuned models and LoRA tuning. It achieves this by decomposing the pretrained weights into two components: magnitude and direction. For fine-tuning, DoRA uses LoRA specifically for the directional updates, keeping the number of trainable parameters small. This approach enhances the learning capacity and training stability of LoRA without incurring additional inference overhead. DoRA consistently outperforms LoRA in fine-tuning models like LLaMA, LLaVA, and VL-BART across various downstream tasks, including commonsense reasoning, visual instruction tuning, and image and video-text understanding.
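At a high level, the decomposition can be sketched as follows, with LoRA applied only to the direction component; initialization and training details from the DoRA paper are omitted, so treat this as a simplified illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        W = base.weight.detach()                              # frozen pretrained weight (out x in)
        self.register_buffer("W0", W)
        self.m = nn.Parameter(W.norm(dim=0, keepdim=True))    # trainable per-column magnitude
        self.A = nn.Parameter(torch.randn(rank, W.shape[1]) * 0.01)
        self.B = nn.Parameter(torch.zeros(W.shape[0], rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        directional = self.W0 + self.B @ self.A                       # LoRA updates only the direction
        directional = directional / directional.norm(dim=0, keepdim=True)
        return x @ (self.m * directional).T                           # recombine magnitude and direction

dora = DoRALinear(nn.Linear(4096, 4096), rank=8)
out = dora(torch.randn(2, 4096))
```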
Conclusion
NVIDIA NIM enables you to seamlessly deploy and scale multiple LoRA adapters. NIM is generally available now, starting with support for Meta Llama 3 8B and Llama 3 70B, and LoRA adapters in both NVIDIA NeMo and Hugging Face model formats. We’re committed to adding support for additional state-of-the-art community models in future releases.
To get started with multi-LoRA in NIM, check out the Jupyter Notebook tutorial on LoRA tuning a Llama 3 model using NVIDIA NeMo, deploying fine-tuned adapter(s) with NIM, and sending mixed inference requests. For more information about NIM, see the documentation.