
Deploy Multilingual LLMs with NVIDIA NIM


Multilingual large language models (LLMs) are increasingly important for enterprises operating in today’s globalized business landscape. As businesses expand their reach across borders and cultures, the ability to communicate effectively in multiple languages is crucial for success. By supporting and investing in multilingual LLMs, enterprises can break down language barriers, foster inclusivity, and gain a competitive edge in the global marketplace. 

Foundation models often face challenges when dealing with non-English languages. Most are trained primarily on English text corpora, leading to an inherent bias towards Western linguistic patterns and cultural norms.

This results in LLMs struggling to accurately capture the nuances, idioms, and cultural contexts specific to non-Western languages and societies. Additionally, the lack of high-quality digitized text data for many low-resource languages exacerbates the resource scarcity issue, making it difficult for LLMs to learn and generalize effectively across these languages. Consequently, LLMs often fail to reflect the culturally appropriate expressions, emotional connotations, and contextual subtleties inherent in non-Western languages, leading to potential misinterpretations or biased outputs. 

According to a recent Meta Llama 3 blog post, “To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages. However, we do not expect the same level of performance in these languages as in English.”

In this post, we explore two LoRA-tuned adapters from different regions and deploy them with the Llama 3 8B Instruct NIM so that it performs better in Chinese and Hindi.

Using these adapters improves accuracy in these languages over the base Llama 3 model, as they are fine-tuned on additional Chinese and Hindi text, respectively. Additionally, we use LoRA's design to our advantage, as it captures all of the extra language information in small, low-rank matrices for each model.

This enables us to load a single base model with the low-rank matrices A and B for each respective LoRA-tuned variant. In this manner, it’s possible to store thousands of LLMs and run them dynamically and efficiently within a minimal GPU memory footprint.
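
As a rough back-of-envelope illustration of why this footprint stays small (the dimensions below are hypothetical, not exact Llama 3 layer sizes), a rank-64 adapter over a 4096×4096 projection stores only about 3% of the parameters of the full weight matrix:

# Illustrative arithmetic only: a LoRA adapter stores r * (d_in + d_out) values
# per adapted weight matrix, versus d_in * d_out for the full matrix.
d_in, d_out, r = 4096, 4096, 64        # hypothetical projection size and LoRA rank
full_params = d_in * d_out             # 16,777,216
lora_params = r * (d_in + d_out)       # 524,288
print(f"LoRA stores {lora_params / full_params:.1%} of the full matrix")  # ~3.1%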

Tutorial prerequisites

To make the best use of this tutorial, you need basic knowledge of LLM training and inference pipelines, as well as:

  • Hugging Face registered user access and general familiarity with the transformers library. 
  • Llama3-8B Instruct NIM from the NVIDIA API catalog. Deploy it by clicking on “Run Anywhere with NIM.”

What is NVIDIA NIM?

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed to speed up generative AI deployment in enterprises. Supporting a wide range of AI models, including NVIDIA AI Foundation models, community models, and custom models, it ensures seamless, scalable AI inferencing, on-premises or in the cloud, leveraging industry-standard APIs.

A diagram showing NVIDIA NIM architecture.
Figure 1. NVIDIA NIM architecture with a common base, config layer for Llama3-8b-Instruct, and a single command for serving

NIMs provide interactive APIs for running inference on an AI model. The container images are packaged on a per-model or model-family basis (see Figure 1). Each NIM is its own Docker container with a model and includes a runtime that runs on any NVIDIA GPU with sufficient GPU memory. Under the hood, NIMs use NVIDIA TensorRT-LLM to optimize the models, with specialized accelerated profiles optimally selected for NVIDIA H100 Tensor Core GPUs, NVIDIA A100 Tensor Core GPUs, NVIDIA A10 Tensor Core GPUs, and NVIDIA L40S GPUs.

In this blog post, we extend the base NIM with multilingual capabilities by dynamically serving multiple LoRA models, one for each language.

Basic workflow for foundation LLMs

NVIDIA NIM has a set of pre-built LLMs that can be easily set up and served. The following command serves the Llama-3-8b-instruct NIM on the available GPUs.

export NGC_API_KEY=<your-ngc-api-key>

docker run -it --rm --name=meta-llama-3-8b-instruct \
--runtime=nvidia \
--gpus all \
-e NGC_API_KEY \
-p 8000:8000 \
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Once deployed, we can run inference using the following command:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "model": "meta/llama3-8b-instruct",
  "prompt": "Once upon a time",
  "max_tokens": 64
}'

Before continuing to the next step, stop the server:

docker stop meta-llama-3-8b-instruct

Advanced workflow for custom-tuned LLMs

In the case of non-Western languages, foundation models don’t always have the same level of accuracy and robustness as English, as there is considerably less training data for those languages. 

In this case, parameter-efficient fine-tuning techniques such as LoRA can be used to adapt a pre-trained LLM to perform better in such languages. 

Deploying multilingual LLMs comes with the challenge of efficiently serving hundreds or even thousands of tuned models. For example, a single base LLM, such as Llama 2, may have many LoRA-tuned variants per language or locale. A standard system would require loading all the models independently, consuming large amounts of GPU memory. 

We specifically chose LoRA because it captures all of the adaptation information in small low-rank matrices per model: a single base model is loaded together with the low-rank matrices A and B of each respective LoRA-tuned variant. In this manner, it's possible to store thousands of LLMs and run them dynamically and efficiently within a minimal GPU memory footprint. See this post for more theoretical background on LoRA.
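
As a minimal illustration of how these pieces fit together (an illustrative NumPy sketch, not how NIM implements LoRA internally), the frozen base weight W is loaded once and shared across all variants, while each adapter only contributes its own small A and B matrices:

import numpy as np

# Minimal sketch of a LoRA forward pass (illustrative only).
d_in, d_out, r, alpha = 1024, 1024, 64, 16     # hypothetical sizes
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)) * 0.01  # frozen base weight (shared)
A = rng.standard_normal((r, d_in)) * 0.01      # adapter-specific, r x d_in
B = np.zeros((d_out, r))                       # adapter-specific, d_out x r

x = rng.standard_normal(d_in)

# Base output plus the low-rank update: W @ x + (alpha / r) * B @ (A @ x).
# Serving many languages only means keeping many small (A, B) pairs around,
# while the large W is loaded once.
y = W @ x + (alpha / r) * (B @ (A @ x))
print(y.shape)   # (1024,)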

NVIDIA NIM supports LoRA adapters trained using either Hugging Face or NVIDIA NeMo; we use this capability to add more robust support for non-Western languages on top of Llama 3 8B Instruct. Check out this blog post to learn more about deploying LoRA adapters with NIM.

We now show how we can enrich NIM capabilities with multiple languages using LoRA.

Download the models

1. Install git lfs if needed

    git lfs install

2. Git clone the Chinese and Hindi LoRA-tuned models from Hugging Face

export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY

git clone https://huggingface.co/AdithyaSK/LLama3-Gaja-Hindi-8B-Instruct-alpha
git clone https://huggingface.co/shibing624/llama-3-8b-instruct-262k-chinese-lora

popd

chmod -R 777 $LOCAL_PEFT_DIRECTORY

Organize your LoRA model store

Each LoRA adapter must be stored in its own directory, with one or more of these adapter directories placed inside the LOCAL_PEFT_DIRECTORY directory. The loaded adapters are automatically named after the directories they're stored in. NVIDIA NIM for LLMs supports both the NeMo and the Hugging Face Transformers-compatible formats.

In the Hugging Face case, the LoRA directory must contain an adapter_config.json file and one of the {adapter_model.safetensors, adapter_model.bin} weight files. The target modules supported by NIM are ["gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"].
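
Before serving, it can be useful to confirm that each downloaded adapter stays within these constraints. The following snippet is an optional sanity check; it assumes the standard PEFT adapter_config.json fields r and target_modules and the ~/loras location used above:

import json
from pathlib import Path

# Report each adapter's LoRA rank and flag any target modules NIM does not support.
SUPPORTED = {"gate_proj", "o_proj", "up_proj", "down_proj", "k_proj", "q_proj", "v_proj"}

for cfg_path in Path("~/loras").expanduser().glob("*/adapter_config.json"):
    cfg = json.loads(cfg_path.read_text())
    unsupported = set(cfg.get("target_modules", [])) - SUPPORTED
    print(f"{cfg_path.parent.name}: rank={cfg.get('r')}, "
          f"unsupported modules={sorted(unsupported) or 'none'}")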

The LOCAL_PEFT_DIRECTORY should be organized according to the structure below:

loras
├── llama-3-8b-instruct-262k-chinese-lora
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── LLama3-Gaja-Hindi-8B-Instruct-alpha
    ├── adapter_config.json
    └── adapter_model.safetensors

Deploy multiple LoRA models with NIM

After setting up relevant LoRA-tuned models, we can serve the model for inference. This is similar to running the base NIM, but we now specify the LoRA directory.

# Host directory with the LoRA adapters, and the path where it is mounted inside the container
export LOCAL_PEFT_DIRECTORY=~/loras
export NIM_PEFT_SOURCE=/home/nvs/loras

# How often (in seconds) NIM rechecks NIM_PEFT_SOURCE for newly added adapters
export NIM_PEFT_REFRESH_INTERVAL=3600
export CONTAINER_NAME=meta-llama3-8b-instruct

export NIM_CACHE_PATH=~/nim-cache
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH

NIM enables loading multiple LoRAs with different rank sizes, but caps the rank at 32 by default. In the case of the Hindi LoRA, we need to set the maximum rank above the default, as this particular model is tuned with a rank of 64.

export NIM_MAX_LORA_RANK=64

Now serve the NIM with the LoRAs. The command is similar to running the Llama 3 foundation model, but it also loads all of the LoRA models stored in LOCAL_PEFT_DIRECTORY.

docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -e NIM_MAX_LORA_RANK \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

Once served, we can run inference on any of the LoRA models we stored beforehand. Check the available models using the following command:

curl -X GET 'http://0.0.0.0:8000/v1/models'

This will output all the LoRA-tuned models we can now use for inference.

{
  "object": "list",
  "data": [
    {"id":"meta/llama3-8b-instruct","object":"model","created":1717511877,"owned_by":"system","root":"meta/llama3-8b-instruct","parent":null,"permission":[{"id":"modelperm-06017a10c1b1422cb0596baa7fec744d","object":"model_permission","created":1717511877,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]},
    {"id":"llama-3-8b-instruct-262k-chinese-lora","object":"model","created":1717511877,"owned_by":"system","root":"meta/llama3-8b-instruct","parent":null,"permission":[{"id":"modelperm-ad5ce194c084490ca9f8aa5f23c4fd2f","object":"model_permission","created":1717511877,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]},
    {"id":"LLama3-Gaja-Hindi-8B-Instruct-alpha","object":"model","created":1717511877,"owned_by":"system","root":"meta/llama3-8b-instruct","parent":null,"permission":[{"id":"modelperm-e11405b8a2f746f5b189de2766336eac","object":"model_permission","created":1717511877,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}
  ]
}

Inference with NIM

Use the following cURL command for inference, specifying the name of the LoRA model:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-3-8b-instruct-262k-chinese-lora",
    "prompt": "介绍一下机器学习",
    "max_tokens": 512
  }'

These are the results:

{
  "id": "cmpl-92780e47ef854328a48330d6813e8a26",
  "object": "text_completion",
  "created": 1717512584,
  "model": "llama-3-8b-instruct-262k-chinese-lora",
  "choices": [
    {
      "index": 0,
      "text":"算法的基本概念和应用场景?\n\n机器学习算法是一类用于自动处理和分析数据的算法。这些算法可以学习从数据中提取模式、关系和预测性质。在这个回答中,我们将介绍机器学习算法的基本概念和应用场景。\n\n机器学习算法的基本概念:\n\n1. 训练数据集:机器学习算法学习从数据集中获取的样本。\n2. 模型训练:算法分析训练数据集,学习模式和关系。\n3. 测试数据集:训练后的模型评估性能在新的数据集上。\n4. 训练和测试迭代:重复训练和测试步骤,以提升算法的准确性。\n\n机器学习算法的应用场景:\n\n1. 数据挖掘:机器学习算法用于发现隐藏在数据中的模式和关系。\n2. 预测和预测分析:算法用于预测未来事件、趋势和绩效。\n3. recommender systems:机器学习算法推荐产品、服务或内容,基于用户行为和偏好。\n4. 自然语言处理(NLP):机器学习算法用于理解、翻译和生成人类语言。\n5. 图像和视频处理:算法用于图像和视频分类、识别和无人机。\n\n总之,机器学习算法是自动处理和分析数据的强大工具。它们可以用于各种应用场景,包括数据挖掘、预测分析和自然语言处理。通过 sürekli 进化和改进,机器学习算法继续驱动各种 industries 和领域的创新。",
      "Logprobs":null,
      "Finish_reason":"stop",
      "Stop_reason":null}],
      "Usage":{
      "Prompt_tokens":6,
      "Total_tokens":370,
      "completion_tokens":364}}

Now let’s try the Hindi LoRA.

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "LLama3-Gaja-Hindi-8B-Instruct-alpha",
    "prompt": "मैं अपने समय प्रबंधन कौशल को कैसे सुधार सकता हूँ? मुझे पांच बिंदु बताएं और उनका वर्णन करें।",
    "max_tokens": 512
  }'

These are the results:

{
  "id": "cmpl-ddf3fa6e78fc4d24b0bbbc446fb57451",
  "object": "text_completion",
  "created": 1717513888,
  "model": "LLama3-Gaja-Hindi-8B-Instruct-alpha",
  "choices": [
    {
      "index": 0,
      "text":" मैंने पिछले साल अपनी व्यस्त भावनाओं और अव्यवस्था के बारे में बहुत सोचा है, लेकिन सच कहा, मैं अब तक कुछ तरीकों की कोशिश नहीं की हूं।\nमैंने अपनी कार्य, परिवार, और मितृत के प्रति दायित्वों के बीच लक्षित दिवस के 24 घंटे को व्याख्या नहीं की है, इस प्रकार अपने प्राथमिकताओं को परिभाषित नहीं कर पाया हूं। मैंने पिछले साल एक नया नौकरी उत्क्षेप नहीं किया, एक नया काम विशेष रूप से अपने व्यवसाय के त्वरित विकास के बिना पर था। मेरे पास घर में अधिक मुक्त समय है, लेकिन कुछ भी नहीं किया है या पूरा नहीं किया है। मैं स्वयंसेवा और उईट्रोबोल्टर के संतुलन की भी हैरानी होने लगी है, जो एक स्थायी चुनौतीपूर्ण बात है। मैं डेडलाइन के प्रचालय अभियान, सहकारियों को पकड़ना और अपनी आवश्यक वस्तुओं को प्राप्त करने के लिए बढ़ाना शुरू करूं, मेरे जीवन में अव्यवस्था के अनुकूलन में मदद मिलता है।\nमैं अपने भावनात्मक लाभों पर ध्यान केंद्रित करना चाहता हूं, मेरी अपनी जीवन शैली के घनिष्ठ होने के बावजूद। मैंने महसूस किया कि मेरे रोजमर्रा के जीवन में कई दोष थे, और कर_epsilon.org यह मदद करता है मैं मिडल एक्सिस trauma के लिए एक श्रेणी में प_registers (न्यूनतम, उच्चतम, साप",
      "Logprobs":null,
      "Finish_reason":"length",
      "Stop_reason":null}],
      "Usage":{"prompt_tokens":47,"total_tokens":559,"completion_tokens":512}
      }

You can also use the locally served NIM in LangChain.

from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(
    base_url="http://0.0.0.0:8000/v1",
    model="llama-3-8b-instruct-262k-chinese-lora",
    max_tokens=1000,
)

result = llm.invoke("介绍一下机器学习")
print(result.content)

Note that as new versions of NIM are released, the most up-to-date documentation will always be available at https://docs.nvidia.com/nim.

Conclusion

NVIDIA NIM enables you to seamlessly deploy and scale multiple LoRA adapters, giving developers the ability to add support for additional languages. Using this feature, enterprises can serve hundreds of LoRAs over the same base NIM, dynamically selecting the relevant LoRA adapter for each language.
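
As a sketch of what that per-language routing could look like on the client side (the language-to-adapter mapping and the complete helper below are illustrative, not part of NIM), requests are simply sent to the OpenAI-compatible completions endpoint with the adapter name as the model:

import requests

# Minimal client-side sketch of per-language adapter routing (illustrative only).
# The adapter names match the directories served above; how you detect the
# request language is up to your application.
NIM_URL = "http://0.0.0.0:8000/v1/completions"
ADAPTER_BY_LANGUAGE = {
    "zh": "llama-3-8b-instruct-262k-chinese-lora",
    "hi": "LLama3-Gaja-Hindi-8B-Instruct-alpha",
    "en": "meta/llama3-8b-instruct",   # fall back to the base model
}

def complete(prompt: str, language: str, max_tokens: int = 256) -> str:
    model = ADAPTER_BY_LANGUAGE.get(language, "meta/llama3-8b-instruct")
    response = requests.post(
        NIM_URL,
        json={"model": model, "prompt": prompt, "max_tokens": max_tokens},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]

print(complete("介绍一下机器学习", language="zh"))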

Getting started with NVIDIA NIM is easy. Within the NVIDIA API catalog, developers have access to a wide range of AI models to build and deploy their AI applications.

Start prototyping directly in the catalog using the graphical user interface or interact directly with the API for free. To deploy NIM inference microservices on your infrastructure, check out A Simple Guide to Deploying Generative AI with NVIDIA NIM.
