For organizations adapting AI foundation models with domain-specific data, the ability to rapidly create and deploy fine-tuned models is key to efficiently delivering value with enterprise generative AI applications.
NVIDIA NIM offers prebuilt, performance-optimized inference microservices for the latest AI foundation models, including seamless deployment of models customized using parameter-efficient fine-tuning (PEFT).
In some cases, it’s ideal to use methods such as continual pretraining, DPO, supervised fine-tuning (SFT), or model merging, which adjust the underlying model weights directly during training or customization, unlike PEFT with low-rank adaptation (LoRA). In these cases, the inference software configuration for the model must be updated to achieve optimal performance with the new weights.
Rather than burden you with this often lengthy process, NIM can automatically build a TensorRT-LLM inference engine that is performance-optimized for the adjusted model and the GPUs in your local environment, and then load it for running inference as part of a single-step model deployment process.
In this post, we explore how to rapidly deploy NIM microservices for models that have been customized through SFT by using locally built, performance-optimized TensorRT-LLM inference engines. We include all the necessary commands as well as some helpful options, so you can try it out on your own today.
Prerequisites
To run this tutorial, you need an NVIDIA-accelerated compute environment with access to 80 GB of GPU memory and git-lfs installed.
Before you can pull and deploy a NIM microservice in an NVIDIA-accelerated compute environment, you also need an NGC API key.
- Navigate to the Meta Llama 3 8B Instruct model listing in the NVIDIA API Catalog.
- Choose Login at the top right and follow the instructions.
- When you’re logged in, choose Build with this NIM on the model page.
- Choose Self-Hosted API and follow either option to access NIM microservices:
- NVIDIA Developer Program membership with free access to NIM for research, development, and testing only.
- The 90-day NVIDIA AI Enterprise license, which includes access to NVIDIA Enterprise Support.
After you provide the necessary details for your selected access method, copy your NGC API key and be ready to move forward with NIM. For more information, see Launch NVIDIA NIM for LLMs.
Getting started with NIM microservices
Provide your NGC API key as an environment variable in your compute environment:
export NGC_API_KEY=<<YOUR API KEY HERE>>
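The NIM container image is hosted on the NVIDIA container registry, so you may also need to authenticate your container runtime with the same key before pulling it. A minimal sketch, assuming Docker is your runtime (the $oauthtoken username is a literal value used for NGC registry logins):
# Log in to nvcr.io so Docker can pull the NIM container image.
# '$oauthtoken' is a literal username; your NGC API key is the password.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin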
You also must point to, create, and modify permissions for a directory to be used as a cache during the optimization process:
export NIM_CACHE_PATH=/tmp/nim/.cache
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH
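A directory under /tmp works for a quick test, but it is typically cleared on reboot. If you want the locally built engine to be reused across restarts, one option is to point the cache at a persistent location instead, for example:
# Optional: use a persistent cache location so the locally built engine is reused.
export NIM_CACHE_PATH=$HOME/.cache/nim
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH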
To demonstrate locally built, optimized TensorRT-LLM inference engines for deploying fine-tuned models with NIM, you need a model that has undergone customization through SFT. For this tutorial, use the NVIDIA OpenMath2-Llama3.1-8B model, which is a customization of Meta’s Llama-3.1-8B using the OpenMathInstruct-2 dataset.
The base model must be available as a downloadable NIM for LLMs. For more information about downloadable NIM microservices, see the NIM Type: Run Anywhere filter in the NVIDIA API Catalog.
All you need are the weights for this model, which can be obtained in several ways. For this post, clone the model repository using the following commands:
git lfs install
git clone https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B
export MODEL_WEIGHT_PARENT_DIRECTORY=$PWD
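Before launching the microservice, it can be worth confirming that the large weight files were actually downloaded rather than left as Git LFS pointer stubs. A quick check:
# The directory should contain config.json, tokenizer files, and multi-GB
# .safetensors shards; tiny .safetensors files indicate unfetched LFS pointers.
ls -lh $MODEL_WEIGHT_PARENT_DIRECTORY/OpenMath2-Llama3.1-8B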
Now that you have the model weights collected, move on to the next step: firing up the microservice.
Selecting from available performance profiles
Based on your selected model and hardware configuration, NIM automatically selects the most applicable inference performance profile available. There are two available performance profiles for local inference engine generation:
- Latency: Focused on delivering a NIM microservice that is optimized for latency.
- Throughput: Focused on delivering a NIM microservice that is optimized for batched throughput.
For more information about supported features, including available precision, see the Support Matrix topic in the NVIDIA NIM documentation.
Example using an SFT model
Create a locally built TensorRT-LLM inference engine for OpenMath2-Llama3.1-8B by running the following commands:
docker run -it --rm --gpus all \
    --user $(id -u):$(id -g) \
    --network=host \
    --shm-size=32GB \
    -e NGC_API_KEY \
    -e NIM_FT_MODEL=/opt/weights/hf/OpenMath2-Llama3.1-8B \
    -e NIM_SERVED_MODEL_NAME=OpenMath2-Llama3.1-8B \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $MODEL_WEIGHT_PARENT_DIRECTORY:/opt/weights/hf \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.0
The command is nearly identical to the typical command you’d use to deploy a NIM microservice. In this case, you’ve added the extra NIM_FT_MODEL environment variable, which points to the mounted OpenMath2-Llama3.1-8B weights, and NIM_SERVED_MODEL_NAME, which sets the name under which the model is served.
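On the first launch, expect some extra startup time while the engine is generated. If you’d like to confirm when the microservice is ready to accept requests, a minimal check against the default port of 8000 (the same address used in the Python examples below) is:
# Returns HTTP 200 once the model is loaded and ready to serve.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v1/health/ready
# Lists the served models; OpenMath2-Llama3.1-8B should appear in the response.
curl -s http://localhost:8000/v1/models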
With that, NIM builds an optimized inference engine locally. To perform inference using this new NIM microservice, run the following Python code example:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none"
)

completion = client.chat.completions.create(
    model="OpenMath2-Llama3.1-8B",
    messages=[{"role": "user", "content": "What is your name?"}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=100,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
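If you’d rather test from the command line, a roughly equivalent request (non-streaming, with the same prompt and sampling parameters) can be sent with curl to the same OpenAI-compatible endpoint the Python client targets:
# Send a single chat completion request to the OpenAI-compatible API.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "OpenMath2-Llama3.1-8B",
          "messages": [{"role": "user", "content": "What is your name?"}],
          "temperature": 0.2,
          "top_p": 0.7,
          "max_tokens": 100
        }'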
Building an optimized TensorRT-LLM engine with a custom performance profile
On supported GPUs, you can use a similar command to spin up your NIM microservice. Follow the Model Profile instructions to launch your microservice and determine which profiles are accessible for it.
export IMG_NAME="nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.0"
docker run --rm --runtime=nvidia --gpus=all \
    -e NGC_API_KEY=$NGC_API_KEY \
    $IMG_NAME list-model-profiles
Assuming you’re running on H100 GPUs, you should see the following profiles available:
tensorrt_llm-h100-fp8-tp1-throughput
tensorrt_llm-h100-fp8-tp2-latency
To use a specific profile, relaunch the NIM microservice and provide an additional environment variable, NIM_MODEL_PROFILE, to specify the desired profile:
docker run -it --rm --gpus all \
    --network=host \
    --shm-size=32GB \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE=tensorrt_llm-h100-fp8-tp2-latency \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    $IMG_NAME
Now that you’ve relaunched your NIM microservice with the desired profile, use Python to interact with the model:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none"
)

completion = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is your name?"}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=100,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Conclusion
Whether you’re using PEFT or SFT methods for model customization, NIM accelerates customized model deployment for high-performance inferencing in a few simple steps. With optimized TensorRT-LLM inference engines built automatically in your local environment, NIM is unlocking new possibilities for rapidly deploying accelerated AI inferencing anywhere.
Learn more and get started today by visiting the NVIDIA API catalog and checking out the documentation. To engage with NVIDIA and the NIM microservices community, see the NVIDIA NIM developer forum.