For organizations adapting AI foundation models with domain-specific data, the ability to rapidly create and deploy fine-tuned models is key to efficiently delivering value with enterprise generative AI applications.
NVIDIA NIM offers prebuilt, performance-optimized inference microservices for the latest AI foundation models, including seamless deployment of models customized using parameter-efficient fine-tuning (PEFT).
In some cases, it’s ideal to use methods such as continual pretraining, DPO, supervised fine-tuning (SFT), or model merging, which adjust the underlying model weights directly during training or customization, unlike PEFT with low-rank adaptation (LoRA). In these cases, the inference software configuration for the model must be updated to achieve optimal performance with the new weights.
Rather than burden you with this often lengthy process, NIM can automatically build a TensorRT-LLM inference engine that is performance-optimized for the adjusted model and the GPUs in your local environment, and then load it for running inference as part of a single-step model deployment process.
In this post, we explore how to rapidly deploy NIM microservices for models that have been customized through SFT by using locally built, performance-optimized TensorRT-LLM inference engines. We include all the necessary commands as well as some helpful options, so you can try it out on your own today.
Prerequisites
To run this tutorial, you need an NVIDIA-accelerated compute environment with access to 80 GB of GPU memory and git-lfs installed.
Before you can pull and deploy a NIM microservice in an NVIDIA-accelerated compute environment, you also need an NGC API key.
- Navigate to the Meta Llama 3 8B Instruct model listing in the NVIDIA API Catalog.
- Choose Login at the top right and follow the instructions.
- When you’re logged in, choose Build with this NIM on the model page.
- Choose Self-Hosted API and follow either option to access NIM microservices:
- NVIDIA Developer Program membership with free access to NIM for research, development, and testing only.
- The 90-day NVIDIA AI Enterprise license, which includes access to NVIDIA Enterprise Support.
After you provide the necessary details for your selected access method, copy your NGC API key and be ready to move forward with NIM. For more information, see Launch NVIDIA NIM for LLMs.
Getting started with NIM microservices
Provide your NGC API key as an environment variable in your compute environment:
export NGC_API_KEY=<<YOUR API KEY HERE>>
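The NIM container image is hosted on the NVIDIA container registry, so you may also need to authenticate your container runtime with the same key before pulling it. A minimal sketch, assuming Docker is your runtime (the $oauthtoken username is a literal value used for NGC registry logins):
# Log in to nvcr.io so Docker can pull the NIM container image.
# '$oauthtoken' is a literal username; your NGC API key is the password.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin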
You also must point to, create, and modify permissions for a directory to be used as a cache during the optimization process:
export NIM_CACHE_PATH=/tmp/nim/.cache
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH
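A directory under /tmp works for a quick test, but it is typically cleared on reboot. If you want the locally built engine to be reused across restarts, one option is to point the cache at a persistent location instead, for example:
# Optional: use a persistent cache location so the locally built engine is reused.
export NIM_CACHE_PATH=$HOME/.cache/nim
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH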
To demonstrate locally built, optimized TensorRT-LLM inference engines for deploying fine-tuned models with NIM, you need a model that has undergone customization through SFT. For this tutorial, use the NVIDIA OpenMath2-Llama3.1-8B model, which is a customization of Meta’s Llama-3.1-8B using the OpenMathInstruct-2 dataset.
The base model must be available as a downloadable NIM for LLMs. For more information about downloadable NIM microservices, see the NIM Type: Run Anywhere filter in the NVIDIA API Catalog.
All you need are the weights for this model, which can be obtained in several ways. For this post, clone the model repository using the following commands:
git lfs install
git clone https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B
export MODEL_WEIGHT_PARENT_DIRECTORY=$PWD
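Before launching the microservice, it can be worth confirming that the large weight files were actually downloaded rather than left as Git LFS pointer stubs. A quick check:
# The directory should contain config.json, tokenizer files, and multi-GB
# .safetensors shards; tiny .safetensors files indicate unfetched LFS pointers.
ls -lh $MODEL_WEIGHT_PARENT_DIRECTORY/OpenMath2-Llama3.1-8B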
Now that you have the model weights collected, move on to the next step: firing up the microservice.
Selecting from available performance profiles
Based on your selected model and hardware configuration, NIM automatically selects the most applicable inference performance profile available. There are two available performance profiles for local inference engine generation:
- Latency: Focused on delivering a NIM microservice that is optimized for latency.
- Throughput: Focused on delivering a NIM microservice that is optimized for batched throughput.
For more information about supported features, including available precision, see the Support Matrix topic in the NVIDIA NIM documentation.
Example using an SFT model
Create a locally built TensorRT-LLM inference engine for OpenMath2-Llama3.1-8B by running the following commands:
docker run -it --rm --gpus all \
    --user $(id -u):$(id -g) \
    --network=host \
    --shm-size=32GB \
    -e NGC_API_KEY \
    -e NIM_FT_MODEL=/opt/weights/hf/OpenMath2-Llama3.1-8B \
    -e NIM_SERVED_MODEL_NAME=OpenMath2-Llama3.1-8B \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $MODEL_WEIGHT_PARENT_DIRECTORY:/opt/weights/hf \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.0
The command is nearly identical to the typical command you’d use to deploy a NIM microservice. In this case, you’ve added the extra NIM_FT_MODEL environment variable, which points to the mounted OpenMath2-Llama3.1-8B weights, and NIM_SERVED_MODEL_NAME, which sets the name under which the model is served.
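On the first launch, expect some extra startup time while the engine is generated. If you’d like to confirm when the microservice is ready to accept requests, a minimal check against the default port of 8000 (the same address used in the Python examples below) is:
# Returns HTTP 200 once the model is loaded and ready to serve.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v1/health/ready
# Lists the served models; OpenMath2-Llama3.1-8B should appear in the response.
curl -s http://localhost:8000/v1/models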
With that, NIM builds an optimized inference engine locally. To perform inference using this new NIM microservice, run the following Python code example:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none"
)

completion = client.chat.completions.create(
    model="OpenMath2-Llama3.1-8B",
    messages=[{"role": "user", "content": "What is your name?"}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=100,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
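If you’d rather test from the command line, a roughly equivalent request (non-streaming, with the same prompt and sampling parameters) can be sent with curl to the same OpenAI-compatible endpoint the Python client targets:
# Send a single chat completion request to the OpenAI-compatible API.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "OpenMath2-Llama3.1-8B",
          "messages": [{"role": "user", "content": "What is your name?"}],
          "temperature": 0.2,
          "top_p": 0.7,
          "max_tokens": 100
        }'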
Building an optimized TensorRT-LLM engine with a custom performance profile
On supported GPUs, you can use a similar command to spin up your NIM microservice. Follow the Model Profile instructions to launch your microservice and determine which profiles are accessible for it.
export IMG_NAME="nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.0"
docker run --rm --runtime=nvidia --gpus=all \
    -e NGC_API_KEY=$NGC_API_KEY \
    $IMG_NAME list-model-profiles
Assuming you’re running on H100 GPUs, you should see the following profiles available:
tensorrt_llm-h100-fp8-tp1-throughput
tensorrt_llm-h100-fp8-tp2-latency
To use a specific profile, relaunch the NIM microservice and provide an additional environment variable, NIM_MODEL_PROFILE, to specify the desired profile:
docker run -it --rm --gpus all \
    --network=host \
    --shm-size=32GB \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE=tensorrt_llm-h100-fp8-tp2-latency \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    $IMG_NAME
Now that you’ve relaunched your NIM microservice with the desired profile, use Python to interact with the model:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none"
)

completion = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is your name?"}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=100,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Conclusion
Whether you’re using PEFT or SFT methods for model customization, NIM accelerates customized model deployment for high-performance inferencing in a few simple steps. With optimized TensorRT-LLM inference engines built automatically in your local environment, NIM is unlocking new possibilities for rapidly deploying accelerated AI inferencing anywhere.
Learn more and get started today by visiting the NVIDIA API catalog and checking out the documentation. To engage with NVIDIA and the NIM microservices community, see the NVIDIA NIM developer forum.