Whether you’re working on-premises or in the cloud, NVIDIA NIM inference microservices provide enterprise developers with easy-to-deploy optimized AI models from the community, partners, and NVIDIA. Part of NVIDIA AI Enterprise, NIM offers a secure, streamlined path forward to iterate quickly and build innovations for world-class generative AI solutions.
Using a single optimized container, you can easily deploy a NIM in under 5 minutes on accelerated NVIDIA GPU systems in the cloud or data center, or on workstations and PCs. Alternatively, if you want to avoid deploying a container, you can begin prototyping your applications with NIM APIs from the NVIDIA API catalog.
- Use prebuilt containers that deploy with a single command on NVIDIA accelerated infrastructure anywhere.
- Maintain security and control of your data, your most valuable enterprise resource.
- Achieve best accuracy with support for models that have been fine-tuned using techniques like LoRA.
- Integrate accelerated AI inference endpoints leveraging consistent, industry-standard APIs.
- Work with the most popular generative AI application frameworks like LangChain, LlamaIndex, and Haystack.
This post walks through a simple Docker deployment of NVIDIA NIM. You’ll be able to use NIM microservices APIs across the most popular generative AI application frameworks like Haystack, LangChain, and LlamaIndex. For a full guide to deploying NIM, see the NIM documentation.
How to deploy NIM in 5 minutes
Before you get started, make sure you have all the prerequisites. Follow the requirements in the NIM documentation. Note that an NVIDIA AI Enterprise License is required to download and use NIM.
When you have everything set up, run the following script:
```bash
# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to cache the downloaded models
# (use $HOME rather than a quoted ~, which would not be expanded)
export LOCAL_NIM_CACHE="$HOME/.cache/nim"
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```
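The container downloads and loads the model on first start, so it may take a little while before it can serve requests. A minimal readiness-check sketch in Python, assuming the container exposes the `/v1/health/ready` endpoint described in the NIM documentation:

```python
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True once the NIM health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the service is not up yet
        return False


# Example (requires the container above to be running):
# import time
# while not is_ready("http://0.0.0.0:8000"):
#     time.sleep(5)
```

Polling in a loop like this is handy in startup scripts that must not send inference traffic before the model is loaded.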
Next, test an inference request:
```bash
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'
```
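Because the endpoint follows the OpenAI completions spec, the request and response bodies are plain JSON. A small sketch of building the payload used above and pulling the generated text out of the response (the `choices[0].text` field is assumed from the OpenAI-style response shape):

```python
import json


def completion_payload(prompt: str,
                       model: str = "meta/llama3-8b-instruct",
                       max_tokens: int = 64) -> str:
    """Build the JSON body for the /v1/completions request shown above."""
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})


def completion_text(response_body: str) -> str:
    """Extract the generated text from an OpenAI-style completions response."""
    return json.loads(response_body)["choices"][0]["text"]


# e.g. completion_text('{"choices": [{"text": " in a land far away"}]}')
```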
Now you have a controlled, optimized production deployment to securely build generative AI applications.
Sample NVIDIA-hosted deployments of NIM are also available on the NVIDIA API catalog.
Note that as a new version of NIM is released, the most up-to-date documentation will always be at https://docs.nvidia.com/nim.
How to integrate NIM with your applications
If you’re eager to test NIM before completing your own deployment, you can do so using NVIDIA-hosted API endpoints in the NVIDIA API catalog. Follow the steps below.
Integrate NIM endpoints
You can start with a completions curl request that follows the OpenAI spec. Note that to stream outputs, you should set `stream` to `true` in the request body.
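When streaming is enabled, the endpoint returns server-sent events: each chunk arrives on a `data: {...}` line, and the stream ends with a `data: [DONE]` sentinel, per the OpenAI spec. A sketch of parsing those lines, assuming that framing:

```python
import json


def parse_sse_chunks(lines):
    """Yield the JSON payload of each `data:` line in an OpenAI-style
    server-sent-event stream, stopping at the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip keep-alives and blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)
```

In practice you would feed this the response's line iterator (for example, `response.iter_lines()` from the `requests` library) rather than a list.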
To use NIMs in Python code with the OpenAI library:
- You don’t need to provide an API key if you’re using a NIM.
- Make sure to update the `base_url` to wherever your NIM is running.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="no-key-required"
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
NIM is also integrated into application frameworks like Haystack, LangChain, and LlamaIndex, bringing secure, reliable, accelerated model inferencing to developers already building amazing generative AI applications with these popular tools.
Check out notebooks from each of these frameworks to learn how to use NIM:
- Haystack RAG Pipeline with Self-Deployed AI Models and NVIDIA NIM
- LangChain RAG Agent with NVIDIA NIM
- LlamaIndex RAG Pipeline with NVIDIA NIM
Get more from NIM
With fast, reliable, and simple model deployment using NVIDIA NIM, you can focus on building performant and innovative generative AI workflows and applications. To get even more from NIM, learn how to use the microservices with LLMs customized with LoRA adapters.
NIMs are regularly released and improved. Visit the API catalog often to see the latest NVIDIA NIM microservices for vision, retrieval, 3D, digital biology, and more.