Generative AI

A Simple Guide to Deploying Generative AI with NVIDIA NIM

Whether you’re working on-premises or in the cloud, NVIDIA NIM inference microservices provide enterprise developers with easy-to-deploy optimized AI models from the community, partners, and NVIDIA. Part of NVIDIA AI Enterprise, NIM offers a secure, streamlined path forward to iterate quickly and build innovations for world-class generative AI solutions.

Using a single optimized container, you can easily deploy a NIM in under 5 minutes on accelerated NVIDIA GPU systems in the cloud or data center, or on workstations and PCs. Alternatively, if you want to avoid deploying a container, you can begin prototyping your applications with NIM APIs from the NVIDIA API catalog.

  • Use prebuilt containers that deploy with a single command on NVIDIA accelerated infrastructure anywhere.
  • Maintain security and control of your data, your most valuable enterprise resource.
  • Achieve best accuracy with support for models that have been fine-tuned using techniques like LoRA.
  • Integrate accelerated AI inference endpoints leveraging consistent, industry-standard APIs.
  • Work with the most popular generative AI application frameworks like LangChain, LlamaIndex, and Haystack. 

This post walks through a simple Docker deployment of NVIDIA NIM. You’ll be able to use NIM microservices APIs across the most popular generative AI application frameworks like Haystack, LangChain, and LlamaIndex. For a full guide to deploying NIM, see the NIM documentation.

How to deploy NIM in 5 minutes 

Before you get started, make sure you have all the prerequisites. Follow the requirements in the NIM documentation. Note that an NVIDIA AI Enterprise License is required to download and use NIM.
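
One of those prerequisites is an NGC API key, which the container uses to authenticate and download the model from NGC. The exact flow is covered in the NIM documentation; a minimal sketch, assuming a key generated from your NGC account:

# Export your NGC API key so the container can pull the model from NGC
export NGC_API_KEY=<your-ngc-api-key>

# Log in to the NGC container registry; for API-key auth the username is the literal string $oauthtoken
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin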

When you have everything set up, run the following script:

# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Next, test an inference request:

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta/llama3-8b-instruct",
      "prompt": "Once upon a time",
      "max_tokens": 64
    }'
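
If the request succeeds, the NIM returns an OpenAI-style completion object. The exact fields and generated text will vary; an abridged sketch of the shape:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "meta/llama3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "...generated text...",
      "finish_reason": "length"
    }
  ]
}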

Now you have a controlled, optimized production deployment to securely build generative AI applications. 

Sample NVIDIA-hosted deployments of NIM are also available on the NVIDIA API catalog.

Note that as a new version of NIM is released, the most up-to-date documentation will always be at https://docs.nvidia.com/nim.

How to integrate NIM with your applications 

If you’re eager to test NIM without deploying it on your own, you can do so using the NVIDIA-hosted API endpoints in the NVIDIA API catalog. Follow the steps below.

Integrate NIM endpoints

You can start with a completions curl request that follows the OpenAI spec. Note that to stream outputs, you should set stream to true.
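
As one sketch, the request below enables streaming against the self-hosted NIM started earlier (assuming it’s still running on port 8000); for the NVIDIA-hosted endpoints, substitute the URL and API key shown on the model’s API catalog page:

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta/llama3-8b-instruct",
      "messages": [{"role": "user", "content": "What is a GPU?"}],
      "max_tokens": 64,
      "stream": true
    }'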

To use NIM in Python code with the OpenAI library:

  • You don’t need to provide an API key if you’re using a self-hosted NIM.
  • Make sure to update the base_url to wherever your NIM is running.

from openai import OpenAI

client = OpenAI(
  base_url = "http://0.0.0.0:8000/v1",
  api_key="no-key-required"
)

completion = client.chat.completions.create(
  model="meta/llama3-8b-instruct",
  messages=[{"role":"user","content":"What is a GPU?"}],
  temperature=0.5,
  top_p=1,
  max_tokens=1024,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")

NIM is also integrated into application frameworks like Haystack, LangChain, and LlamaIndex, bringing secure, reliable, accelerated model inferencing to developers already building amazing generative AI applications with these popular tools. 
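
As one example of what that integration looks like, the sketch below points LangChain at the self-hosted NIM from earlier. It assumes the langchain-nvidia-ai-endpoints package is installed and that the Llama 3 NIM is running locally; see each framework’s notebook for the full pattern:

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Point LangChain's NVIDIA chat model at the locally running NIM
llm = ChatNVIDIA(
    base_url="http://0.0.0.0:8000/v1",
    model="meta/llama3-8b-instruct",
)

# Stream the response token by token
for chunk in llm.stream("What is a GPU?"):
    print(chunk.content, end="")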

Check out the notebooks from each of these frameworks to learn how to use NIM.

Get more from NIM

With fast, reliable, and simple model deployment using NVIDIA NIM, you can focus on building performant and innovative generative AI workflows and applications. To get even more from NIM, learn how to use the microservices with LLMs customized with LoRA adapters.

NIMs are regularly released and improved. Visit the API catalog often to see the latest NVIDIA NIM microservices for vision, retrieval, 3D, digital biology, and more. 
