Large language models (LLMs) are among the most advanced deep learning models, capable of understanding written language. Many modern LLMs are built on the transformer architecture that Google introduced in the 2017 research paper Attention Is All You Need.
The NVIDIA NeMo framework is an end-to-end, GPU-accelerated framework for training and deploying transformer-based LLMs with up to a trillion parameters. In September 2022, NVIDIA announced that the NeMo framework is available in Open Beta, allowing you to train and deploy LLMs using your own data. With this announcement, several pretrained checkpoints were uploaded to HuggingFace, enabling anyone to deploy LLMs locally using GPUs.
This post walks you through the process of downloading, optimizing, and deploying a 1.3 billion parameter GPT-3 model using the NeMo framework. The deployment uses NVIDIA Triton Inference Server, a powerful open-source inference-serving software that can deploy a wide variety of models and serve inference requests on both CPUs and GPUs in a scalable manner.
System requirements
While training LLMs requires massive amounts of compute power, trained models can be deployed for inference at a much smaller scale for most use cases.
The models from HuggingFace can be deployed on a local machine with the following specifications (a quick way to spot-check them is shown after the list):
- Running a modern Linux OS (tested with Ubuntu 20.04).
- An NVIDIA Ampere architecture GPU or newer with at least 8 GB of GPU memory.
- At least 16 GB of system memory.
- Docker version 19.03 or newer with the NVIDIA Container Runtime.
- Python 3.7 or newer with PIP.
- A reliable Internet connection for downloading models.
- Permissive firewall, if serving inference requests from remote machines.
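The following commands are a minimal sketch for spot-checking most of these prerequisites from a terminal. They use only standard nvidia-smi, Docker, and Python tooling; the exact output depends on your system:
# Report GPU name, total memory, and driver version
$ nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
# Confirm the Docker version and that the NVIDIA runtime is registered (the output should list nvidia)
$ docker --version
$ docker info | grep -i runtimes
# Confirm Python and PIP versions
$ python3 --version && pip3 --version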
Preparation
The NeMo framework is now in Open Beta and available for anyone who completes the free registration form. Registration is required to gain access to the training and inference containers, as well as helper scripts to convert and deploy trained models.
Several trained NeMo framework models are hosted publicly on HuggingFace, including 1.3B, 5B, and 20B GPT-3 models. These models have been converted to the .nemo format which is optimized for inference.
Converted models cannot be retrained or fine-tuned, but they enable fully trained models to be deployed for inference. These models are significantly smaller than the pre-conversion checkpoints and are compatible with the FasterTransformer (FT) format. FasterTransformer is a backend in Triton Inference Server that runs LLMs across GPUs and nodes.
For the purposes of this post, we used the 1.3B model, which has the quickest inference speeds and can comfortably fit in memory for most modern GPUs.
To download the model and prepare the containers, complete the following steps.
Download the 1.3B model to your system. Run the following command from the directory where you want to keep the checkpoint and, later, the converted models that NVIDIA Triton reads:
wget https://huggingface.co/nvidia/nemo-megatron-gpt-1.3B/resolve/main/nemo_gpt1.3B_fp16.nemo
Make a note of the folder to which the model was copied, as it is used throughout the remainder of this post.
Verify the MD5sum of the downloaded file:
$ md5sum nemo_gpt1.3B_fp16.nemo
38f7afe7af0551c9c5838dcea4224f8a  nemo_gpt1.3B_fp16.nemo
Use a web browser to log in to NGC at ngc.nvidia.com. Enter the Setup menu by selecting your account name. Select Get API Key followed by Generate API Key to create the token. Make a note of the key as it is only shown one time.
In the terminal, add the token to Docker:
$ docker login nvcr.io
Username: $oauthtoken
Password: <insert token here>
Replace <insert token here> with the token that was generated. The username must be exactly $oauthtoken, as this indicates that a personal access token is being used.
Pull the latest training and inference images for the NeMo framework:
$ docker pull nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3
$ docker pull nvcr.io/ea-bignlp/bignlp-inference:22.08-py3
At the time of publication, the latest image tags are 22.08.01-py3 for training and 22.08-py3 for inference. We recommend checking for newer tags on NGC and pulling those, if available.
Verify that the images were pulled successfully (the image IDs might differ for other tags):
$ docker images | grep "ea-bignlp/bignlp"
nvcr.io/ea-bignlp/bignlp-training    22.08.01-py3   d591b7488a47   11 days ago   17.3GB
nvcr.io/ea-bignlp/bignlp-inference   22.08-py3      77a6681df8d6   2 weeks ago   12.2GB
Model conversion
To optimize the throughput and latency of the model, convert it to the FT format, which contains performance modifications to the encoder and decoder layers in the transformer architecture.
FT can serve inference requests with latencies 3x lower or more compared to non-FT counterparts. The NeMo framework training container includes the FT framework as well as scripts to convert a .nemo file to the FT format.
Triton Inference Server expects models to be stored in a model repository. Model repositories contain checkpoints and model-specific information that Triton Inference Server reads to tune the model at deployment time. As with the FT framework, the NeMo framework training container includes scripts to convert the FT model to a model repository for Triton.
Converting a model to the FT format and creating a model repository for the converted model can be done in one pass in a Docker container. To create an FT-based model repository, run the following command. Parameters that you might have to change are described in the list after the command.
docker run --rm \
    --gpus all \
    --shm-size=16GB \
    -v /path/to/checkpoints:/checkpoints \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3 \
    bash -c 'export PYTHONPATH=/opt/bignlp/FasterTransformer:${PYTHONPATH} && \
    cd /opt/bignlp && \
    python3 FasterTransformer/examples/pytorch/gpt/utils/nemo_ckpt_convert.py \
        --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo \
        --infer-gpu-num 1 \
        --saved-dir /model_repository/gpt3_1.3b \
        --weight-data-type fp16 \
        --load-checkpoints-to-cpu 0 && \
    python3 /opt/bignlp/bignlp-scripts/bignlp/collections/export_scripts/prepare_triton_model_config.py \
        --model-train-name gpt3_1.3b \
        --template-path /opt/bignlp/fastertransformer_backend/all_models/gpt/fastertransformer/config.pbtxt \
        --ft-checkpoint /model_repository/gpt3_1.3b/1-gpu \
        --config-path /model_repository/gpt3_1.3b/config.pbtxt \
        --max-batch-size 256 \
        --pipeline-model-parallel-size 1 \
        --tensor-model-parallel-size 1 \
        --data-type bf16'
These steps launch a Docker container to run the conversions. The following list describes a few important parameters and their functions:
- -v /path/to/checkpoints:/checkpoints: Specify the local directory where checkpoints were saved. This is the directory that was mentioned during the checkpoint download step earlier. The final :/checkpoints directory in the command should stay the same.
- -v /path/to/checkpoints/output:/model_repository: Specify the local directory to save the converted checkpoints to. Make a note of this location as it is used in the deployment step later. The final :/model_repository directory in the command should stay the same.
- nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3: If a newer image exists on NGC, replace the tag with the new version.
- --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo: The name of the downloaded checkpoint to convert. If you are using a different version, replace the name here.
- --infer-gpu-num 1: The number of GPUs to use for the deployed model. If you are using more than one GPU, increase this number to the desired amount (see the two-GPU sketch after this list). The remainder of this post assumes that the value of 1 was used here.
- --model-train-name gpt3_1.3b: The name of the deployed model. If you are using a different model name, make a note of the new name, as NVIDIA Triton requests require the name to be specified.
- --tensor-model-parallel-size 1: If you are using a different GPU count for inference, this number must be updated. The value should match that of --infer-gpu-num from earlier.
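As an illustration only, the following is a hypothetical two-GPU variant of the same command. It is a sketch, not part of this walkthrough: it assumes the conversion script names its output directory after the GPU count (2-gpu instead of 1-gpu, mirroring the single-GPU case above), and the remainder of this post continues to use the single-GPU conversion.
docker run --rm \
    --gpus all \
    --shm-size=16GB \
    -v /path/to/checkpoints:/checkpoints \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3 \
    bash -c 'export PYTHONPATH=/opt/bignlp/FasterTransformer:${PYTHONPATH} && \
    cd /opt/bignlp && \
    python3 FasterTransformer/examples/pytorch/gpt/utils/nemo_ckpt_convert.py \
        --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo \
        --infer-gpu-num 2 \
        --saved-dir /model_repository/gpt3_1.3b \
        --weight-data-type fp16 \
        --load-checkpoints-to-cpu 0 && \
    python3 /opt/bignlp/bignlp-scripts/bignlp/collections/export_scripts/prepare_triton_model_config.py \
        --model-train-name gpt3_1.3b \
        --template-path /opt/bignlp/fastertransformer_backend/all_models/gpt/fastertransformer/config.pbtxt \
        --ft-checkpoint /model_repository/gpt3_1.3b/2-gpu \
        --config-path /model_repository/gpt3_1.3b/config.pbtxt \
        --max-batch-size 256 \
        --pipeline-model-parallel-size 1 \
        --tensor-model-parallel-size 2 \
        --data-type bf16'
If you use a multi-GPU conversion like this, remember to list the corresponding GPUs in CUDA_VISIBLE_DEVICES during the deployment step described later.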
After running the command, verify that the model has been converted by viewing the specified output directory. The output should be similar to the following (truncated for brevity):
$ ls -R output/
output/:
gpt3_1.3b

output/gpt3_1.3b:
1-gpu  config.pbtxt

output/gpt3_1.3b/1-gpu:
config.ini
merges.txt
model.final_layernorm.bias.bin
model.final_layernorm.weight.bin
...
Model deployment
Now that the model has been converted to a model repository, it can be deployed with Triton Inference Server. Do this using the NeMo framework Inference container, which has NVIDIA Triton built in.
By default, NVIDIA Triton uses three ports for HTTP, gRPC, and metrics requests. Launch the server with the following command; parameters that you might have to change are described in the list after the command.
docker run --rm \
    --name triton-inference-server \
    -d \
    --gpus all \
    -p 8000-8002:8000-8002 \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-inference:22.08-py3 \
    bash -c 'export CUDA_VISIBLE_DEVICES=0 && \
    tritonserver --model-repository /model_repository'
- -d: This tells Docker to run the container in the background. The server remains online and available for requests until the container is killed.
- -p 8000-8002:8000-8002: NVIDIA Triton communicates using ports 8000 for HTTP requests, 8001 for gRPC requests, and 8002 for metrics information. These ports are mapped from the container to the host, allowing the host to handle requests directly and route them to the container.
- -v /path/to/checkpoints/output:/model_repository: Specify the location where the converted checkpoints were saved on the machine. This should match the model repository location from the conversion step earlier.
- nvcr.io/ea-bignlp/bignlp-inference:22.08-py3: If a newer version exists on NGC, replace the tag with the new version.
- export CUDA_VISIBLE_DEVICES=0: Specify which devices to use. If the model was converted to use multiple GPUs earlier, this should be a comma-separated list of the GPUs up to the desired number. For example, if you are using four GPUs, this should be CUDA_VISIBLE_DEVICES=0,1,2,3.
To verify that the container was launched successfully, run docker ps, which should show output similar to the following:
CONTAINER ID   IMAGE                                          COMMAND                  CREATED              STATUS              PORTS                                                            NAMES
f25cf23b75b7   nvcr.io/ea-bignlp/bignlp-inference:22.08-py3   "/opt/nvidia/nvidia_…"   About a minute ago   Up About a minute   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   triton-inference-server
Check the logs to verify that the model was deployed and is ready for requests (output truncated for brevity):
$ docker logs triton-inference-server
I0928 14:29:34.011299 1 server.cc:629]
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| gpt3_1.3b | 1       | READY  |
+-----------+---------+--------+
I0928 14:29:34.131430 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0928 14:29:34.132280 1 tritonserver.cc:2176]
+----------------------------------+--------------------------------------------------+
| Option                           | Value                                            |
+----------------------------------+--------------------------------------------------+
| server_id                        | triton                                           |
| server_version                   | 2.24.0                                           |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /model_repository                                |
| model_control_mode               | MODE_NONE                                        |
| strict_model_config              | 0                                                |
| rate_limit                       | OFF                                              |
| pinned_memory_pool_byte_size     | 268435456                                        |
| cuda_memory_pool_byte_size{0}    | 67108864                                         |
| response_cache_byte_size         | 0                                                |
| min_supported_compute_capability | 6.0                                              |
| strict_readiness                 | 1                                                |
| exit_timeout                     | 30                                               |
+----------------------------------+--------------------------------------------------+
I0928 14:29:34.133520 1 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I0928 14:29:34.133751 1 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I0928 14:29:34.174655 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
If the output is similar to what’s shown here, the model is ready to receive inference requests.
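You can also confirm readiness over HTTP. The following commands are a minimal check using Triton's standard health, model readiness, and metrics endpoints; adjust the host and ports if you mapped them differently:
# Server-level readiness; prints 200 when the server is ready
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Model-level readiness for the deployed model
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/gpt3_1.3b/ready
# First few lines of the Prometheus-style metrics
$ curl -s localhost:8002/metrics | head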
Sending inference requests
With a local Triton Inference Server running, you can start sending inference requests to the server. NVIDIA Triton’s client API supports multiple languages including Python, Java, and C++. For the purposes of this post, we provide a sample Python application.
from argparse import ArgumentParser

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import GPT2Tokenizer


def fill_input(name, data):
    # Wrap a numpy array as a Triton InferInput object
    infer_input = httpclient.InferInput(name, data.shape, np_to_triton_dtype(data.dtype))
    infer_input.set_data_from_numpy(data)
    return infer_input


def build_request(query, host, output):
    with httpclient.InferenceServerClient(host) as client:
        request_data = []
        # Input tokens plus generation parameters expected by the deployed model
        request = np.array([query]).astype(np.uint32)
        request_len = np.array([[len(query)]]).astype(np.uint32)
        request_output_len = np.array([[output]]).astype(np.uint32)
        top_k = np.array([[1]]).astype(np.uint32)
        top_p = np.array([[0.0]]).astype(np.float32)
        temperature = np.array([[1.0]]).astype(np.float32)

        request_data.append(fill_input('input_ids', request))
        request_data.append(fill_input('input_lengths', request_len))
        request_data.append(fill_input('request_output_len', request_output_len))
        request_data.append(fill_input('runtime_top_k', top_k))
        request_data.append(fill_input('runtime_top_p', top_p))
        request_data.append(fill_input('temperature', temperature))

        # Send the request and return the generated token IDs
        result = client.infer('gpt3_1.3b', request_data)
        output = result.as_numpy('output_ids').squeeze()
        return output


def main():
    parser = ArgumentParser('Simple Triton Inference Requestor')
    parser.add_argument('query', type=str, help='Enter a text query to send to '
                        'the Triton Inference Server in quotes.')
    parser.add_argument('--output-length', type=int, help='Specify the desired '
                        'length for output.', default=30)
    parser.add_argument('--server', type=str, help='Specify the host:port that '
                        'Triton is listening on. Defaults to localhost:8000',
                        default='localhost:8000')
    args = parser.parse_args()

    # Tokenize the prompt, send it for inference, and decode the response
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    query = tokenizer(args.query).input_ids
    request = build_request(query, args.server, args.output_length)
    print(tokenizer.decode(request))


if __name__ == '__main__':
    main()
At a high level, the script does the following:
- Takes an input request from the user, such as, “Hello there! How are you today?”
- Tokenizes the input using a pretrained GPT-2 tokenizer from HuggingFace.
- Builds an inference request using several required and optional parameters, such as request, temperature, output length, and so on.
- Sends the request to NVIDIA Triton.
- Decodes the response using the tokenizer from earlier.
To run the code, several Python dependencies are required. These packages can be installed by running the following command:
$ pip3 install numpy tritonclient[http] transformers
After the dependencies are installed, save the code to a local file and name it infer.py. Next, run the application as follows:
$ python3 infer.py "1 2 3 4 5 6"
This sends the prompt “1 2 3 4 5 6” to the local inference server and should output the following to complete the sequence up to the default response token limit of 30:
“1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36”
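The script also accepts optional flags, defined in its argument parser, for the response length and the server address. For example, the following hypothetical invocation requests a longer completion from a remote server (replace the address with your own host:port):
$ python3 infer.py "Hello there! How are you today?" --output-length 50 --server 192.0.2.10:8000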
The server can now respond to HTTP requests using this basic formula and can support multiple concurrent requests, both locally and remotely.
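As a rough sketch of concurrent usage, the following loop sends several prompts at the same time by running the script from earlier (saved as infer.py) as background jobs. The responses may print in any order because the requests complete independently:
# Send three prompts concurrently to the local server
for prompt in "1 2 3 4 5 6" "Hello there! How are you today?" "Once upon a time"; do
    python3 infer.py "$prompt" &
done
wait   # Block until all background requests have finished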
Summary
Large language models are powering a growing number of applications. With the public release of several NeMo framework models, it’s now possible to deploy trained models locally.
This post outlined how to deploy public NeMo framework models using a simple Python script. You can test more robust models and use cases by downloading the larger models hosted on HuggingFace.
For more information about using NeMo framework, see the NeMo framework documentation and NVIDIA/nemo GitHub repo.