
Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Framework

Large language models (LLMs) are among the most advanced deep learning models for understanding written language. Many modern LLMs are built on the transformer architecture that Google introduced in 2017 in the research paper Attention Is All You Need.

NVIDIA NeMo framework is an end-to-end GPU-accelerated framework for training and deploying transformer-based LLMs up to a trillion parameters. In September 2022, NVIDIA announced that NeMo framework is now available in Open Beta, allowing you to train and deploy LLMs using your own data. With this announcement, several pretrained checkpoints have been uploaded to HuggingFace, enabling anyone to deploy LLMs locally using GPUs.

This post walks you through downloading, optimizing, and deploying a 1.3 billion parameter GPT-3 model using the NeMo framework. The deployment uses NVIDIA Triton Inference Server, open-source inference-serving software that can deploy a wide variety of models and serve inference requests on both CPUs and GPUs in a scalable manner.

System requirements

While training LLMs requires massive amounts of compute power, trained models can be deployed for inference at a much smaller scale for most use cases.

The models from HuggingFace can be deployed on a local machine with the following specifications:

  • Running a modern Linux OS (tested with Ubuntu 20.04).
  • An NVIDIA Ampere architecture GPU or newer with at least 8 GB of GPU memory.
  • At least 16 GB of system memory.
  • Docker version 19.03 or newer with the NVIDIA Container Runtime.
  • Python 3.7 or newer with pip.
  • A reliable Internet connection for downloading models.
  • Permissive firewall, if serving inference requests from remote machines.
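
A few of these requirements can be checked from Python before pulling any containers. The following is a minimal sketch rather than an official tool: the helper names are hypothetical, the thresholds are taken from the list above, and the GPU check assumes that nvidia-smi is on the PATH. It does not check the GPU architecture, the Docker version, or the NVIDIA Container Runtime.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 7)             # from the requirements list
MIN_GPU_MEMORY_MIB = 8 * 1024   # 8 GB of GPU memory, expressed in MiB


def parse_gpu_memory(csv_text):
    """Parse 'name, memory' CSV rows from nvidia-smi into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory.strip())))
    return gpus


def check_requirements():
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.7 or newer is required")
    if shutil.which("docker") is None:
        problems.append("docker was not found on the PATH")
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi was not found on the PATH")
        return problems
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False)
    try:
        gpus = parse_gpu_memory(result.stdout) if result.returncode == 0 else []
    except ValueError:
        gpus = []  # output was not in the expected 'name, memory' form
    if not any(memory >= MIN_GPU_MEMORY_MIB for _, memory in gpus):
        problems.append("no GPU with at least 8 GB of memory was detected")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```

Keeping the parsing in its own function makes it possible to test the logic on a machine without a GPU.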


The NeMo framework is now in Open Beta and available for anyone who completes the free registration form. Registration is required to gain access to the training and inference containers, as well as helper scripts to convert and deploy trained models.

Several trained NeMo framework models are hosted publicly on HuggingFace, including 1.3B, 5B, and 20B GPT-3 models. These models have been converted to the .nemo format, which is optimized for inference.

Converted models cannot be retrained or fine-tuned, but they enable fully trained models to be deployed for inference. These models are significantly smaller in size compared to the pre-conversion checkpoints and are supported by the FasterTransformer (FT) format. FasterTransformer is a backend in Triton Inference Server to run LLMs across GPUs and nodes.

For the purposes of this post, we used the 1.3B model, which has the quickest inference speeds and can comfortably fit in memory for most modern GPUs.
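
As a rough back-of-the-envelope check on why the 1.3B model fits on most GPUs, the size of the weights can be estimated from the parameter count. This sketch assumes FP16 storage at 2 bytes per parameter (suggested by the fp16 suffix in the checkpoint file name) and counts the weights only; activations, attention caches, and runtime overhead are not included, so actual memory use is higher. The function name is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


if __name__ == "__main__":
    # Parameter counts of the public models mentioned in this post.
    for label, parameters in [("1.3B", 1.3e9), ("5B", 5e9), ("20B", 20e9)]:
        size = weight_memory_gb(parameters)
        print(f"{label}: about {size:.1f} GB of FP16 weights")
```

Under these assumptions, the 1.3B model needs roughly 2.6 GB for its weights, which leaves headroom on a GPU with 8 GB of memory.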

To convert the model, run the following steps.

Download the 1.3B model to your system. Run the following command in the directory where you want to keep converted models for NVIDIA Triton to read:


Make a note of the folder to which the model was copied, as it is used throughout the remainder of this post.

Verify the MD5sum of the downloaded file:

$ md5sum nemo_gpt1.3B_fp16.nemo
38f7afe7af0551c9c5838dcea4224f8a  nemo_gpt1.3B_fp16.nemo
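
The same check can be scripted, which is convenient when the download is automated. The following is a minimal sketch using the Python standard library; the function name is hypothetical, and the expected checksum is the value shown above.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "38f7afe7af0551c9c5838dcea4224f8a"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    checkpoint = Path("nemo_gpt1.3B_fp16.nemo")
    if not checkpoint.exists():
        print(f"{checkpoint} not found in the current directory")
    elif md5_of_file(checkpoint) == EXPECTED_MD5:
        print("Checksum matches")
    else:
        print("Checksum mismatch; download the file again")
```

Reading in chunks matters here because the checkpoint is several gigabytes in size.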

Use a web browser to log in to NGC. Enter the Setup menu by selecting your account name, select Get API Key, and then select Generate API Key to create the token. Make a note of the key, as it is shown only once.

In the terminal, add the token to Docker:

$ docker login
Username: $oauthtoken
Password: <insert token here>

Replace <insert token here> with the token that was generated. The username must be exactly $oauthtoken, as this indicates that a personal access token is being used.

Pull the latest training and inference images for the NeMo framework:

$ docker pull
$ docker pull

At the time of publication, the latest image tags are 22.08.01-py3 for training and 22.08-py3 for inference. We recommend checking for newer tags on NGC and pulling those, if available.

Verify that the images were pulled successfully, as the IDs might change with different tags:

$ docker images | grep "ea-bignlp/bignlp"
                       22.08.01-py3                         d591b7488a47   11 days ago     17.3GB
                       22.08-py3                            77a6681df8d6   2 weeks ago     12.2GB

Model conversion

To optimize throughput and latency of the model, it can be converted to the FT format, which contains performance modifications to the encoder and decoder layers in the transformer architecture.

FT models can serve inference requests with 3x lower latency or better compared to their non-FT counterparts. The NeMo framework training container includes the FT framework as well as scripts to convert a .nemo file to the FT format.

Triton Inference Server expects models to be stored in a model repository. Model repositories contain checkpoints and model-specific information that Triton Inference Server reads to tune the model at deployment time. As with the FT framework, the NeMo framework training container includes scripts to convert the FT model to a model repository for Triton.

Converting a model to the FT format and creating a model repository for the converted model can be done in one pass in a Docker container. To create an FT-based model repository, run the following command. The parameters that might have to change are described after the command.

docker run --rm \
    --gpus all \
    --shm-size=16GB \
    -v /path/to/checkpoints:/checkpoints \
    -v /path/to/checkpoints/output:/model_repository \ \
    bash -c 'export PYTHONPATH=/opt/bignlp/FasterTransformer:${PYTHONPATH} && \
    cd /opt/bignlp && \
    python3 FasterTransformer/examples/pytorch/gpt/utils/ \
        --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo \
        --infer-gpu-num 1 \
        --saved-dir /model_repository/gpt3_1.3b \
        --weight-data-type fp16 \
        --load-checkpoints-to-cpu 0 && \
    python3 /opt/bignlp/bignlp-scripts/bignlp/collections/export_scripts/ \
        --model-train-name gpt3_1.3b \
        --template-path /opt/bignlp/fastertransformer_backend/all_models/gpt/fastertransformer/config.pbtxt \
        --ft-checkpoint /model_repository/gpt3_1.3b/1-gpu \
        --config-path /model_repository/gpt3_1.3b/config.pbtxt \
        --max-batch-size 256 \
        --pipeline-model-parallel-size 1 \
        --tensor-model-parallel-size 1 \
        --data-type bf16'

This command launches a Docker container to run the conversions. The following list describes a few important parameters and their functions:

  • -v /path/to/checkpoints:/checkpoints: Specify the local directory where checkpoints were saved. This is the directory that was mentioned during the checkpoint download step earlier. The final :/checkpoints directory in the command should stay the same.
  • -v /path/to/checkpoints/output:/model_repository: Specify the local directory to save the converted checkpoints to. Make a note of this location as it is used in the deployment step later. The final :/model_repository directory in the command should stay the same.
  • If a newer image exists on NGC, replace the highlighted tag with the new version.
  • --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo: The name of the downloaded checkpoint to convert. If you are using a different version, replace the name here.
  • --infer-gpu-num 1: This is the number of GPUs to use for the deployed model. If using more than one GPU, increase this number to the desired amount. The remainder of this post assumes that the value of 1 was used here.
  • --model-train-name gpt3_1.3b: The name of the deployed model. If you are using a different model name, make a note of the new name as NVIDIA Triton requests require the name to be specified.
  • --tensor-model-parallel-size 1: If you are using a different GPU count for inference, this number must be updated. The value should match that of --infer-gpu-num from earlier.

After running the command, verify that the model has been converted by viewing the specified output directory. The output should be similar to the following (truncated for brevity):

$ ls -R output/

1-gpu  config.pbtxt
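
The check can also be scripted. The sketch below looks for the two entries that the conversion command writes: the config.pbtxt file given by --config-path and the checkpoint directory given by --ft-checkpoint, both under the model directory named by --saved-dir. The function name is hypothetical, and the <N>-gpu directory name for GPU counts other than 1 is an assumption extrapolated from the 1-gpu directory shown above.

```python
from pathlib import Path


def missing_repository_entries(repository_root, model_name="gpt3_1.3b", gpu_count=1):
    """Return the expected model repository paths that do not exist yet."""
    model_dir = Path(repository_root) / model_name
    expected = [
        model_dir / "config.pbtxt",        # written via --config-path
        model_dir / f"{gpu_count}-gpu",    # written via --ft-checkpoint
    ]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    # Replace with the local directory that was mounted as /model_repository.
    missing = missing_repository_entries("/path/to/checkpoints/output")
    if missing:
        print("Missing entries:")
        for path in missing:
            print("  ", path)
    else:
        print("Model repository looks complete")
```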


Model deployment

Now that the model has been converted to a model repository, it can be deployed with Triton Inference Server. Do this using the NeMo framework Inference container, which has NVIDIA Triton built in.

By default, NVIDIA Triton uses three ports for HTTP, gRPC, and metrics requests. To launch the server, run the following command:

docker run --rm \
    --name triton-inference-server \
    -d \
    --gpus all \
    -p 8000-8002:8000-8002 \
    -v /path/to/checkpoints/output:/model_repository \ \
    bash -c 'export CUDA_VISIBLE_DEVICES=0 && \
    tritonserver --model-repository /model_repository'

  • -d: This tells Docker to run the container in the background. The server remains online and available for requests until the container is killed.
  • -p 8000-8002:8000-8002: NVIDIA Triton communicates using ports 8000 for HTTP requests, 8001 for gRPC requests, and 8002 for metrics information. These ports are mapped from the container to the host, allowing the host to handle requests directly and route them to the container.
  • -v /path/to/checkpoints/output:/model_repository: Specify the location where the converted checkpoints were saved to on the machine. This should match the model repository location from the conversion step earlier.
  • If a newer version exists on NGC, replace the highlighted tag with the new version.
  • export CUDA_VISIBLE_DEVICES=0: Specify which devices to use. If the model was converted to use multiple GPUs earlier, this should be a comma-separated list of the GPUs up to the desired number. For example, if you are using four GPUs, this should be CUDA_VISIBLE_DEVICES=0,1,2,3.
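
When the GPU count changes between deployments, the CUDA_VISIBLE_DEVICES value can be generated rather than typed by hand. This is a small illustrative sketch with a hypothetical helper name; it assumes that the GPUs to use are numbered consecutively, as in the four-GPU example above.

```python
def cuda_visible_devices(gpu_count, first_device=0):
    """Build a comma-separated device list such as '0,1,2,3'."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    devices = range(first_device, first_device + gpu_count)
    return ",".join(str(index) for index in devices)


if __name__ == "__main__":
    # The GPU count should match the value used for --infer-gpu-num.
    print("export CUDA_VISIBLE_DEVICES=" + cuda_visible_devices(4))
```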

To verify that the container was launched successfully, run docker ps, which should show output similar to the following:

CONTAINER ID   IMAGE                                          COMMAND                  CREATED              STATUS              PORTS                                                           NAMES
f25cf23b75b7                                                  "/opt/nvidia/nvidia_…"   About a minute ago   Up About a minute   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   triton-inference-server

Check the logs to see if the model was deployed and ready for requests (output truncated for brevity).

$ docker logs triton-inference-server
I0928 14:29:34.011299 1] 
| Model     | Version | Status |
| gpt3_1.3b | 1       | READY  |

I0928 14:29:34.131430 1] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0928 14:29:34.132280 1] 
| Option                           | Value                                                                                                                                                                                        |
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.24.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /model_repository                                                                                                                                                                            |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |

I0928 14:29:34.133520 1] Started GRPCInferenceService at
I0928 14:29:34.133751 1] Started HTTPService at
I0928 14:29:34.174655 1] Started Metrics Service at

If the output is similar to what’s shown here, the model is ready to receive inference requests.
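
Readiness can also be checked programmatically instead of by reading the logs. The sketch below uses only the Python standard library and the health endpoints of the HTTP/REST inference protocol that Triton implements (/v2/health/ready for the server and /v2/models/<name>/ready for a model). It assumes that the HTTP port is reachable at localhost:8000, as in the port mapping above; the helper name is hypothetical.

```python
import urllib.error
import urllib.request


def is_ready(base_url, model_name=None, timeout=5.0):
    """Return True if the server (or the named model) reports that it is ready."""
    if model_name is None:
        path = "/v2/health/ready"
    else:
        path = f"/v2/models/{model_name}/ready"
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts, and HTTP error responses
        # are all treated as "not ready".
        return False


if __name__ == "__main__":
    url = "http://localhost:8000"  # assumes the port mapping used in this post
    print("server ready:", is_ready(url))
    print("model ready:", is_ready(url, "gpt3_1.3b"))
```

A check like this is useful in startup scripts that must wait for the model to load before sending requests.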

Sending inference requests

With a local Triton Inference Server running, you can start sending inference requests to the server. NVIDIA Triton’s client API supports multiple languages including Python, Java, and C++. For the purposes of this post, we provide a sample Python application.

from argparse import ArgumentParser
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import GPT2Tokenizer

def fill_input(name, data):
    # Wrap a NumPy array in a Triton input and attach the data to it.
    infer_input = httpclient.InferInput(name, data.shape, np_to_triton_dtype(data.dtype))
    infer_input.set_data_from_numpy(data)
    return infer_input

def build_request(query, host, output):
    with httpclient.InferenceServerClient(host) as client:
        request_data = []
        # Token IDs of the prompt, with a leading batch dimension of 1.
        request = np.array([query]).astype(np.uint32)
        request_len = np.array([[len(query)]]).astype(np.uint32)
        request_output_len = np.array([[output]]).astype(np.uint32)
        top_k = np.array([[1]]).astype(np.uint32)
        top_p = np.array([[0.0]]).astype(np.float32)
        temperature = np.array([[1.0]]).astype(np.float32)

        request_data.append(fill_input('input_ids', request))
        request_data.append(fill_input('input_lengths', request_len))
        request_data.append(fill_input('request_output_len', request_output_len))
        request_data.append(fill_input('runtime_top_k', top_k))
        request_data.append(fill_input('runtime_top_p', top_p))
        request_data.append(fill_input('temperature', temperature))
        # The model name must match --model-train-name from the conversion step.
        result = client.infer('gpt3_1.3b', request_data)
        # Remove singleton dimensions to get a flat array of token IDs.
        output = result.as_numpy('output_ids').squeeze()
        return output

def main():
    parser = ArgumentParser('Simple Triton Inference Requestor')
    parser.add_argument('query', type=str, help='Enter a text query to send to '
                        'the Triton Inference Server in quotes.')
    parser.add_argument('--output-length', type=int, help='Specify the desired '
                        'length for output.', default=30)
    parser.add_argument('--server', type=str, help='Specify the host:port that '
                        'Triton is listening on. Defaults to localhost:8000',
                        default='localhost:8000')
    args = parser.parse_args()

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    query = tokenizer(args.query).input_ids
    request = build_request(query, args.server, args.output_length)
    # Decode the returned token IDs back into text.
    print(tokenizer.decode(request))

if __name__ == '__main__':
    main()

At a high level, the script does the following:

  1. Takes an input request from the user, such as, “Hello there! How are you today?”
  2. Tokenizes the input using a pretrained GPT-2 tokenizer from HuggingFace.
  3. Builds an inference request using several required and optional parameters, such as request, temperature, output length, and so on.
  4. Sends the request to NVIDIA Triton.
  5. Decodes the response using the tokenizer from earlier.

To run the code, several Python dependencies are required. These packages can be installed by running the following command:

$ pip3 install numpy tritonclient[http] transformers

After the dependencies are installed, save the code to a local file. Next, run the application as follows:

$ python3 "1 2 3 4 5 6"

This sends the prompt “1 2 3 4 5 6” to the local inference server and should output the following to complete the sequence up to the default response token limit of 30:

“1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36”

The server can now respond to any HTTP requests using this basic formula and can support multiple concurrent requests, both locally and remotely.
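
To illustrate concurrent requests from a single client, several prompts can be dispatched through a thread pool. This is a minimal sketch rather than part of the sample application: run_concurrently is a hypothetical helper, and send_request stands for any callable that takes a prompt and returns the response, such as a wrapper that tokenizes the prompt, calls build_request, and decodes the result.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(prompts, send_request, max_workers=4):
    """Call send_request for each prompt in a thread pool.

    Results are returned in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_request, prompts))


if __name__ == "__main__":
    # Stand-in for a function that sends a prompt to the server.
    def fake_request(prompt):
        return prompt.upper()

    for response in run_concurrently(["1 2 3", "a b c"], fake_request):
        print(response)
```

Threads are a reasonable fit because each request spends most of its time waiting on the network rather than computing on the client.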


Large language models are powering a growing number of applications. With the public release of several NeMo framework models, it’s now possible to deploy trained models locally.

This post outlined how to deploy public NeMo framework models using a simple Python script. You can test more robust models and use cases by downloading the larger models hosted on HuggingFace.

For more information about using NeMo framework, see the NeMo framework documentation and NVIDIA/nemo GitHub repo.
