NeMo Framework で日本語 LLM を簡単デプロイ - オンライン推論編 -

Reading Time: 3 minutes

NeMo Framework とは

NeMo Framework は、生成 AI モデルのワークフローをエンドツーエンドでカバーするフレームワークで、現在は LLM を中心とした NLP 向けのコンテナーが NGC 上で公開されています。NeMo Framework のコンテナーを入手するには、こちらからアクセスして ”Download now-language” をクリックして下さい。承認後に NGC 上で NeMo Framework のコンテナーが入手できます。

NVIDIA AI Enterprise ライセンスをお持ちの方は、NGC サイトから入手可能です。NGC へログイン後、Enterprise Catalog にある ”Feature Branches & Models” にアクセスしてください。こちらで NeMo Framework の入手方法をご案内しています。

NeMo Framework のコンテナーイメージには、Training と Inference がありますが、本記事ではモデルのデプロイに使用する Inference コンテナーについて主に解説します。NeMo Framework の更なる詳細、NeMo Framework Training コンテナーについては、こちらの記事をご確認下さい。

現在、用途毎に機能を細分化して Microservice として提供する NeMo Microservice の Early Access が開始されています。推論用 Microservice である NeMo Inference Microservice に徐々に移行していく予定ですが、今回は今すぐ使える NeMo Inference コンテナーを用いた GPU 推論について解説します。

NeMo Framework Inference コンテナーには以下の NVIDIA LLM Inference ソフトウェアスタック構成図で示すように、TensorRT-LLM、TensorRT-LLM と連携可能な Triton Inference Server など、モデルのデプロイに必要な様々なライブラリが含まれています。

TensorRT-LLM は wheel package が公開されており pip コマンドで簡単にインストールする事ができる為、TensorRT-LLM や Triton Inference Server を個々にデプロイ環境へインストールする事も可能です。

敢えて NeMo Framework Inference コンテナーを介してそれらを使用するメリットとして、TensorRT-LLM 用の推論エンジン生成が容易になったり、Triton Inference Server 起動の為に必要な設定が自動生成されるという事、NeMo Framework Inference コンテナーオリジナルの LLM 推論の為の Triton Inference Server python backend を使用できるなどが挙げられます。

LLM 推論の為のライブラリ TensorRT-LLM について、以下でもう少し詳しく紹介したいと思います。

TensorRT-LLM とは

大規模言語モデル (LLM) はその驚くべき性能により AI で可能な領域を拡大しています。その一方 LLM のパラメーター数は膨大であり、コスト効率の高い推論を行う為には推論の最適化は必須の作業です。最適化をしない、あるいは適切ではない場合、推論処理の実行は遅くなりランニングコストは大きくなってしまいます。

推論の最適化には、例えばカーネル融合や量子化のようなモデル最適化から、処理の C++ 化、KV キャッシュ、継続的なインフライトバッチング処理、高速な attention の導入等のランタイム最適化まで様々の最適化手法が存在します。

LLM 推論の１つの課題はこれら多くの最適化手法の中から、自分のユースケースに適した手法を選択し、互換性の問題を解決しながらモデル毎に実装していくという作業です。

これらの課題を解決する為に、NVIDIA は推論用に LLM をコンパイルおよび最適化するための包括的なライブラリである TensorRT-LLM をリリースしました。TensorRT-LLM を用いると、モデル毎にコードの最適化をする必要はなく、TensorRT-LLM のモデル最適化コンパイラを用いて自動で最適な手法が選択されます。

以下に TensorRT-LLM の特徴を列挙します。

マルチ GPU マルチノード推論のサポート
インフライトバッチングのサポートおよび、Paged Attention、Flash Attention に代表される高速な attention のサポート
使いやすい Python API 提供
Triton Inferense Server と連携する為のバックエンドの提供
Windows 環境のサポート (ベータ機能)

TensorRT-LLM の推論性能については、こちらの NVIDIA の技術ブログ記事をご確認下さい。

ここからは、以下のチュートリアルで NeMo Framework を用いた推論の具体的な手順をご紹介したいと思います。

GPU 推論チュートリアル

本記事では、Hugging Face Model Hub から日本語 LLM をダウンロードして、NeMo Framework Inference コンテナーを使用して GPU 推論する方法について解説します。

本チュートリアルでの手順は以下の通りです。

事前準備
- NeMo Framework Training コンテナーの起動
- Hugging Face Model Hub からモデルのダウンロード
- モデルを NeMo フォーマットに変換
- (オプション) PEFT の実行
NeMo Framework Inference コンテナーの起動
TensorRT-LLM 推論用エンジンの生成
Triton Inference Server を起動
クライアントから推論リクエストを行う

また、今回のチュートリアルの検証環境は以下の条件で行っております。

ハードウェア
- DGX A100
  - GPU: 8x NVIDIA A100 80 GB GPUs (driver version: 535.129.03)
  - CPU: Dual AMD Rome 7742 (128 コア, 2.25 GHz (base))
  - システムメモリ: 2 TB
ソフトウェア
- OS: Ubuntu 22.04.3 LTS
- Docker: 23.0.4
- Container: nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10

事前準備

以下のコマンドで作業用のディレクトリを作成し、移動します。

mkdir inference-example
cd inference-example

NeMo Framework Training コンテナーの起動

以下のコマンドで NeMo Framework Training コンテナーを起動します。NeMo ファイルの変換や PEFT には apex モジュールが必要な為、Inference コンテナーではなく Training コンテナーでの作業が必要です。

sudo docker run --rm -it --gpus device=0 --shm-size=2g --ulimit memlock=-1 --network=host -v ${PWD}:/workspace  -w /workspace -v ${PWD}/results:/workspace/results nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03 bash

Hugging Face Model Hub からモデルのダウンロード

このチュートリアルでは、tokyotech-llm/Swallow-7b-instruct-hf を使用します。以下のコードで Hugging Face の Model Hub から訓練済みの LLM をダウンロードします。

import os
from huggingface_hub import snapshot_download
 
MODEL_DIR = "./models"
os.makedirs(MODEL_DIR, exist_ok=True)
 
snapshot_download(
    repo_id="tokyotech-llm/Swallow-7b-instruct-hf",
    local_dir=f"{MODEL_DIR}/Swallow-7b-instruct-hf",
    local_dir_use_symlinks=False
    )

NeMo フォーマットへの変換

以下のスクリプトを使用して、ダウンロードした Hugging Face の Llama モデルを NeMo フォーマットへ変換します。

python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
    --in-file=./models/Swallow-7b-instruct-hf \
    --out-file=./models/Swallow-7b-instruct-hf/Swallow-7b-instruct-hf.nemo \
    --precision="16"

(オプション) PEFT の実行

推論時に PEFT の結果を用いて推論をしたい場合、NeMo Framework で日本語 LLM をファインチューニング – PEFT 編 – を参考に NeMo Framework で PEFT を実行し swallow_7b_instruct_ptuning.nemo を作成して下さい。

NeMo Framework Inference コンテナーの起動

モデルの NeMo 形式への変換が終わったら、実際に推論を行ってみましょう。NGC から最新の NeMo Framework Inference コンテナーを pull し、以下のコマンドで起動します。

sudo docker run --rm -it --gpus device=0 --shm-size=4g -p 8000:8000 -p 8888:8888 -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/workspace/results nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10 bash

TensorRT-LLM 推論用エンジンの生成

NeMo Framework 付属の export_to_trt.py スクリプトを使用して NeMo 形式のモデルから TensorRT-LLM 推論用エンジンを作成します。

python /opt/NeMo/scripts/export/export_to_trt.py \
    --nemo_checkpoint /workspace/models/Swallow-7b-instruct-hf/Swallow-7b-instruct-hf.nemo \
    --model_type="llama" \
    --model_repository /opt/checkpoints/trtllm-engine

上記スクリプトを実行すると /opt/checkpoints/trtllm-engine に TensorRT-LLM エンジンが生成されます。

Triton Inference Server の起動

生成した TensorRT-LLM 推論用エンジンが保存されているパスを指定して deploy_triton.py を用いて Triton Inference Server を起動します。

python /opt/NeMo/scripts/deploy/deploy_triton.py \
    --triton_model_name swallow-7b-bf16 \
    --triton_model_repository /opt/checkpoints/trtllm-engine \
    --model_type="llama"

以下のようなログが標準出力に出て、サーバーが起動すれば成功です。

I1225 07:51:09.942834 2329 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001 
I1225 07:51:09.943047 2329 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000 
I1225 07:51:09.998750 2329 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

クライアントから推論リクエストを行う

推論サーバーが起動したので、以下のスクリプトを用いて、クライアントからサーバーに推論リクエストを投げてみましょう。

from pytriton.client import ModelClient
import numpy as np

def query_llm(url, model_name, prompts, max_output_token=128, top_k=1, top_p=0.0, temperature=1.0,  init_timeout=600.0):
   str_ndarray = np.array(prompts)[..., np.newaxis]
   prompts = np.char.encode(str_ndarray, "utf-8")
   max_output_token = np.full(prompts.shape, max_output_token, dtype=np.int_)
   top_k = np.full(prompts.shape, top_k, dtype=np.int_)
   top_p = np.full(prompts.shape, top_p, dtype=np.single)
   temperature = np.full(prompts.shape, temperature, dtype=np.single)

   with ModelClient(url, model_name, init_timeout_s=init_timeout) as client:
      result_dict = client.infer_batch(
            prompts=prompts,
            max_output_token=max_output_token,
            top_k=top_k,
            top_p=top_p,
            temperature=temperature,
      )
      output_type = client.model_config.outputs[0].dtype

   if output_type == np.bytes_:
      sentences = np.char.decode(result_dict["outputs"].astype("bytes"), "utf-8")
      return sentences
   else:
      return result_dict["outputs"]

output = query_llm(
   url="localhost:8000",
   model_name="swallow-7b-bf16",
   prompts=["以下に、あるタスクを説明する指示があり、それに付随する入力が更なる文脈を提供しています。リクエストを適切に完了するための回答を記述してください。\n\n### 指示:\n与えられた選択肢の中から、最適な答えを選んでください。出力は以下から選択してください：\n- 掲示板\n- パソコン\n- マザーボード\n- ハードディスク\n- まな板\n### 入力:\n電子機器で使用される最も主要な電子回路基板の事をなんと言う？\n### 応答:"],
   max_output_token=10,
   top_k=1,
   top_p=0.0,
   temperature=1.0,
)
print(output)

以下のようにサーバーからの応答があり、TensorRT-LLM ベースの推論を行う事が出来ました。

[['マザーボード']]

補足情報

NeMo P-Tuning ファイルを用いた推論

PEFT の結果を用いた推論を行いたい場合は、export_to_trt.py スクリプトを使用して TensorRT-LLM 推論用エンジンを生成する際に、以下の例のように --ptuning_nemo_checkpoint オプションを用いて PEFT 結果ファイルの保存パスを指定してエンジンを生成する必要があります。

python /opt/NeMo/scripts/export/export_to_trt.py \
    --ptuning_nemo_checkpoint /workspace/results/swallow_7b_instruct_ptuning/checkpoints/swallow_7b_instruct_ptuning.nemo \
    --nemo_checkpoint /workspace/models/Swallow-7b-instruct-hf/Swallow-7b-instruct-hf.nemo \
    --model_type="llama" \
    --model_repository /opt/checkpoints/trtllm-engine

生成に成功すると、/opt/checkpoints/trtllm-engine に TensorRT-LLM エンジンと共に __prompt_embeddings__.npy が作成されます。その後、deploy_triton.py を用いて Triton Inference Server を起動して下さい。

推論性能測定

NeMo Framework には Triton Inference Server の推論性能を計測する為の便利なスクリプトが付属されています。以下のコマンドでバッチサイズ毎のスループットとレイテンシを測定する事が可能です。-mil で最大インプット長、-mol で最大アウトプット長、-mbs で最大バッチサイズを指定して計測が可能です。本スクリプトで現状の性能を確認し、それを参考に Triton Inference Server の設定を変更したり、更なる最適化を行ってください。

python /opt/NeMo/scripts/deploy/benchmark.py \
    -tlf /opt/checkpoints/trtllm-engine/ \
    --nemo_checkpoint /workspace/models/Swallow-7b-instruct-hf/Swallow-7b-instruct-hf.nemo \
    --model_type="llama" \
    --triton_model_name swallow-7b-bf16 \
    -ng 1 -mil 150 -mol 20 -mbs 10 -nr 30

まとめ

本記事では、NeMo Framework を使用した日本語 LLM 推論の方法について紹介しました。NeMo Framework を使用して LLM の開発が加速すると嬉しいです。

NeMo Framework で日本語 LLM を簡単デプロイ – オンライン推論編 –

NeMo Framework とは

TensorRT-LLM とは

GPU 推論チュートリアル

事前準備

NeMo Framework Training コンテナーの起動

Hugging Face Model Hub からモデルのダウンロード

NeMo フォーマットへの変換

(オプション) PEFT の実行

NeMo Framework Inference コンテナーの起動

TensorRT-LLM 推論用エンジンの生成

Triton Inference Server の起動

クライアントから推論リクエストを行う

補足情報

NeMo P-Tuning ファイルを用いた推論

推論性能測定

まとめ

関連情報

Tags

About the Authors

NeMo Framework で日本語 LLM を簡単デプロイ – オンライン推論編 –

NeMo Framework とは

TensorRT-LLM とは

GPU 推論チュートリアル

事前準備

NeMo Framework Training コンテナーの起動

Hugging Face Model Hub からモデルのダウンロード

NeMo フォーマットへの変換

(オプション) PEFT の実行

NeMo Framework Inference コンテナーの起動

TensorRT-LLM 推論用エンジンの生成

Triton Inference Server の起動

クライアントから推論リクエストを行う

補足情報

NeMo P-Tuning ファイルを用いた推論

推論性能測定

まとめ

関連情報

Tags

About the Authors

関連記事

NVIDIA GB200 NVL72 は兆単位パラメーターの LLM トレーニングとリアルタイム推論を実現

cuDLA による NVIDIA Jetson Orin 上での YOLOv5 の紹介

NeMo Framework で日本語 LLM をファインチューニング - PEFT 編 -

TAO Toolkit 5.0 に追加された Data Service を活用

前編: Stable Diffusion を TensorRT で GPU 推論を数倍高速化