# Deploying AI Coding Assistants with NVIDIA TensorRT-LLM and NVIDIA Triton

AI coding assistants, or code LLMs, have emerged as one domain helping to accomplish this. By 2025, 80% of the product development lifecycle will make use of generative AI for code generation, with developers acting as validators and orchestrators of backend and frontend components and integrations. You can tune an LLM for code tasks to streamline developer workflows and lower the barrier for novice programmers. Code LLMs not only generate code; they can also fill in missing code, add documentation, and provide hints for solving hard problems.

## Deploying Your Own AI Coding Assistant

To follow along, you will need:

• A registered Hugging Face user account and basic familiarity with the Transformers library
• Python
• The NVIDIA TensorRT-LLM optimization library
• NVIDIA Triton with the TensorRT-LLM backend

## Prompting a Code LLM

Write a function that computes the square root.

```python
# Use Newton's method,
# where x_(n+1) = 1/2 * (x_n + (y/x_n))
# y = number to find square root of
# x_0 = first guess
# epsilon = how close is close enough?
# Implement this in a function called newton_sqrt that has three parameters
# and returns one value.
```
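
For reference, an implementation matching that comment spec might look like the following. This is a minimal sketch; the function name `newton_sqrt` and its three parameters come from the model's comment above, while the loop body is one straightforward reading of the described method.

```python
def newton_sqrt(y, x0, epsilon):
    """Approximate sqrt(y) with Newton's method, starting from guess x0.

    Iterates x_(n+1) = 1/2 * (x_n + y/x_n) until x*x is within
    epsilon of y.
    """
    x = x0
    while abs(x * x - y) > epsilon:
        x = 0.5 * (x + y / x)
    return x

print(newton_sqrt(2.0, 1.0, 1e-10))  # ~1.41421356
```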

Write Python code to compute the square root and print the result.

```python
# To find the square root of a number in Python, you can use the math library and its sqrt function:

from math import sqrt

number = float(input('Enter a number: '))
square_root = sqrt(number)
print(f'The square root of {number} is approximately {square_root:.2f}.')
```

## Prompt Engineering for Code LLMs

### Adding Example Output

Write a function for calculating average water use per household.

Write a function for calculating average water use per household. Example output: ["Smith", 20, "Lincoln", 30, "Simpson", 1500]

Write a function for calculating average water use per household. Example input: [['Last Name', 'Dishwasher', 'Shower', 'Yard'], ['Smith', 39, 52, 5], ['Lincoln', 25, 77, 8], ['Simpson', 28, 20, 0]] Example output: ["Smith", 20, "Lincoln", 30, "Simpson", 1500]
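
A minimal sketch of a function such a prompt could yield is shown below. The averaging scheme (mean gallons across the three use categories) is an assumption for illustration, since the prompt does not define one, and the numbers differ from the prompt's illustrative example output.

```python
def average_water_use(rows):
    """Average gallons across use categories for each household.

    rows[0] is a header row; each following row is
    [last_name, dishwasher_gallons, shower_gallons, yard_gallons].
    """
    header, *records = rows
    return [(name, round(sum(uses) / len(uses), 2)) for name, *uses in records]

data = [['Last Name', 'Dishwasher', 'Shower', 'Yard'],
        ['Smith', 39, 52, 5],
        ['Lincoln', 25, 77, 8],
        ['Simpson', 28, 20, 0]]
print(average_water_use(data))
# [('Smith', 32.0), ('Lincoln', 36.67), ('Simpson', 16.0)]
```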

### Experiment

• Write a function for calculating average water use per household.
• Write a function for calculating average water use per household. Add penalties for dishwasher time.
• Write a Python function for calculating average water use per household. Add penalties for dishwasher time. Input is family name, address, and number of gallons per specific use.

## Setting Up and Building TensorRT-LLM

```shell
git lfs install
git clone -b release/0.7.1 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
make -C docker release_build
```

## Retrieving the Model Weights

```shell
cd examples
git clone https://huggingface.co/bigcode/starcoder
```

## Converting the Model Weight Format

```shell
# Launch the TensorRT-LLM container
make -C docker release_run LOCAL_USER=1

cd examples/gpt
python3 hf_gpt_convert.py -p 8 --model starcoder -i ../starcoder -o ./c-model/starcoder --tensor-parallelism 1 --storage-type float16
```

## Compiling the Model

```shell
python3 examples/gpt/build.py \
    --model_dir examples/gpt/c-model/starcoder/1-gpu \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --output_dir examples/gpt/out
```

## Running the Model

TensorRT-LLM includes a highly optimized C++ runtime for executing built LLM engines and managing processes such as sampling tokens from the model output, managing the KV cache, and batching requests together. You can use this runtime directly to execute the model locally.

```shell
python3 examples/gpt/run.py --engine_dir=examples/gpt/out --max_output_len 100 \
    --tokenizer examples/starcoder/starcoder \
    --input_text "Write a function that computes the square root."
```

```shell
python3 examples/gpt/run.py --engine_dir=examples/gpt/out --max_output_len 100 \
    --tokenizer examples/starcoder/starcoder \
    --input_text "X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.1) # Train a logistic regression model, predict the labels on the test set and compute the accuracy score"
```

## Deploying the Model on NVIDIA Triton

The Triton model repository used here contains the following components:

• /preprocessing and /postprocessing: Triton backends for Python that convert between text strings and the token IDs the model operates on, tokenizing the text input and de-tokenizing the model output.
• /tensorrt_llm: a subfolder that stores the model engine you compiled earlier.
• /ensemble: defines a model ensemble that connects the first three components and tells Triton how to flow data through them.

```shell
# After exiting the TensorRT-LLM docker container
cd ..
git clone -b release/0.7.1 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../TensorRT-LLM/examples/gpt/out/* all_models/inflight_batcher_llm/tensorrt_llm/1/
```

```shell
python3 tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    decoupled_mode:true,engine_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1,\
max_tokens_in_paged_kv_cache:,batch_scheduler_policy:guaranteed_completion,kv_cache_free_gpu_mem_fraction:0.2,\
max_num_sequences:4
```

```shell
python3 tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_type:auto,tokenizer_dir:/all_models/starcoder

python3 tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_type:auto,tokenizer_dir:/all_models/starcoder
```

## Launching Triton

```shell
docker run -it --rm --gpus all --network host --shm-size=1g \
    -v $(pwd)/all_models:/all_models \
    -v $(pwd)/scripts:/opt/scripts \
    nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

# Log in to huggingface-cli to get tokenizer
huggingface-cli login --token *****

# Install python dependencies
pip install sentencepiece protobuf

# Launch Server
python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 1
```

## Sending Requests

```shell
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "write in python code that plots in a image circles with different radiuses",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'
```
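
For scripted use, an equivalent request can be sent from Python with only the standard library. This is a sketch: the URL and JSON body mirror the curl call above, and reading a `text_output` field from the response assumes the ensemble's default output naming.

```python
import json
import urllib.request

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

def build_generate_payload(prompt, max_tokens=100):
    """Build the JSON body matching the curl example."""
    return {
        "text_input": prompt,
        "parameters": {
            "max_tokens": max_tokens,
            "bad_words": [""],
            "stop_words": [""],
        },
    }

def generate(prompt, max_tokens=100):
    """POST a generation request to a running Triton server."""
    body = json.dumps(build_generate_payload(prompt, max_tokens)).encode()
    req = urllib.request.Request(
        TRITON_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server from the previous section running:
# print(generate("Write a function that computes the square root.")["text_output"])
```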

Triton responds with a completion similar to the following:

```python
import numpy as np
import matplotlib.pyplot as plt

# Fixing random state for reproducibility
np.random.seed(19680801)

N = 100
r0 = 0.6
x = np.random.rand(N)
y = np.random.rand(N)
area = np.pi * (10 * np.random.rand(N))**2  # 0 to 10 point radiuses
c = np.random.rand(N)

plt.scatter(x, y, s=area, c=c, alpha=0.5)
plt.show()
```

## Getting Started

NVIDIA TensorRT-LLM and NVIDIA Triton provide baseline support for many popular LLM architectures, making it easy to deploy, optimize, and run a variety of code LLMs. To get started, download and set up the open-source library from the NVIDIA/TensorRT-LLM GitHub repository and try out the different LLM examples. You can also download the StarCoder model and follow the steps in this post for hands-on practice.