
Spotlight: NAVER Place Optimizes SLM-Based Vertical Services with NVIDIA TensorRT-LLM

NAVER is a popular South Korean search engine company that offers NAVER Place, a geo-based service that provides detailed information about millions of businesses and points of interest across Korea. Users can search for places, leave reviews, and make bookings or orders in real time.

NAVER Place vertical services are built on small language models (SLMs) to improve usability, specialized for the Place, Map, and Travel domains. This post shares insights into how NVIDIA and NAVER optimized SLM inference performance using NVIDIA TensorRT-LLM to enable the SLM-based vertical services on NVIDIA Triton Inference Server. To learn more about how NAVER uses AI, see Introduction to NAVER Place AI Development Team.

Small language models for NAVER Place reviews

SLMs are AI models capable of understanding natural language with fewer parameters compared to large language models (LLMs). SLMs are known to work well with less memory and computational power when they are properly fine-tuned to specific domain tasks. 

NAVER Place uses SLMs tailored with its in-house dataset to provide a review summary (created from the reviews NAVER Place users have left) or a microreview explaining what each place is like.

Screenshot of NAVER Place service app showing review summary and microreview examples.
Figure 1. Review summary and microreview examples from the NAVER Place service app. Image courtesy of NAVER

Matching visits with places of interest using an SLM transformer decoder

NAVER Place collects receipts and payment histories from its registered places to show the visits and reviews of each place on the NAVER Map. To do so, NAVER Place provides a system that matches visits with places of interest (POIs). The system also discovers new POIs from blog posts, or checks for duplicate POIs to ensure data integrity and enhance the service quality. 

Screenshots of the NAVER Place POI matching service UI.
Figure 2. The NAVER Place POI matching service UI is enabled by an SLM transformer decoder. Image courtesy of NAVER

Adopting NVIDIA TensorRT-LLM for superior inference performance

NVIDIA TensorRT-LLM accelerates and optimizes inference performance for LLMs on NVIDIA GPUs. It supports in-flight batching to maximize throughput and uses memory optimization methods for autoregressive models, such as paged KV cache and chunked context, to enhance memory efficiency.

NAVER Place adopted TensorRT-LLM, as it outperforms other LLM inference solutions in throughput, time to first token (TTFT), and time per output token (TPOT). TensorRT-LLM consistently delivers superior performance across various input lengths and output token scenarios.

Figure 3 compares the throughput of a popular alternative open-source LLM inference library and TensorRT-LLM for various input and output token lengths on NVIDIA A100 and NVIDIA H100 GPUs, measured using Qwen models.

Four charts showing QPS Comparison of TensorRT-LLM and a popular alternative library across different operation modes.
Figure 3. QPS Comparison of TensorRT-LLM and a popular alternative library across different operation modes. Image courtesy of NAVER

TensorRT-LLM performs better than the alternative library in all four operation modes: decode-prefill light, prefill heavy, decode heavy, and decode-prefill heavy. Among them, the decode heavy mode, which is typical of SLM workloads, shows the strongest results. In addition, because TensorRT-LLM provides kernels optimized for the latest GPUs, it achieves especially high performance on the NVIDIA Hopper architecture.

To understand how to evaluate performance using TensorRT-LLM, reference the performance overview available in the NVIDIA/TensorRT-LLM GitHub repo. To learn more about the basic tuning techniques for building TensorRT-LLM engines, see Best Practices for Tuning the Performance of TensorRT-LLM.

Inference optimization: A trade-off between throughput and latency

This section explores strategies for balancing throughput and latency in LLM inference in terms of batch size and memory optimization techniques, such as paged KV cache and in-flight batching.

Batch size

An LLM inference server processes requests in batches to maximize throughput, which in turn raises latency. This trade-off means that, while a larger batch size can deliver higher throughput, it may also increase response times, necessitating a careful balance between efficiency and user experience (Figure 4). By tuning the batch size according to your target TTFT and TPOT, you can align the system's performance with your specific service requirements.

Four graphs showing throughput and time per output token according to batch size.
Figure 4. Throughput and time per output token according to batch size. Image courtesy of NAVER
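As a rough illustration of this tuning loop, the following Python sketch measures TTFT and TPOT from a streamed response and reports throughput for candidate batch sizes. This is not NAVER's benchmark code; send_request and run_batch are hypothetical client helpers standing in for whatever streaming interface your serving stack exposes.

import time

def measure_request(send_request, prompt, max_tokens):
    # Measures TTFT and TPOT for one streamed request.
    # send_request is a hypothetical client call that yields tokens as they arrive;
    # this assumes at least one token is produced.
    start = time.perf_counter()
    token_times = []
    for _ in send_request(prompt, max_tokens=max_tokens):
        token_times.append(time.perf_counter())
    ttft = token_times[0] - start  # time to first token
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)  # time per output token
    return ttft, tpot

def sweep_batch_sizes(run_batch, prompts, batch_sizes):
    # Reports requests per second for each candidate batch size.
    # run_batch is a hypothetical helper that submits the given requests
    # concurrently and blocks until all of them finish.
    for batch_size in batch_sizes:
        batch = prompts[:batch_size]
        start = time.perf_counter()
        run_batch(batch)
        elapsed = time.perf_counter() - start
        print(f"batch_size={batch_size}: {len(batch) / elapsed:.2f} req/s")

Comparing the measured TTFT and TPOT against your latency targets indicates how far the batch size can be raised before responsiveness becomes unacceptable.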

Paged KV cache and in-flight batching

TensorRT-LLM includes the paged KV cache option enabled by default, enhancing memory efficiency and increasing the upper-bound batch size to accommodate tasks requiring low latency as well as those demanding high throughput. This default setting ensures the model can scale smoothly to handle both latency-sensitive, real-time requests and bulk-processing scenarios that demand higher throughput, providing a more flexible and robust solution.

In-flight batching is also enabled by default in TensorRT-LLM, which can boost throughput. For most tasks, the NAVER team uses these two options as default.

One exception is when a service requires extremely low latency with a relatively small model on older GPUs. NAVER Place had such a case, which performed better with both options turned off: POI matching, where requests must be processed in real time with minimal latency. With a relatively small 1.3-billion-parameter model and the comparatively older NVIDIA T4 GPU architecture, the service required a batch size of 1 to achieve minimal latency and the batching option to be turned off.

In addition, using the small 1.3-billion-parameter model with a batch size of 1 resulted in paging overhead that outweighed compute overhead, leading to increased latency and reduced QPS. To deal with this, the team adopted a contiguous KV cache instead of the paged KV cache, because memory overhead is less of a concern under these conditions. This choice enabled us to meet strict real-time requirements for use cases such as POI matching.

Precision | Paged KV cache | Cache blocks | Input/output | Max batch_size | QPS | Latency (sec)
FP16 | On | 7,110 | 500/5 | 1 | 6.49 | 0.154
FP16 | Off | 7,110 | 500/5 | 1 | 8.39 | 0.119
Table 1. QPS and latency improve when paged KV cache is disabled under this specific condition: an older GPU architecture with a small model and a batch size of 1

While POI matching requires minimal latency for its real-time service, it also demands high throughput for background matching. For this reason, we currently use different build options for each, as shown below: the first configuration (batch size of 1, paged KV cache disabled) serves the real-time path, and the second (batch size of 8, paged KV cache enabled) serves background matching.

"build_config": {
        "max_input_len": 512,
        "max_output_len": 32,
        "max_batch_size": 1,
        "max_beam_width": 1,
        "max_num_tokens": 4096,
        ...
        "plugin_config": {
            ...
            "paged_kv_cache": false,
            ...
        }
    }
 "build_config": {
        "max_input_len": 512,
        "max_output_len": 32,
        "max_batch_size": 8,
        "max_beam_width": 1,
        "max_num_tokens": 4096,
        ...
        "plugin_config": {
            ...


            "paged_kv_cache": true,
            ...
        }
    }

Inference optimization: Downstream caching

This section explores optimization strategies that leverage caching techniques to streamline downstream inference tasks. We examine how prefix and response caching can help reduce redundant computations and enhance overall efficiency.

Prefix caching

Because the prompts generated by downstream tasks have a common prefix, calculating the entire prefill for every request would be a waste of resources. To avoid this, TensorRT-LLM offers prefix caching, which can significantly reduce memory usage and computational load. For more details, see how to enable KV cache reuse in the NVIDIA/TensorRT-LLM GitHub repo.
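As a hedged sketch only (not NAVER's production setup), recent TensorRT-LLM releases expose KV cache reuse through the high-level LLM API; option names vary between versions, so check the linked documentation for the exact flags your release expects. Older engine-based workflows instead enable reuse at engine build time and in the Triton backend configuration. The model name and prompts below are placeholders.

# Illustrative only: enable KV cache block reuse so that requests sharing a
# common prompt prefix (for example, a shared system prompt) reuse its
# prefill computation. Option names may differ across TensorRT-LLM versions.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="Qwen/Qwen2-1.5B-Instruct",  # placeholder model
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)

shared_prefix = "You are an assistant that summarizes place reviews.\n"
prompts = [shared_prefix + review for review in ["Review A ...", "Review B ..."]]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))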

This approach can considerably improve TTFT and is helpful for tasks with long inputs, shared system prompts, and short outputs. It works especially well for microreviews, because generating one microreview takes 40 multistep inferences on average, with each step sharing prefixes.

However, prefix caching may be less effective for tasks involving highly diverse system prompts: hit rates drop and cache management overhead grows, because cached entries are evicted using a least recently used (LRU) strategy.

Response caching

Response caching, a feature of NVIDIA Triton Inference Server, helps avoid inefficient, redundant inference. Triton looks up the response cache using a hash of the inference request that includes the model name, model version, and model inputs. Response caching works well except in cases where re-inference is intentionally required, such as multinomial sampling decoding. For POI matching served in real time, four to five cache hits occur per second, reducing the computational load by 17%. For more details, see the Triton Response Cache documentation.
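For reference, and not as NAVER's exact setup, enabling the response cache involves two pieces: allocating a cache when the server starts and opting in per model in config.pbtxt, roughly as follows (the cache size shown is an arbitrary example).

# Start Triton with a local response cache (size in bytes is illustrative).
tritonserver --model-repository=/models --cache-config local,size=1048576

# config.pbtxt of the model whose responses should be cached:
response_cache {
  enable: true
}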

Chart of response cache hits for the POI matching service.
Figure 5. Response cache hits for the POI matching service. Image courtesy of NAVER

Serving TensorRT-LLM with Triton

An SLM engine built with TensorRT-LLM is served on Triton Inference Server. Triton provides features such as ensemble models and Business Logic Scripting (BLS) for composing pipelines that include, for example, tokenization, postprocessing, and multistep inference. NAVER Place chose BLS because it provides the flexibility needed for its specific use cases. This section covers how NAVER Place maximized the advantages and usability of Triton BLS.

Improve usability with a well-defined request/response schema

Triton models exchange data in pb_tensor format. The BLS structure, chosen for communication efficiency and LLM inference optimization, contains preprocessing and postprocessing code, which requires converting data from pb_tensor to NumPy arrays and back to pb_tensor.
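For context, this round trip looks roughly like the following inside a Triton Python backend model; the helper names are illustrative, but the pb_utils calls are the standard Python backend utilities.

import numpy as np
import triton_python_backend_utils as pb_utils  # provided inside Triton's Python backend

def pb_to_numpy(request, input_name):
    # pb_tensor -> NumPy array, so preprocessing can run on plain arrays.
    return pb_utils.get_input_tensor_by_name(request, input_name).as_numpy()

def numpy_to_pb(output_name, array):
    # NumPy array -> pb_tensor, so postprocessed results can be returned to Triton.
    return pb_utils.Tensor(output_name, np.asarray(array))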

This process has two difficulties. First, if the IO data of each model is not validated, debugging is challenging because invalid data formats or missing required fields surface only at runtime. Second, the code becomes more complex as preprocessing and postprocessing are combined in BLS, making it harder to extend and maintain when there are call dependencies between models.

These challenges emphasize the need for a well-defined request/response schema to reduce runtime errors and streamline code management, particularly when multiple models have to be chained together. In addition, keeping data formats consistent across the entire pipeline significantly alleviates debugging difficulties and ensures smoother integration. For example, POI matching goes through a complicated pipeline, as highlighted in Figure 6.

Diagram of POI matching inference pipeline in BLS.
Figure 6. POI matching inference pipeline in BLS. Image courtesy of NAVER

To overcome these difficulties, the NAVER Place team came up with the following approaches.

Standardizing IO schema management

We defined IO schemas using Pydantic models, which makes data validation easier. This helps ensure structural consistency across the requests and responses of all Triton models. By employing a well-defined schema at this stage, developers can detect data issues early and maintain consistent data structures throughout the inference pipeline, ultimately reducing debugging overhead and improving overall reliability.

For example, we defined a class named BlsRequest to manage the input data format of Triton requests and perform data validation, as shown in the following code example:

# NOTE: Because Triton uses pb_tensor and NumPy objects,
# it is required to declaratively manage the fields that are not defined as Python default types.
# For this, we added the json_schema_extra field of Pydantic to explicitly manage Triton data types.
from typing import List, Optional

from pydantic import Field, root_validator

class BlsRequest(TritonFieldModel):  # TritonFieldModel is the project's Pydantic base model
    name: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    subname: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    biznum: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    address: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    tel: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})

    @root_validator(pre=True)
    def check_all_fields_empty(cls, values):
        # Reject requests in which every field is empty.
        if not any(bool(v) for v in values.values()):
            raise ValueError("All fields cannot be empty")
        return values

Modularize IO type conversion by model

We encapsulated the IO data conversion process for each model and created a common function for converting between pb_tensor and Pydantic objects in the base Triton Python model. This lets models be called in a consistent way, without worrying about the internal data conversion process.

The following code example shows a function that receives a Pydantic request object, converts it to Triton pb_tensors, runs model inference, and returns the result as a Pydantic response object:

def _infer_model(self, request, model_name, request_model, response_model_cls, **infer_kwargs):
    # Converts the Pydantic request to Triton pb_tensors.
    pb_tensors = self.convert_pydantic_to_pb_tensors(request_model)
    # Runs model inference.
    infer_request = pb_utils.InferenceRequest(
        model_name=model_name,
        inputs=pb_tensors,
        requested_output_names=response_model_cls.get_field_names(),
        **infer_kwargs,
    )
    infer_response = infer_request.exec()
    # Converts the Triton response (pb_tensors) to a Pydantic response.
    return self.convert_pb_tensors_to_pydantic(infer_response, response_model_cls)

The following code example uses _infer_model to call a model. You simply declare the GeneratorRequest and GeneratorResponse classes, without worrying about the data conversion or model invocation details.

def infer_generator(self, request, text_input, max_tokens):
    response_model_cls = schema.GeneratorResponse
    request_model = schema.GeneratorRequest(text_input=text_input, max_tokens=max_tokens)
    return self._infer_model(
        request=request,
        model_name="generator_bls",
        request_model=request_model,
        response_model_cls=response_model_cls,
    )

Modularize the BLS business logic and enhance testability

The NAVER team modularized the business logic and the preprocessing and postprocessing code in BLS in the following ways to reduce coupling. This makes the code less complex and enhances testability and maintainability.

  • Modularize preprocessing and postprocessing and introduce unit testing
    • Modularized the business logic for model training and the preprocessing and postprocessing code to make them reusable.
    • Designed the test code to run independently in the Python runtime, even without the Triton runtime, enabling validation of preprocessing and postprocessing for each model (see the sketch after this list).
  • Redefine the roles of BLS
    • BLS is responsible only for model invocation and end-to-end testing. This keeps the system scalable and minimizes the impact on the BLS code when new requirements are added.
  • Introduce CI
    • Created a CI test pipeline for the business logic and the preprocessing and postprocessing code. This helps quickly verify that changes made during model training do not affect serving. Integrating these tests into a CI pipeline enables earlier issue detection and quicker resolution, ensuring stable updates without disrupting the serving process.
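For example, a minimal pytest-style unit test along these lines can validate postprocessing logic in plain Python, with no Triton server involved; postprocess_microreview is a hypothetical function standing in for the team's actual postprocessing code.

def postprocess_microreview(raw_output: str) -> str:
    # Hypothetical postprocessing: strip generation artifacts and extra whitespace.
    return raw_output.replace("</s>", "").strip()

def test_postprocess_strips_eos_and_whitespace():
    # Runs under pytest (or plain Python) without any Triton runtime.
    assert postprocess_microreview(" Cozy cafe, great espresso.</s> ") == "Cozy cafe, great espresso."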

Using this approach, we achieved our goals of stronger data validation, better code maintainability, and higher development productivity in Triton-based LLM serving.

Summary

NAVER Place has successfully optimized LLM engines using NVIDIA TensorRT-LLM and improved the usability of NVIDIA Triton Inference Server. Through this optimization, the team also maximized GPU utilization, further enhancing the overall system efficiency. The entire process has helped to optimize multiple SLM-based vertical services, making NAVER Place more user-friendly. Building on this experience, we will continue to develop additional vertical models and apply them to our services.

Get started with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server.
