Maximum Performance and Minimum Footprint for AI Apps with NVIDIA TensorRT Weight-Stripped Engines

Jun 11, 2024

By Gunjan Mehta, Xiaodong Huang, Michal Guzek and Maximilian Müller

NVIDIA TensorRT, an established inference library for data centers, has rapidly emerged as a desirable inference backend for NVIDIA GeForce RTX and NVIDIA RTX GPUs. Now, deploying TensorRT into apps has gotten even easier with prebuilt TensorRT engines.

The newly released TensorRT 10.0 with weight-stripped engines offers a unique solution for minimizing the engine shipment size by reducing it to just the execution code, achieving >95% engine size compression.

In this post, we discuss how weight-stripped engines are built and how they can be refitted with weights directly on the end-user device using the TensorRT 40MB lean runtime.

What is a weight-stripped engine?

Introduced in TensorRT 10.0, weight-stripped engines contain execution code (CUDA kernels) without weights. Enabling weight-stripping during the build phase results in engines that are over 95% smaller than traditional ones, with only essential weights retained for performance optimizations.

These engines support ONNX models and other network definitions, and offer benefits similar to refittable engines by allowing weight changes without rebuilding. During deserialization, they can be refitted with the original weights from the model directly on the end-user device, with minimal latency and without affecting inference performance.

Why is weight-stripping necessary?

Traditionally, TensorRT engines included all weights of a network, resulting in redundant weights across hardware-specific engines. This necessitated shipping prebuilt weightful engines for the entire install base unless building engines directly on end-user devices.

Consider a scenario where a workstation AI app serves an installed base with M discrete GPU SKUs, creating P engines for various optimization profiles for each of N DL models. Each model's weights are then duplicated across its M*P engines in the final application binary, which can be avoided by using weight-stripped engines.
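To make the duplication concrete, the following back-of-the-envelope sketch compares the payload for one model shipped as weightful engines against the same model shipped as weight-stripped engines plus a single copy of the weights. The function name and the example sizes are hypothetical and only illustrate the arithmetic; they are not measurements.

```python
def shipped_size_mb(weights_mb, code_mb, num_gpu_skus, num_profiles, stripped):
    """Approximate on-disk payload, in MB, for one model.

    Assumes M * P engines per model (M GPU SKUs, P engines per SKU).
    """
    num_engines = num_gpu_skus * num_profiles
    if stripped:
        # One copy of the weights (for example, the ONNX file) plus code-only engines.
        return weights_mb + num_engines * code_mb
    # Every weightful engine embeds its own copy of the weights.
    return num_engines * (weights_mb + code_mb)


# Hypothetical model: 1,000 MB of weights, 10 MB of kernels, M = 4 SKUs, P = 2 engines per SKU.
full = shipped_size_mb(1000, 10, 4, 2, stripped=False)
lean = shipped_size_mb(1000, 10, 4, 2, stripped=True)
print(full, lean)
```

In this simplified model, the weightful payload grows with M*P copies of the weights, while the weight-stripped payload grows only with the much smaller execution code.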

Weight-stripped engines achieve over 95% engine size compression for CNN and LLM use cases, enabling the packaging of more AI functionality without growing the app size. These engines are version-compatible across TensorRT minor updates and can use the lean runtime of ~40 MB when built with the kVERSION_COMPATIBLE flag.

NVIDIA TensorRT Cloud, currently in early access for select partners, also offers the option to build weight-stripped engines on various NVIDIA GPUs. Support for building and refitting weight-stripped NVIDIA TensorRT-LLM engines is coming soon.

Building a weight-stripped engine

When building a weight-stripped engine locally, the TensorRT builder still requires the model weights for optimization decisions, ensuring consistent performance when refitted later compared to a normal or weightful engine.

Using real weights during building enables TensorRT to optimize computations by constant folding static nodes and introducing fusion optimizations, for example, embedding GELU literals directly into a single CUDA kernel when they are not marked as refittable during engine building.

TensorRT Cloud also facilitates the creation of weight-stripped engines from ONNX models.

For more information, see the Get started section later in this post.

Deploying a weight-stripped engine

If your app ships weight-stripped engines, you can easily refit them with weights from the ONNX file on the end-user device, within seconds. After being serialized back, refitted engines eliminate recurring refit costs, while maintaining the quick deserialization efficiency that TensorRT is known for.

You can use the refit functionality without needing the entire TensorRT builder by shipping the lean runtime (~40 MB in TensorRT 10.0). Refitted weight-stripped engines retain version forward compatibility (VFC) and hardware forward compatibility (HFC) benefits, enabling them to run on next-generation GPUs without app updates.

Case study

We achieved >99% compression across the combined SDXL engines in demoDiffusion on an NVIDIA GeForce RTX 4090 GPU (Table 1).

SDXL fp16	Full engine size (MB)	Weight-stripped engine size (MB)
clip	237.51	4.37
clip2	1329.40	8.28
unet	6493.25	58.19

Table 1. Compression comparison for SDXL fp16
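The compression figures can be reproduced from the table with a few lines of arithmetic. The sketch below copies the sizes from Table 1; the helper name is hypothetical. Computed this way, the combined figure exceeds 99%, while the small clip engine on its own lands slightly below that.

```python
# (full engine MB, weight-stripped engine MB), copied from Table 1.
SDXL_FP16 = {
    "clip": (237.51, 4.37),
    "clip2": (1329.40, 8.28),
    "unet": (6493.25, 58.19),
}


def compression_percent(full_mb, stripped_mb):
    """Percentage of the full engine size removed by weight stripping."""
    return 100.0 * (1.0 - stripped_mb / full_mb)


for name, (full_mb, stripped_mb) in SDXL_FP16.items():
    print(f"{name}: {compression_percent(full_mb, stripped_mb):.2f}%")

# Aggregate over all three engines that make up the pipeline.
total_full = sum(full_mb for full_mb, _ in SDXL_FP16.values())
total_stripped = sum(stripped_mb for _, stripped_mb in SDXL_FP16.values())
print(f"all engines: {compression_percent(total_full, total_stripped):.2f}%")
```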

While the support for weight-stripped TensorRT-LLM engines is coming soon, here are some measurements from an internal build on an NVIDIA GeForce RTX 4090 GPU, achieving >99% compression on several LLMs (Table 2).

Model	Full engine size (MB)	Weight-stripped engine size (MB)
chatglm-6b int4	3831.09	4.50
chatglm-6b int8	6470.10	34.12
gemma2.5b int4	2987.81	6.62
gemma2.5b int8	3905.64	26.03
gemma7b int4	6826.14	7.86
gemma7b int8	10409.58	54.07
llama2-7b fp16	12856.44	4.04
llama2-7b int4	3691.12	5.41
llama2-7b int8	6683.72	41.54
mistral fp16	13831.44	4.06
mistral int4	3954.45	8.12
mistral int8	7178.73	41.91
phi2 fp16	5304.53	4.04

Table 2. Compression comparison for LLMs

Even on a data center GPU such as NVIDIA H100, you can see similar compression on TensorRT-LLM Llama models (Table 3).

Model	Full engine size (MB)	Weight-stripped engine size (MB)
llama-7b fp16 + WoQ int8	6704.55	28.69
llama2-70b fp8 + TP=2	66341.72	60.61

Table 3. Compression comparison for Llama models

Get started

In TensorRT 10.0, a new flag, kREFIT_IDENTICAL, optimizes the builder under the assumption that the engine will be refitted with identical weights. When used with kSTRIP_PLAN, it minimizes the engine size to the best extent possible since the use of all identical weights during build-time and refit-time ensures maximum performance.

A new serialization flag, SerializationFlag::kEXCLUDE_WEIGHTS, controls whether weights are written when an engine is serialized. When the flag is unset during the refit process, the refitted weight-stripped engine is permanently saved to disk with its weights, avoiding recurring refit costs during future app launches. When set, the serialized engine stays weight-stripped, which enables continued refitting with the refit API.
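Serialization flags are combined into an integer bitmask, so setting or clearing one is a matter of bit arithmetic. The sketch below uses a stand-in enum rather than the real trt.SerializationFlag, and its bit positions and second member are illustrative assumptions, to show the set, clear, and test operations that the serialization step later in this post relies on.

```python
from enum import IntEnum


class SerializationFlag(IntEnum):
    """Stand-in for trt.SerializationFlag; the bit positions are illustrative."""
    EXCLUDE_WEIGHTS = 0
    OTHER_FLAG = 1  # hypothetical second flag, only here to show bits are independent


def with_flag(flags, flag):
    """Return the bitmask with the bit for flag set."""
    return flags | (1 << int(flag))


def without_flag(flags, flag):
    """Return the bitmask with the bit for flag cleared."""
    return flags & ~(1 << int(flag))


def has_flag(flags, flag):
    """Return True if the bit for flag is set in the bitmask."""
    return bool(flags & (1 << int(flag)))


flags = with_flag(0, SerializationFlag.EXCLUDE_WEIGHTS)         # plan would omit weights
flags = without_flag(flags, SerializationFlag.EXCLUDE_WEIGHTS)  # plan would include weights
print(has_flag(flags, SerializationFlag.EXCLUDE_WEIGHTS))
```

Clearing one bit leaves every other flag in the mask untouched, which is why the serialization code reads the current flags first instead of overwriting them.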

Here are instructions for the full weight-stripping and refit workflow:

Build the weight-stripped engine.
Refit the weight-stripped engine from INetworkDefinition.
Refit the weight-stripped engine from the ONNX parser.
Serialize as a full-weight engine.

Build the weight-stripped engine

Set the corresponding builder flag to enable the weight-stripped build.

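// C++: strip weights from the plan and assume identical weights at refit time.
config->setFlag(BuilderFlag::kSTRIP_PLAN);
config->setFlag(BuilderFlag::kREFIT_IDENTICAL);
auto plan = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));

# Python equivalent.
config.flags |= 1 << int(trt.BuilderFlag.STRIP_PLAN)
config.flags |= 1 << int(trt.BuilderFlag.REFIT_IDENTICAL)
plan = builder.build_serialized_network(network, config)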

After the engine is built, save the engine plan file and distribute it with an app installer.
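For context, here is a minimal sketch of how the build step might look end to end when the network comes from an ONNX file, including writing the plan to disk. The function name, the paths, and the optional version_compatible switch are assumptions made for illustration; error handling is deliberately basic, and TensorRT is imported inside the function so the module can be loaded on machines without it.

```python
from pathlib import Path


def build_weight_stripped_plan(onnx_path, plan_path, version_compatible=False):
    """Parse an ONNX file, build a weight-stripped engine, and save the plan."""
    import tensorrt as trt  # lazy import: requires a TensorRT 10.x installation

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)

    parser = trt.OnnxParser(network, logger)
    if not parser.parse_from_file(str(onnx_path)):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.STRIP_PLAN)
    config.set_flag(trt.BuilderFlag.REFIT_IDENTICAL)
    if version_compatible:
        # Needed if the engine should later be loaded with the lean runtime.
        config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)

    plan = builder.build_serialized_network(network, config)
    if plan is None:
        raise RuntimeError("engine build failed")

    out = Path(plan_path)
    out.write_bytes(plan)  # the serialized plan exposes the buffer protocol
    return out
```

A call such as build_weight_stripped_plan("model.onnx", "model.stripped.plan") would then produce the file that ships with the installer, with both file names being placeholders.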

Refit the weight-stripped engine

Depending on how your network was specified, through the TensorRT INetworkDefinition API or the ONNX parser, see the relevant refit instructions.

Refit the weight-stripped engine from INetworkDefinition

Create a refitter from the engine.

auto refitter = std::unique_ptr<nvinfer1::IRefitter>(
    nvinfer1::createInferRefitter(*engine, logger));

refitter = trt.Refitter(engine, TRT_LOGGER)

On the client, when you launch the network for the first time, update all the weights in the engine. Here, use getAllWeights, as all the weights in the engine plan were removed.

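// The first call, with an empty buffer, returns the number of refittable weights.
int32_t const nbWts = refitter->getAllWeights(0, nullptr);
std::vector<char const*> allWtsNames(nbWts);
// The second call fills the buffer with the weight names.
refitter->getAllWeights(nbWts, allWtsNames.data());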

all_weights_names = refitter.get_all_weights()

Update the weights one by one.

for (int32_t i = 0; i < nbWts; ++i)
    refitter->setNamedWeights(allWtsNames[i], Weights{...});

for name in all_weights_names:
    refitter.set_named_weights(name, trt.Weights(...))
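Setting the named weights only stages them; the engine is updated when refit_cuda_engine is called, as in the ONNX parser path. A minimal sketch of the complete loop with basic error checks might look like the following. The helper name and the weights_by_name mapping are hypothetical; the refitter argument is expected to behave like trt.Refitter, and the mapping values are whatever set_named_weights accepts, such as trt.Weights objects.

```python
def refit_named_weights(refitter, weights_by_name):
    """Supply every weight the engine expects, then apply the refit.

    refitter is expected to behave like trt.Refitter; weights_by_name maps
    each weight name to the weights for it (hypothetical container).
    """
    names = list(refitter.get_all_weights())

    # Fail early if the caller did not provide one of the stripped weights.
    missing = [name for name in names if name not in weights_by_name]
    if missing:
        raise KeyError(f"no weights provided for: {missing}")

    for name in names:
        if not refitter.set_named_weights(name, weights_by_name[name]):
            raise RuntimeError(f"engine rejected weights for {name!r}")

    # The staged weights take effect only when the refit is applied.
    if not refitter.refit_cuda_engine():
        raise RuntimeError("refit_cuda_engine() reported failure")
    return names
```

The buffers behind the supplied weights should stay alive until the refit call has returned, so keeping the mapping in scope for the duration of the function is the safer choice.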

Refit the weight-stripped engine from the ONNX parser

Create a refitter from the engine.

auto refitter = std::unique_ptr<nvinfer1::IRefitter>(
     nvinfer1::createInferRefitter(*engine, logger));

refitter = trt.Refitter(engine, TRT_LOGGER)

Create a refit parser.

auto parser_refitter = std::unique_ptr<nvonnxparser::IParserRefitter>(
    nvonnxparser::createParserRefitter(*refitter, logger));

parser_refitter = trt.OnnxParserRefitter(refitter, TRT_LOGGER)

Refit the engine from the original ONNX file.

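if (parser_refitter->refitFromFile(onnx_model_path.c_str())) {
    // Call refitCudaEngine() outside assert() so that it still runs when NDEBUG is defined.
    bool const refitted = refitter->refitCudaEngine();
    assert(refitted);
}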

parser_refitter.refit_from_file(onnx_model_path)
refitter.refit_cuda_engine()
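Both refit calls report success as a Boolean, so production code would typically check the results rather than ignore them. The following sketch wraps the two calls in a hypothetical helper that raises on failure; the refitter and parser_refitter arguments are expected to behave like trt.Refitter and trt.OnnxParserRefitter.

```python
def refit_engine_from_onnx(refitter, parser_refitter, onnx_model_path):
    """Load weights from the ONNX file and apply them, raising on failure.

    The function name is hypothetical; the two objects are the ones created
    in the previous steps.
    """
    # Step 1: read the weights from the ONNX file and stage them on the refitter.
    if not parser_refitter.refit_from_file(str(onnx_model_path)):
        raise RuntimeError(f"could not read weights from {onnx_model_path}")
    # Step 2: apply the staged weights to the engine.
    if not refitter.refit_cuda_engine():
        raise RuntimeError("refit_cuda_engine() reported failure")
```

Raising an exception keeps a failed refit from going unnoticed, which matters because an engine that was never refitted still has its weights stripped.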

Serialize as a full-weight engine

Save the full engine plan file.

// Clear kEXCLUDE_WEIGHTS so that the serialized plan includes the refitted weights.
auto serializationConfig = std::unique_ptr<nvinfer1::ISerializationConfig>(engine->createSerializationConfig());
auto serializationFlag = serializationConfig->getFlags();
serializationFlag &= ~(1U << static_cast<uint32_t>(nvinfer1::SerializationFlag::kEXCLUDE_WEIGHTS));
serializationConfig->setFlags(serializationFlag);
auto hostMemory = std::unique_ptr<nvinfer1::IHostMemory>(engine->serializeWithConfig(*serializationConfig));

serialization_config = engine.create_serialization_config()
serialization_config.flags &= ~(1 << int(trt.SerializationFlag.EXCLUDE_WEIGHTS))
binary = engine.serialize_with_config(serialization_config)

The application can now use the new full engine plan file for future inference.
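One way an app might organize this is to refit only on the first launch and reuse the saved full plan afterward. The sketch below shows that caching logic in isolation; the function name, the file layout, and the refit_and_serialize callable (which would wrap the refit and serialization steps shown earlier and return the serialized full engine) are all hypothetical.

```python
from pathlib import Path


def load_full_plan(full_plan_path, refit_and_serialize):
    """Return the full-engine bytes, refitting only if no cached plan exists."""
    path = Path(full_plan_path)
    if path.exists():
        # Later launches: skip the refit and read the cached full plan.
        return path.read_bytes()

    # First launch: refit the weight-stripped engine and serialize it with weights.
    plan = bytes(refit_and_serialize())

    # Write to a temporary file and rename it, so an interrupted launch
    # does not leave a truncated plan behind.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(plan)
    tmp.replace(path)
    return plan
```

The returned bytes can then be passed to the runtime for deserialization, whichever branch produced them.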

Limitations

In TensorRT 10.0, weight-stripped functionality is limited to refitting with identical build-time weights to ensure correct functionality and maximum performance. Differences between refit-time and build-time weights may result in undefined behavior. The builder controls which weights are stripped, and users cannot make layer-level decisions. This limitation may be lifted in future releases to enable finer control.

Support for weight-stripped engines in TensorRT-LLM will be available in upcoming releases, with the same requirement of using identical build-time weights to ensure correct functionality.

Integration with ONNX Runtime

The TensorRT 10.0 weight-stripped functionality has been integrated into ORT and will be available starting from the ORT 1.18.1 release. ORT offers a standard inference platform across multiple hardware providers within the Windows ecosystem. This integration enables TensorRT to offer the same functionality through ORT APIs, reducing shipment sizes when catering to diverse customer hardware.

In ORT, weight-stripped functionality uses the same EP context node-based logic that enables embedding serialized TensorRT engines within an ONNX model to bypass builder instantiation. This approach avoids shipping builder resources and significantly reduces TensorRT EP session setup time. For more information, see ONNX Runtime: Accelerated AI Deployment for PC Apps (GTC session).

You can now provide a weight-stripped engine instead of a full engine. To do this, create a TensorRT engine with stripped weights (to be refitted later) using the Boolean trt_weightless_engine_enable TensorRT EP session option along with the existing trt_dump_ep_context_model flag.

The resulting ONNX EP context metadata model contains the filename of the original ONNX model needed to refit the weights. This means that the builder library is no longer required for shipment. It’s sufficient to ship the original ONNX model for its weights and an embedded weight-stripped engine along with the main nvinfer library.

On the next TensorRT EP invocation for that particular model, with trt_weightless_engine_enable and trt_dump_ep_context_model still enabled, the engine will be refitted with the original weights.
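In Python, these session options are passed as provider options when creating the ORT session. The sketch below only assembles the provider list, using the two option names mentioned above; the helper name, the CPU fallback entry, and the model file name in the usage comment are assumptions, and running it for real requires an ORT build with the TensorRT EP.

```python
def tensorrt_ep_providers(weightless=True):
    """Build an ORT providers list that enables weight-stripped TensorRT engines."""
    trt_options = {
        # Dump the EP context model that embeds the serialized engine.
        "trt_dump_ep_context_model": True,
        # Build the embedded engine without weights, to be refitted later.
        "trt_weightless_engine_enable": weightless,
    }
    # The CPU provider is listed only as a fallback for this sketch.
    return [("TensorrtExecutionProvider", trt_options), "CPUExecutionProvider"]


# Usage, assuming onnxruntime with the TensorRT EP is installed:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=tensorrt_ep_providers())
```

Keeping the options in one helper makes it easier to use the same settings for the first run, which builds the engine, and for later runs, which refit it.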

Conclusion

TensorRT weight-stripped engines empower you to incorporate extensive AI functionality into your apps without worrying about increasing app sizes, all while leveraging TensorRT peak performance on NVIDIA GPUs.

On-device refitting enables continuous updates with improved weights without the need for engine rebuilding. Weight-stripping and refitting support are on the horizon for TensorRT-LLM, offering a novel approach to generating, deploying, and maintaining generative AI models of the future.

For more information about an end-to-end weight-stripping sample for ONNX models, see the /NVIDIA/TensorRT GitHub repo or the /NVIDIA/TensorRT notebook.

About the Authors

About Gunjan Mehta
Gunjan Mehta is a senior product manager for the Deep Learning Inference Platform SW at NVIDIA. He focuses on bringing TensorRT-accelerated inference to NVIDIA RTX GeForce laptops and desktops and edge devices like embedded and DRIVE platforms. In his 10+ years of experience at NVIDIA, he has spent most of his time as a compiler engineer for the NVIDIA Deep Learning Inference Accelerator ASIC. He received his M.Sc. in computer engineering from Carnegie Mellon University in 2013, where he specialized in systems software engineering.

About Xiaodong Huang
Xiaodong Huang is a senior deep learning software engineer on the NVIDIA TensorRT team. He specializes in GPGPU software architecture, accelerating deep learning inference, and system software development. Xiaodong holds an M.S. in software engineering from Fudan University.

About Michal Guzek
Michal Guzek is a senior inference systems software engineer on the Deep Learning Inference Workflows team at NVIDIA. He has been working on building graph parsers and tools for the effective deployment of trained deep learning models. He received an M.Sc. in computer science from University of California, Irvine in 2021.

About Maximilian Müller
Maximilian Müller is a developer technology engineer for Professional Visualization at NVIDIA. His passion for computer vision and deep learning helps external partners and internal teams to best leverage the performance of NVIDIA accelerators using CUDA. Before joining NVIDIA, he acquired an MSc in electrical and computational engineering at RWTH Aachen.
