Faster Transformer
Bo Yang Hsueh, NVIDIA
GTC 2020
Recently, models such as BERT and XLNet, which adopt a stack of transformer layers as their key component, have achieved breakthrough performance on a wide range of deep learning tasks. However, the inference performance of the transformer layer is often the bottleneck that limits whether such models can be deployed in online services. First, we'll show how Faster Transformer optimizes the inference computation of both the transformer encoder and decoder layers. Beyond the optimizations on the standard transformer, we'll also cover how to customize Faster Transformer, together with the CUTLASS library, to accelerate a pruned transformer encoder layer.
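To make the workload concrete, here is a minimal sketch — not taken from the session — that times inference of a stock PyTorch transformer encoder layer, the kind of baseline computation Faster Transformer aims to accelerate. The BERT-base-like dimensions (hidden size 768, 12 heads, batch 8, sequence length 128) are illustrative assumptions, not figures from the talk.

```python
# Baseline latency measurement for a standard transformer encoder layer
# (multi-head self-attention + feed-forward network) using stock PyTorch.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative BERT-base-like configuration (assumed, not from the session).
layer = torch.nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072
).to(device).eval()

# Dummy input with shape (sequence_length, batch_size, hidden_size),
# the default layout expected by this module.
x = torch.randn(128, 8, 768, device=device)

with torch.no_grad():
    for _ in range(10):                # warm-up iterations
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):               # timed iterations
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()

print(f"avg latency per encoder layer: {(time.time() - start) / 100 * 1000:.3f} ms")
```

A fused, kernel-optimized implementation such as Faster Transformer would replace the per-operation execution measured above with fewer, larger GPU kernels; the talk covers how that is done for both encoder and decoder layers.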