Bo Yang Hsueh, NVIDIA
Recently, models such as BERT and XLNet, which use a stack of transformer layers as their key component, have shown breakthrough performance in various deep learning tasks. As a result, the inference performance of the transformer layer largely determines whether such models can be adopted in online services. First, we'll show how Faster Transformer optimizes the inference computation of both the transformer encoder and decoder layers. Then, beyond the optimizations for the standard transformer, we'll show how to customize Faster Transformer, together with the CUTLASS library, to accelerate a pruned transformer encoder layer.