GTC 2020: XLNet Optimization Using CUDA

Bo Yang Hsueh, NVIDIA | Christina Zhang

GTC 2020

XLNet, a generalized autoregressive pretraining method, has achieved strong results on several natural language processing tasks. Compared with previous language models, XLNet can process long sentences and avoids the drawbacks of relying on special tokens. However, as far as we know, XLNet has not yet been properly optimized for performance using CUDA; the resulting longer inference time hinders its wide deployment. We first profiled XLNet using its TensorFlow code. We then optimized XLNet in three ways: 1. For relative positional encoding, we improved its parallelization with the help of cuBLAS; 2. We customized the corresponding self-attention architecture based on the attention code in FastTransformer; and 3. We used kernel fusion and other CUDA optimization strategies to speed up XLNet.
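As a rough illustration of the kernel fusion idea mentioned in the abstract, the sketch below is a CPU-side analogy in plain Python, not the talk's CUDA code. The choice of operations (a bias add followed by a tanh-approximated GELU) and all function names are assumptions made for the example; the abstract does not say which operations were fused. On a GPU, each separate pass would correspond to its own kernel launch with a round trip through global memory, which is the cost that fusion removes.

```python
import math


def gelu(x):
    # Tanh approximation of GELU (assumed activation, chosen for illustration).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def add_bias(values, bias):
    # Pass 1: on a GPU this would be one kernel that writes a temporary buffer.
    # `values` is a row-major flattened [rows, hidden] matrix; `bias` has length hidden.
    hidden = len(bias)
    return [v + bias[i % hidden] for i, v in enumerate(values)]


def apply_gelu(values):
    # Pass 2: a second kernel that reads the temporary buffer back.
    return [gelu(v) for v in values]


def unfused(values, bias):
    # Two passes over the data with a materialized intermediate buffer.
    tmp = add_bias(values, bias)
    return apply_gelu(tmp)


def fused(values, bias):
    # One pass: each element is loaded once, biased, activated, and stored once.
    hidden = len(bias)
    return [gelu(v + bias[i % hidden]) for i, v in enumerate(values)]
```

Both versions compute the same result; the fused one simply never materializes the intermediate buffer. In a real CUDA kernel the body of `fused` would be the per-thread work, with `i` derived from the block and thread indices.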