GTC 2020: XLNet Optimization Using CUDA
Bo Yang Hsueh, NVIDIA | Christina Zhang
XLNet, a generalized autoregressive pretraining method, has achieved strong results on several natural language processing tasks. Compared with previous language models, XLNet can process long sentences and avoids the drawbacks of relying on special tokens. However, as far as we know, XLNet has not yet been properly optimized for performance with CUDA, which lengthens inference time and hinders its wide deployment. We first profiled XLNet using its TensorFlow code. We then optimized it in three ways: 1. For relative positional encoding, we improved parallelization with the help of cuBLAS; 2. We customized the corresponding self-attention architecture based on the attention code in FasterTransformer; and 3. We used kernel fusion and other CUDA optimization strategies to speed up XLNet.