GTC Silicon Valley 2019, ID S9535: Optimizing Runtime Performance of Neural Net Architectures for High Scalability
John Kominek (Voci Technologies)
Learn about the advantages and pitfalls of venturing away from off-the-shelf libraries to implement neural network inference algorithms from the ground up. We'll discuss the challenges of building large-vocabulary speech recognition engines that can decode more than 1,000 simultaneous conversations per NVIDIA V100 card, yet can still be ported down to low-memory embedded configurations such as the Tegra TK1. We'll cover which characteristics of the popular neural network types used in speech recognition scale almost perfectly, and which resist scaling or even scale negatively. Learn what profiling reveals about the silent, looming cost of kernel synchronization and what to do about it.