Following recent breakthroughs in natural language processing, NVIDIA today announces new inference speedups for automatic speech recognition, natural language processing, and text-to-speech with TensorRT 7. With this release, NVIDIA now accelerates both training and deployment of the entire conversational AI pipeline.
Conversational AI comprises three key components, each requiring several models to be chained together:
- An automatic speech recognition (ASR) component
- A natural language processing (NLP) component for question answering (QA) tasks
- A text-to-speech (TTS), or speech synthesis, component
Additional models for translation and recommendations are generally required to generate a meaningful response for the user. For conversational AI to be interactive, it needs to work in real time, with latencies of under 300 ms.
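To make the chaining concrete, here is a minimal sketch of the three-stage pipeline with stubbed-in models. The functions `transcribe`, `answer`, and `synthesize` are hypothetical placeholders for deployed ASR, NLP, and TTS models, not NVIDIA APIs:

```python
import time

# Hypothetical placeholders for deployed models; in practice each stage
# would invoke a real ASR, QA, or TTS model (e.g., a TensorRT engine).
def transcribe(audio: bytes) -> str:      # ASR: speech -> text
    return "what is conversational ai"

def answer(question: str) -> str:         # NLP: question answering
    return "A pipeline of ASR, NLP, and TTS models."

def synthesize(text: str) -> bytes:       # TTS: text -> speech
    return text.encode("utf-8")

def respond(audio: bytes) -> bytes:
    """Chain ASR -> NLP -> TTS; for interactive use, the end-to-end
    latency budget is roughly 300 ms."""
    start = time.perf_counter()
    speech = synthesize(answer(transcribe(audio)))
    print(f"end-to-end latency: {(time.perf_counter() - start) * 1000:.1f} ms")
    return speech

respond(b"\x00")  # dummy audio input
```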
The latest CUDA-X AI updates enable easy, real-time deployment of conversational AI applications. These updates include:
TensorRT
TensorRT 7 delivers over 10x faster performance than CPU-only inference on conversational AI workloads, with support for speech recognition, natural language understanding, and text-to-speech for smarter, more natural human-to-AI conversation.
Highlights in this release include:
- Support for recurrent neural networks with a new compiler
- Support for over 20 new ONNX ops to import speech models
- Expanded dynamic shapes support for easily optimizing speech models
Examples are available from NGC and the TensorRT GitHub repository.
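As an illustration of how the ONNX importer and dynamic shapes fit together, here is a minimal sketch using the TensorRT 7 Python API. The ONNX file name, the input tensor name `audio_signal`, and the shape ranges are assumptions about a particular speech model, not fixed values:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# ONNX parsing and dynamic shapes require an explicit-batch network.
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(EXPLICIT_BATCH)
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("asr_model.onnx", "rb") as f:  # illustrative model file
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB of build workspace

# Dynamic shapes: an optimization profile covering the range of audio
# feature lengths the engine should accept. The tensor name and
# dimensions are assumptions; match them to the actual model input.
profile = builder.create_optimization_profile()
profile.set_shape("audio_signal",
                  min=(1, 64, 100),    # shortest expected utterance
                  opt=(1, 64, 500),    # typical utterance
                  max=(1, 64, 1000))   # longest expected utterance
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)  # ready for deployment
```

At inference time, the actual input shape is set on the execution context before each run, so a single engine can serve utterances of varying length within the profiled range.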
Neural Modules (NeMo) Toolkit
NeMo is a PyTorch-based toolkit for building conversational AI models. Through modular deep neural network development, NeMo enables fast experimentation: developers connect modules and mix and match components.
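As a minimal sketch of that workflow, the snippet below loads a pretrained ASR module from NGC and transcribes an audio file. The collection path, model class, checkpoint name, and file path are assumptions that vary across NeMo releases; check the repository for the exact API:

```python
import nemo.collections.asr as nemo_asr

# Load a pretrained ASR module from NGC (the checkpoint name is an
# example; the NGC catalog lists the available pretrained models).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Transcribe a local audio file (the path is illustrative).
print(asr_model.transcribe(["sample.wav"]))
```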
Early collaborators are excited by the ease of use and flexibility that NeMo provides when building complex language models.
“The SpeechBrain project aims to create a single, flexible toolkit that makes it easy to develop state-of-the-art speech technologies. By partnering with NVIDIA, we look forward to incorporating seamless mixing and manipulation of deep learning components through the Neural Modules (NeMo) toolkit, which is pivotal in providing the flexibility and modularity needed to accelerate the development of speech applications.” – Yoshua Bengio, Founder and Scientific Director of Mila, ACM Turing Award Laureate, Professor, University of Montreal
Get Started: a new technical post demonstrates how to build domain-specific ASR models using NeMo
More examples are also available from NGC and the NeMo GitHub repository
NVIDIA Riva Multimodal AI SDK
NVIDIA Riva is a software development kit for building multimodal AI applications that combine computer vision, speech, and other sensor inputs to enable natural conversation between humans and computers. Apply for exclusive news, updates, and early access to NVIDIA Riva.