Developer Blog: How to Deploy Real-Time Text-to-Speech Applications on GPUs Using TensorRT


Conversational AI is the technology that allows us to communicate with machines as we would with other people. With the advent of sophisticated deep learning models, human-machine communication has reached an unprecedented level of quality. However, these models are compute intensive and therefore require optimized inference code to keep the interaction smooth. In this developer blog post, we’ll walk through how to convert a PyTorch model through the ONNX intermediate representation to TensorRT 7 to speed up inference in one part of Conversational AI: Speech Synthesis.

Conversational AI

A typical modern Conversational AI system comprises 1) an Automatic Speech Recognition (ASR) model, 2) a Natural Language Processing (NLP) model for Question Answering (QA) tasks, and 3) a Text-to-Speech (TTS) or Speech Synthesis network. A recently published technical blog describes how you can build domain-specific ASR models on GPUs.

Figure 1. A typical pipeline of Conversational AI

A challenge for Conversational AI is that, for the conversation to feel natural, the machine has to respond promptly to human actions. When you talk with friends, their reactions to your comments or questions are instantaneous, and you probably expect similar responsiveness from the devices you use. This is difficult because sequential signals such as waveforms are hard to parallelize during inference. That is the case for many state-of-the-art neural networks, including Tacotron 2, that use recurrent layers or operate in an autoregressive manner, where the output signal is fed back to the input.
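To make the autoregressive bottleneck concrete, here is a minimal sketch in plain Python. It is not Tacotron 2 code: `toy_step` is a hypothetical stand-in for one decoder step, and the names `autoregressive_decode`, `first_input`, and `num_steps` are assumptions made for illustration. The point is only that each iteration needs the previous output, so the loop cannot be spread across time steps.

```python
from typing import Callable, List


def autoregressive_decode(
    step: Callable[[float], float], first_input: float, num_steps: int
) -> List[float]:
    """Produce num_steps outputs, feeding each output back in as the next input."""
    outputs: List[float] = []
    current = first_input
    for _ in range(num_steps):
        # This call cannot start until the previous iteration has finished,
        # which is what prevents parallelization across time steps.
        current = step(current)
        outputs.append(current)
    return outputs


def toy_step(previous: float) -> float:
    """Hypothetical stand-in for one decoder step of a real network."""
    return 0.5 * previous + 1.0


frames = autoregressive_decode(toy_step, first_input=0.0, num_steps=4)
print(frames)
```

In a real TTS model each step is a full network evaluation, so the per-step latency adds up over the length of the utterance.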

The utterances we speak and hear are sequential signals of varying duration. In the context of neural network applications, we refer to this variability in utterance length as variable-size input/output. A Conversational AI system has to handle this variability correctly at both the system level and the model level. At the model level, the signals are typically processed using recurrent layers, such as Long Short-Term Memory (LSTM) units.
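One common way to handle variable-size input at the system level is to pad the sequences in a batch to a shared length and keep the true lengths alongside them. The sketch below illustrates that idea in plain Python. It is an assumption about how such handling might look, not the approach taken in the full blog post, and `pad_batch` and `pad_value` are hypothetical names.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Pad every sequence to the longest length and return the original lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    # Build new lists so the caller's sequences are left unmodified.
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three encoded utterances of different lengths (the token IDs are made up).
batch, lengths = pad_batch([[5, 2, 9], [7], [3, 3]])
print(batch)
print(lengths)
```

Keeping `lengths` lets downstream code ignore the padded positions, for example when masking outputs or trimming the generated audio.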

Read the entire Developer Blog >>