Getting a Real Time Factor Over 60 for Text-To-Speech Services Using NVIDIA Jarvis

A diagram of the Jarvis Server showing that a TTS pipeline is one of several components, and is composed of the Tacotron2 and WaveGlow networks.
Figure 1. The Jarvis Server and the TTS pipeline.

NVIDIA Jarvis is an application framework that provides several pipelines for accomplishing conversational AI tasks. Generating high-quality, natural-sounding speech from text with low latency, also known as text-to-speech (TTS), can be one of the most computationally challenging of those tasks.

In this post, we focus on optimizations made to a TTS pipeline in Jarvis, as shown in Figure 1. For more information about the Jarvis Server, see Introducing Jarvis: Framework for GPU-Accelerated Conversational AI Applications.

This TTS model is composed of the Tacotron2 network, which maps character sequences to mel-scale spectrograms, followed by the NVIDIA WaveGlow network, which generates time-domain waveforms from the mel-scale spectrograms. For more information about the networks, as well as how to train them using PyTorch, see Generate Natural Sounding Speech from Text in Real-Time.

Our goal in creating the Jarvis TTS pipeline was to enable conversational AIs to respond with natural-sounding speech in as little time as possible, making for an engaging user experience. Below, we detail the effort of creating a high-performance TTS inference implementation using TensorRT and CUDA.

In a previous post, How to Deploy Real-Time Text-to-Speech Applications on GPUs using TensorRT, you learned how to import a TTS model from PyTorch into TensorRT, to perform faster inference with minimal effort. For this implementation, we wanted to get the lowest latency TTS inference that we could. To accomplish this, we made several design decisions that require more effort but deliver additional performance. The implementation discussed in this post is available as part of the NVIDIA Deep Learning Examples GitHub repository.

Creating the networks in TensorRT

To start with, we used the C++ TensorRT interface rather than the Python bindings. This helped to reduce the CPU overhead needed to coordinate and launch work on the GPU. This is particularly important in Tacotron2, where we must launch a network execution to generate each mel-scale spectrogram frame, of which there are roughly 86 per second of audio. For more information about creating and running networks with the C++ API, see Using the C++ API.

Creating Tacotron2 using IBuilder

In order to gain more flexibility in the Tacotron2 network, instead of parsing the exported ONNX model and having TensorRT automatically create the network, we manually constructed the network via the IBuilder interface of TensorRT. This enabled us to make several modifications, including allowing variable length sequences in the same batch to be processed.

To build the network manually, we first needed an easy way to get the weights from PyTorch to C++. For ease of use and readability, we used a single-level JSON structure. Another option would be to get the weights using the PyTorch C++ API. To export the Tacotron2 PyTorch model to JSON, use the following code:

import json
import torch

# statedict_path: path to the Tacotron2 PyTorch checkpoint (input).
# json_path: path of the JSON file to write (output).
statedict = torch.load(statedict_path)["state_dict"]

outdict = {}
for k, v in statedict.items():
    # Strip the prefix that data-parallel training adds to parameter names.
    if k.startswith("module."):
        k = k[len("module."):]
    # Convert each tensor to nested Python lists so that json can serialize it.
    outdict[k] = v.cpu().numpy().tolist()

with open(json_path, "w") as fout:
    json.dump(outdict, fout)

To get the weights into a format that the TensorRT C++ API can consume, we created the JSONModelImporter and LayerData classes, which handle reading and storing the weights as Weights. These can be passed into TensorRT layers.
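Because the JSON file stores each tensor as nested lists, a reader has to recover the tensor shapes from the nesting. The following is a minimal Python sketch of that round trip, not the JSONModelImporter implementation; the helper infer_shape and the two parameter names are hypothetical and chosen only for illustration.

```python
import json

def infer_shape(nested):
    """Recover a tensor shape from nested lists, as a reader of the JSON must."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return shape

# Hypothetical two-entry export, already stripped of the "module." prefix.
exported = {
    "embedding.weight": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "decoder.gate_layer.bias": [0.0],
}

text = json.dumps(exported)   # what the export step writes
loaded = json.loads(text)     # what an importer would parse
shapes = {name: infer_shape(values) for name, values in loaded.items()}
print(shapes)
```

A flat, single-level mapping from parameter name to values keeps the importer simple: each layer builder can look up the names it needs without walking a hierarchy.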

As in the ONNX-parsed implementation of Tacotron2, we split it up into three subnetworks: the encoder, decoder, and post-net, which we created with the EncoderBuilder, DecoderPlainBuilder, and PostNetBuilder classes, respectively.

At the start of the build methods, we created the INetworkDefinition object from IBuilder, and also added inputs.

Encoder network

The encoder network consists of an embedding layer, followed by convolution layers with activations, and ends with a bidirectional LSTM. Because the encoder is only run one time during inference, it tends to take a small fraction of the runtime, less than five percent in most cases.

While the bidirectional LSTM in TensorRT allows setting individual sequence lengths in batches, we needed to prevent the padding from impacting the convolution layers. To do this, we added a second input of a mask vector, which is all ones for the length of the sequence, followed by all zeros for the length of the padding. Then, the output of each convolution layer is multiplied element-wise by this mask, which resets the padded positions of each item in the batch to zero.
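The masking step can be sketched in a few lines of plain Python. This is an illustration of the idea only, with made-up values and hypothetical helper names; in the real network the multiplication is a TensorRT element-wise layer.

```python
def make_mask(seq_len, padded_len):
    """Ones for real positions, zeros for padding."""
    return [1.0] * seq_len + [0.0] * (padded_len - seq_len)

def apply_mask(features, mask):
    """Multiply each channel of a [channels][time] output by the mask."""
    return [[value * m for value, m in zip(channel, mask)] for channel in features]

# Hypothetical convolution output for one batch item: 2 channels, 5 time steps.
# Only the first 3 steps are real input, but the convolution has leaked
# non-zero values into the 2 padded positions.
conv_out = [[0.5, -1.0, 2.0, 0.7, 0.1],
            [1.5, 0.25, -0.5, -0.3, 0.9]]

mask = make_mask(seq_len=3, padded_len=5)
masked = apply_mask(conv_out, mask)
print(masked)
```

Without the mask, each convolution layer would spread values further into the padding, and the next layer would read them back into the real positions near the sequence boundary.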

Decoder network

The decoder network is run one time for every mel-scale spectrogram frame produced, which in this configuration is roughly 86 times for every second of output audio, until the stop condition is met. It is by far the most expensive part of the Tacotron2 network. It consists of a prenet, location sensitive attention (LSA), and LSTM cells, and ends with projection layers.
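The figure of roughly 86 frames per second follows from the spectrogram hop size. As a quick check, assuming a 22,050 Hz sampling rate and a hop length of 256 samples (common Tacotron2 settings that are not stated in this post):

```python
# Assumed values: typical Tacotron2 audio settings, not taken from this post.
SAMPLING_RATE_HZ = 22050
HOP_LENGTH_SAMPLES = 256

def frames_per_second(sampling_rate_hz, hop_length):
    """Number of mel-scale spectrogram frames needed per second of audio."""
    return sampling_rate_hz / hop_length

fps = frames_per_second(SAMPLING_RATE_HZ, HOP_LENGTH_SAMPLES)
print(f"{fps:.1f} decoder iterations per second of audio")
```

Under these assumptions, a seven-second utterance needs about 600 sequential decoder iterations, which is why per-iteration launch overhead matters so much.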

A fixed number of decoder iterations are run before reading the value of the accumulated stop tokens. This amortizes the cost of synchronously copying the stop token value from the GPU to the CPU over a larger amount of work.
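The effect of checking the stop condition once per group of iterations can be shown with a small simulation. This is a sketch under stated assumptions: the function run_decoder, the chunk size of 32, and the stop frame of 100 are all hypothetical, and no GPU work is actually launched.

```python
def run_decoder(stop_at_frame, chunk_size, max_frames=2000):
    """Simulate decoding while reading the stop flag once per chunk.

    `stop_at_frame` stands in for the frame at which the network would emit
    a stop token. Returns (frames_kept, frames_computed, host_syncs).
    """
    frames_computed = 0
    host_syncs = 0
    first_stop = None
    while frames_computed < max_frames and first_stop is None:
        # Launch `chunk_size` iterations back to back, without synchronizing.
        for _ in range(chunk_size):
            frames_computed += 1
            if first_stop is None and frames_computed >= stop_at_frame:
                first_stop = frames_computed  # accumulated on the device
        host_syncs += 1  # one synchronous device-to-host copy per chunk
    frames_kept = first_stop if first_stop is not None else frames_computed
    return frames_kept, frames_computed, host_syncs

per_frame = run_decoder(stop_at_frame=100, chunk_size=1)
chunked = run_decoder(stop_at_frame=100, chunk_size=32)
print(per_frame)  # (100, 100, 100)
print(chunked)    # (100, 128, 4)
```

The trade-off is that up to one chunk of extra frames is computed and then discarded, in exchange for far fewer synchronization points.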

All of the random numbers needed for the dropouts in the prenet are generated at the start of each group of decoder iterations. Generating many random numbers at one time is more efficient than repeatedly generating few random numbers in separate kernels.

The LSA is the only part of the decoder network that makes use of the padded input. As such, the same mask vector from the encoder is reused to reset that padding to zero after the convolution layer. The TensorRT IRaggedSoftMaxLayer layer allows the softmax to be performed on variable sequence lengths within a batch.

We merged the two fully connected layers at the end of the network, the 80×1536 layer for generating the mel-scale spectrogram frame and the 1×1536 layer for creating the stop token, into a single 81×1536 layer. This helps to increase parallelism and reduce the number of kernels launched.
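Merging the two layers amounts to stacking their weight rows and biases once at build time, then splitting the single output vector. A minimal sketch with tiny stand-in matrices (2×3 and 1×3 instead of 80×1536 and 1×1536); the values and the helper matvec are illustrative only:

```python
def matvec(weights, bias, x):
    """Dense layer: one dot product per output row, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

# Stand-ins for the mel projection and the stop-token projection.
mel_w = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]
mel_b = [0.1, -0.2]
stop_w = [[-1.0, 0.0, 2.0]]
stop_b = [0.5]

# Merge by stacking rows and biases; this happens once, when building the network.
merged_w = mel_w + stop_w
merged_b = mel_b + stop_b

x = [1.0, 2.0, 3.0]
merged_out = matvec(merged_w, merged_b, x)   # a single layer
mel_out, stop_out = merged_out[:-1], merged_out[-1:]
print(mel_out, stop_out)
```

Both layers read the same input vector, so the merged layer produces exactly the same values while loading that input only once.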

Post-net network

The post-net network consists of five convolution layers followed by activations. To get good performance out of it, we input a fixed number of mel-spectrograms, to allow TensorRT to choose the optimal configuration with which to run it. In this configuration, the post-net network tends to take a small fraction of the runtime, less than five percent in most cases.

Creating WaveGlow using the ONNX parser

Unlike Tacotron2, WaveGlow does not need any internal modification to work properly on variable length sequences when using batching. For this portion of our speech pipeline, we followed the nominal PyTorch to ONNX and ONNX to TensorRT path. To export the model, we used torch.onnx.export and marked the batch dimension as dynamic.

torch.onnx.export(model, (mels, z),
                  output_path,
                  dynamic_axes={'mels': {0: 'batch_size'},
                                'audio': {0: 'batch_size'}},
                  input_names=['mels', 'z'],
                  output_names=['audio'],
                  opset_version=10)

Because the potential input/output length for the WaveGlow network is highly variable, from several dozen mel-scale spectrograms to hundreds, we use a static dimension for the input/output length of two seconds, and cut the output down to the appropriate size for shorter chunks. This allows TensorRT to optimize for a specific size, while allowing us to process the variable length sequences encountered in TTS.
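The pad-then-trim approach can be sketched as follows. This is not the pipeline's code: fake_vocoder stands in for the fixed-shape WaveGlow engine, each mel frame is reduced to a single number, and the chunk length of 4 frames and 256 samples per frame are assumed values for illustration.

```python
CHUNK_FRAMES = 4          # hypothetical static chunk length, in mel frames
SAMPLES_PER_FRAME = 256   # assumed number of audio samples per mel frame

def fake_vocoder(chunk):
    """Stand-in for a fixed-shape engine: it only accepts full chunks."""
    assert len(chunk) == CHUNK_FRAMES
    return [frame for frame in chunk for _ in range(SAMPLES_PER_FRAME)]

def synthesize(mels, vocoder=fake_vocoder):
    """Run a fixed-size vocoder over variable-length input, trimming the padding."""
    audio = []
    for start in range(0, len(mels), CHUNK_FRAMES):
        chunk = mels[start:start + CHUNK_FRAMES]
        valid = len(chunk)
        padded = chunk + [0.0] * (CHUNK_FRAMES - valid)   # pad to the static size
        out = vocoder(padded)                              # always a full chunk of audio
        audio.extend(out[:valid * SAMPLES_PER_FRAME])      # keep only the real part
    return audio

mels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # 6 frames: one full chunk, one half chunk
audio = synthesize(mels)
print(len(audio))  # 1536
```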

We did not specify ‘z’ as having a dynamic batch size. Because we are only doing inference, we broadcast the same random ‘z’ across all items in the batch but generate a new ‘z’ for each chunk.

Optimizing performance further

We used NVIDIA Nsight Systems to gain more detailed insight into how our TTS implementation is using the hardware. Figures 2 and 3 show the GPU utilization profiles of Tacotron2 and WaveGlow, respectively.

A screenshot of an NVIDIA Nsight Systems profile of Tacotron2, showing gaps between kernels running on the GPU.
Figure 2. Tacotron2 GPU utilization.

A screenshot of an NVIDIA Nsight Systems profile of WaveGlow, showing that there are no gaps between kernels running on the GPU.

Figure 3. WaveGlow GPU utilization.

We can see that this Tacotron2 implementation is CPU bound, letting the GPU go idle in between its many small kernel launches. This is because many of the kernels take just a few microseconds to complete, and the CPU cannot generate work fast enough for the GPU.

WaveGlow, on the other hand, with its larger kernels, gets good utilization out of the GPU, and the CPU can generate work fast enough to keep the GPU busy.

Using custom plugins for the Tacotron2 decoder loop

The low utilization of the GPU in Tacotron2 begins to fade away when the batch size is increased, and with it the amount of work per kernel. The many bandwidth-bound general matrix-vector (GeMV)-like operations become more compute-heavy, general matrix multiply (GeMM)-like operations. This meant that we could focus performance improvements on the batch-size-one case, which is common in TTS applications.
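A rough way to see why larger batches shift the work from bandwidth-bound to compute-bound is to count floating-point operations per byte of weights read. The sketch below is a simplified model, not a measurement: it assumes FP32 weights by default and ignores input and output traffic.

```python
def flops_per_weight_byte(batch_size, bytes_per_weight=4):
    """Floating-point operations per byte of weights read for Y = W @ X,
    where X has `batch_size` columns."""
    flops_per_weight = 2 * batch_size   # one multiply and one add per column
    return flops_per_weight / bytes_per_weight

for batch in (1, 8, 64):
    print(batch, flops_per_weight_byte(batch))
```

At a batch size of one, each weight is loaded to perform just two operations, so memory bandwidth and launch overhead dominate. Each additional sequence in the batch reuses the same loaded weights.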

Figure 4. The two paths of execution taken when Tacotron2 is run for a single input sequence or a batch with multiple sequences.

To achieve maximum performance in the Tacotron2 network, we replaced many of the small layers in the decoder with custom layers, using the TensorRT IPluginV2DynamicExt interface. This gave us the ability to perform several low-level optimizations, while still letting TensorRT do the heavy lifting for other parts of our network. We built the decoder network with plugins in the DecoderBuilderPlugins class.

We split the decoder iteration into four separate plugins:

  • Prenet
  • LSTM cell
  • Attention
  • Projection

We changed our C++ backend to switch between the decoder without plugins for batch sizes greater than one, and our decoder with plugins for a batch size of one, as shown in Figure 5.

A diagram of the Tacotron2 decoder with plugins, showing that the order of execution is prenet, LSTM cell, attention, LSTM cell, and ends with projection.
Figure 5. How the four plugins make up the decoder for batch size one.

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration without plugins, showing low GPU utilization as the CPU is unable to generate work fast enough for the GPU.
Figure 6. The Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the CPU work of the iteration shown in light blue along the top, and GPU work shown in light blue at the bottom.

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration using plugins, showing high GPU utilization as the CPU is able to generate work for the GPU fast enough.
Figure 7. The Nsight Systems profile of a Tacotron2 decoder iteration using plugins, with the CPU work of the iteration shown in light blue along the top, and GPU work shown in light blue at the bottom.

Figure 6 shows the profile of a Tacotron2 decoder iteration before we implemented any custom plugins. The CPU work required to launch many little kernels is the bottleneck, and you can see many gaps in GPU work at the bottom.

Figure 7 shows the profile of a Tacotron2 decoder iteration using the custom plugins. The CPU is only working on launching kernels for the iteration for half of the time that it takes the GPU to run them. This keeps the GPU busy, and allows the CPU to move on to queuing up work for the next iterations.

Our work in fusing kernels and reducing CPU overhead led to nearly a 10x reduction in the CPU time required to launch an iteration, and a 5x reduction in GPU time. Because these custom layers are specific to our network, we were able to do low-level CUDA optimizations. This includes using template parameters as layer dimensions, which lets the compiler perform aggressive optimizations, as well as avoiding bounds checking by using block sizes which are a multiple of the weight dimensions.

Prenet plugin

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the prenet highlighted, showing many kernels being run for the computation.
Figure 8. The Nsight Systems profile of a Tacotron2 decoder iteration using no plugins, with the computation of the prenet outlined in red. The top two boxes are the time taken on the CPU launching the kernels, and the red boxes at the bottom are the time taken by the kernels on the GPU.

Originally, the two layers of the prenet were implemented as a fully connected layer, followed by an element-wise multiplication against random vectors whose elements are either 0 or 1/(1-p), where p is the dropout probability, to emulate the dropout, and finally a rectified linear unit (ReLU) activation. This resulted in three kernels per layer, as can be seen in Figure 8.

The plugin for the prenet uses a CUDA kernel that fuses the fully connected layer with the dropout and subsequent ReLU. It is implemented in the file src/trt/plugin/taco2PrenetPlugin/taco2PrenetKernel.cu. We call this kernel twice to handle both layers in the prenet. This reduces six kernel calls to two, as seen in Figure 9.
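The fusion can be illustrated in plain Python by comparing a three-pass version of one prenet layer with a single-pass version. This is a sketch of the idea rather than the CUDA kernel: the function names, the tiny 3×2 weight matrix, and the dropout probability of 0.5 are assumptions made for the example.

```python
import random

def make_dropout_scale(n, p, rng):
    """Pre-generated vector of 0 (dropped) or 1/(1-p) (kept)."""
    keep = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else keep for _ in range(n)]

def prenet_layer_unfused(weights, x, dropout_scale):
    """Three separate passes, mirroring three kernel launches."""
    fc = [sum(w * v for w, v in zip(row, x)) for row in weights]
    dropped = [a * s for a, s in zip(fc, dropout_scale)]
    return [max(a, 0.0) for a in dropped]

def prenet_layer_fused(weights, x, dropout_scale):
    """One pass per output: dot product, dropout scaling, then ReLU."""
    out = []
    for row, scale in zip(weights, dropout_scale):
        acc = sum(w * v for w, v in zip(row, x))   # fully connected
        acc *= scale                                # dropout as a multiply
        out.append(max(acc, 0.0))                   # ReLU
    return out

rng = random.Random(0)
weights = [[0.5, -1.0], [2.0, 1.0], [-1.5, 0.25]]
x = [1.0, 2.0]
scale = make_dropout_scale(len(weights), p=0.5, rng=rng)
fused = prenet_layer_fused(weights, x, scale)
unfused = prenet_layer_unfused(weights, x, scale)
print(fused == unfused)  # True
```

The fused version never writes the intermediate results back to memory, which on the GPU corresponds to avoiding two round trips to global memory per layer.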

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration with plugins, with the prenet highlighted, showing two kernels being run for the computation.
Figure 9. The Nsight Systems profile of a Tacotron2 decoder iteration using plugins, with the computation of the prenet outlined in red. The top two boxes are the time taken on the CPU launching the kernels, and the red boxes at the bottom are the time taken by the kernels on the GPU.

LSTM cell plugin

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the LSTM cells highlighted, showing many kernels being run for the computation.
Figure 10. The Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the computation of the two LSTM cells outlined in red. The top two boxes are the time taken on the CPU launching the kernels, and the red boxes at the bottom are the time taken by the kernels on the GPU.

The plugin for the LSTM cells fuses a concat operation on the input with the LSTM cell computation. Our LSTM cell kernel is implemented in the file src/trt/plugin/taco2LSTMCellPlugin/taco2LSTMCellKernel.cu.

The NVIDIA CUDA Deep Neural Network library (cuDNN) implementation creates new streams and launches GeMVs on each stream to maximize parallelism. This works well for LSTMs with a sequence length greater than one, but the overhead of stream creation and multiple kernel launches does not amortize well enough for our purposes, as can be seen in Figure 10. As such, our custom kernel loads the two separate inputs, performs both GeMVs, adds both biases, and computes the hidden and cell states. Doing this all in one kernel also reduces the number of reads and writes that we need to make to global memory.
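For reference, the computation that the single kernel covers can be written as one Python function. This is a sketch of a standard LSTM cell step, not the plugin's code; the gate order (input, forget, cell, output) follows the PyTorch convention and is an assumption about the weight layout, and all names are illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_cell(x_a, x_b, h_prev, c_prev, w_ih, w_hh, b_ih, b_hh):
    """One LSTM step. The two inputs x_a and x_b are read in place rather
    than being concatenated into a new buffer first."""
    hidden = len(h_prev)
    gates = []
    for r in range(4 * hidden):
        acc = b_ih[r] + b_hh[r]                    # both biases
        for k, v in enumerate(x_a):                # first GeMV, first input part
            acc += w_ih[r][k] * v
        for k, v in enumerate(x_b):                # first GeMV, second input part
            acc += w_ih[r][len(x_a) + k] * v
        for k, v in enumerate(h_prev):             # second GeMV, recurrent weights
            acc += w_hh[r][k] * v
        gates.append(acc)
    i_gate = [sigmoid(v) for v in gates[0:hidden]]
    f_gate = [sigmoid(v) for v in gates[hidden:2 * hidden]]
    g_gate = [math.tanh(v) for v in gates[2 * hidden:3 * hidden]]
    o_gate = [sigmoid(v) for v in gates[3 * hidden:4 * hidden]]
    c = [f_gate[j] * c_prev[j] + i_gate[j] * g_gate[j] for j in range(hidden)]
    h = [o_gate[j] * math.tanh(c[j]) for j in range(hidden)]
    return h, c

# With all-zero weights and biases, every sigmoid gate is 0.5 and the cell gate is 0.
zeros_ih = [[0.0, 0.0] for _ in range(4)]
zeros_hh = [[0.0] for _ in range(4)]
h, c = lstm_cell([1.0], [2.0], [0.0], [1.0], zeros_ih, zeros_hh, [0.0] * 4, [0.0] * 4)
print(h, c)
```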

To avoid being memory-bound when loading the weights from global memory, we store the weights as FP16 but still perform the computation in FP32 to preserve enough accuracy for the LSA. Figure 11 shows the effects of these optimizations, where we have a single kernel per LSTM cell.
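The split between storage precision and compute precision can be illustrated with the half-precision format in Python's struct module. This is only a sketch of the idea: Python floats are 64-bit, so the accumulation here is wider than the FP32 used on the GPU, and the helper names are hypothetical.

```python
import struct

def to_fp16_bytes(values):
    """Pack floats as IEEE 754 half precision, 2 bytes each."""
    return struct.pack(f"<{len(values)}e", *values)

def dot_fp16_weights(packed, x):
    """Unpack half-precision weights, then accumulate at higher precision."""
    weights = struct.unpack(f"<{len(x)}e", packed)
    return sum(w * v for w, v in zip(weights, x))

weights = [0.5, -0.25, 1.0]          # exactly representable in FP16
packed = to_fp16_bytes(weights)
result = dot_fp16_weights(packed, [2.0, 4.0, 1.0])
print(len(packed), result)           # 6 bytes instead of 12 for FP32

# A value such as 0.1 is rounded when stored, so a small error is introduced
# at load time, but no further precision is lost in the accumulation.
stored = struct.unpack("<e", struct.pack("<e", 0.1))[0]
print(stored)
```

Halving the bytes per weight halves the memory traffic for the weight matrices, which is what a bandwidth-bound GeMV spends most of its time on.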

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration with plugins, with the LSTM cells highlighted, showing one kernel per cell being run for the computation.
Figure 11. The Nsight Systems profile of a Tacotron2 decoder iteration using plugins, with the computation of the LSTM cells outlined in red. The top two boxes are the time taken on the CPU launching the kernels, and the red boxes at the bottom are the time taken by the kernels on the GPU.

Attention plugin

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the attention highlighted, showing many kernels being run for the computation.
Figure 12. The Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the computation of the Attention outlined in red. The top three boxes are the time taken on the CPU launching the kernels, and the three red boxes at the bottom are the time taken by the kernels on the GPU.
A diagram showing that the element-wise layers in the attention are fused into two kernels with most of the FC layers remaining unfused.
Figure 13. How the Attention plugin is implemented across six kernels.

The Attention plugin is the most complex of the four plugins, making use of six CUDA kernels. Figure 13 shows the division of the operations into those CUDA kernels.

We made use of the high-performance NVIDIA cuBLAS library GeMV and GeMM implementations for the FC and MM layers. For the convolution layer, we used a custom CUDA kernel that had the number of channels and kernel size hard-coded for the Tacotron2 network.

We fused the first set of element-wise sums and activation with the FC layer before the SoftMax. Then, we fused the SoftMax with the element-wise summation required for the accumulation of the Attention weights.
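The second fusion, softmax together with the accumulation of attention weights, can be sketched as a single function that returns both results. This is an illustration in plain Python with hypothetical names, not the CUDA kernel:

```python
import math

def softmax_and_accumulate(energies, cumulative):
    """Compute attention weights from energies and, in the same pass, add
    them to the running sum of past attention weights."""
    peak = max(energies)                        # subtract the max for stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    weights = []
    new_cumulative = []
    for e, c in zip(exps, cumulative):
        w = e / total
        weights.append(w)
        new_cumulative.append(c + w)            # accumulation fused with normalization
    return weights, new_cumulative

attn, cum = softmax_and_accumulate([0.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25])
print(attn, cum)
```

Because each normalized weight is added to the running sum as soon as it is produced, the weights do not need to be written out and read back by a separate summation kernel.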

Figure 14 shows the result of the work on this plugin, where launching the six kernels takes only 29 microseconds.

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration with plugins, with the attention highlighted, showing six kernels being run for the computation.
Figure 14. The Nsight Systems profile of a Tacotron2 decoder iteration using plugins, with the computation of the Attention outlined in red. The top box is the time taken on the CPU launching the kernels, and the red box at the bottom is the time taken by the kernels on the GPU.

Projection plugin

The projection plugin is relatively simple, fusing a concatenation with a fully connected layer. Because this layer has only 81 rows in the weight matrix, we used one row per thread block, in order to ensure the kernel could occupy as many SMs as possible, and thus make the most use out of the available memory bandwidth when loading the weights.

Figure 15 shows that the concatenation and projection previously required several memory operations and kernel launches (the GeMV plus some reformatting). In our plugin, this is now reduced to a single kernel, as shown in Figure 16.

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the projection highlighted, showing several kernels being run for the computation.
Figure 15. The Nsight Systems profile of a Tacotron2 decoder iteration without plugins, with the computation of the projection outlined in red. The top box is the time taken on the CPU launching the kernels, and the red box at the bottom is the time taken by the kernels on the GPU.

An NVIDIA Nsight Systems profile of a Tacotron2 decoder iteration with plugins, with the projection highlighted, showing one kernel being run for the computation.
Figure 16. The Nsight Systems profile of a Tacotron2 decoder iteration using plugins, with the computation of the Projection outlined in red. The top box is the time taken on the CPU launching the kernel, and the red box at the bottom is the time taken by the kernel on the GPU.

Demonstrating end-to-end performance

Figures 17 and 18 compare the end-to-end inference performance of the Tacotron2 and WaveGlow TTS pipeline. Latency is measured from the start of Tacotron2 to the end of WaveGlow. All implementations were given the same 128-character input sequence, and generated between 6.4 and 7.0 seconds of audio.

The variation in output length is due to the random dropouts in the prenet, and small numerical differences in the implementations affecting the LSA. The PyTorch FP32 CPU run was generated on an Intel Xeon E5-2698 v4, and the others were generated on an NVIDIA V100 GPU.

A diagram comparing latency of the different TTS implementations, with the TensorRT C++ implementation with plugins having the lowest latency (0.20 seconds).
Figure 17. The performance of the different Tacotron2 and WaveGlow implementations in terms of latency (smaller is better), when given a 128-character input sequence, using TensorRT 7.0.

The implementation using the TensorRT C++ API with plugins for the decoder loop (TRT+CPP Plugins) delivers 6.7 seconds of audio in 200 milliseconds. It does so by using the performance-critical features of TensorRT and by making better use of the power of the GPU.

In Figure 18, the real-time factor (RTF) is the duration of the generated audio divided by the latency (wall clock time), that is, the number of seconds of audio generated per second. An implementation with an RTF of 1.0 generates audio in real time. On the V100 using TensorRT 7.0, this optimized implementation achieved 33.7x faster than real-time, end-to-end, TTS inference.
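Read as seconds of audio generated per second of wall clock time, the RTF is a one-line calculation. A minimal sketch using the rounded V100 figures quoted above (6.7 seconds of audio in 200 milliseconds); the function name is ours:

```python
def real_time_factor(audio_seconds, latency_seconds):
    """Seconds of audio generated per second of wall clock time."""
    return audio_seconds / latency_seconds

# Rounded figures quoted in this post for the V100 run.
rtf = real_time_factor(audio_seconds=6.7, latency_seconds=0.200)
print(f"{rtf:.1f}x real time")  # 33.5x real time
```

The result is close to the reported 33.7x; the small gap is presumably due to the rounding of the quoted duration and latency.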

A diagram showing the RTF of the TTS implementations, with the TensorRT C++ implementation with plugins being the fastest (33.7x increase).
Figure 18. The performance of the different Tacotron2 and WaveGlow implementations in terms of RTF (larger is better), using TensorRT 7.0.

New A100 GPU and TensorRT 7.1

With the newly announced NVIDIA A100 GPU and TensorRT 7.1, the performance of the TTS pipeline gets even better.

A diagram showing the RTF of TTS on T4, V100, and A100 on TensorRT 7.0 and 7.1. The A100 has a 61.4x increase.
Figure 19. The performance of the TensorRT C++ with the plugin TTS implementation, comparing TensorRT 7.0 and TensorRT 7.1 across different hardware.

Figure 19 shows that, on A100 with TensorRT 7.1, the TTS pipeline achieves a blisteringly fast 61.4x RTF, generating 7.3 seconds of natural-sounding speech in less than 120 milliseconds. This high-performance speech synthesis is a crucial component of conversational AI, giving users immediate answers to their questions and creating a truly engaging experience. The improvements made in TensorRT 7.1 also increase performance on the NVIDIA T4 and V100 GPUs.

Running the TTS pipeline

This standalone TTS pipeline is available as part of the NVIDIA Deep Learning Examples repository. You can check it out and run it yourself with the following instructions.

Clone the repository

First clone the NVIDIA Deep Learning Examples repository, and navigate to the Tacotron2 C++ directory:

git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/trtis_cpp

Export the models

You can either train models yourself, or download pretrained checkpoints from NVIDIA NGC and copy them to the ./checkpoints directory:

mkdir checkpoints
cp <Tacotron2_checkpoint> ./checkpoints/
cp <WaveGlow_checkpoint> ./checkpoints/

Next, export the PyTorch checkpoints so that they can be used to build TensorRT engines. This can be done using the export_weights.sh script:

mkdir models
./export_weights.sh checkpoints/tacotron2_1032590_6000_amp checkpoints/waveglow_1076430_14000_amp models/

Set up the Triton server

To build the Docker container for the Triton server, run the build_trtis.sh script:

./build_trtis.sh models/tacotron2.json models/waveglow.onnx models/denoiser.json

This takes some time as TensorRT tries out different tactics for best performance while building the engines.

Set up the Triton client

Next, build the client Docker container. To do this, enter the trtis_client directory and run the script build_trtis_client.sh:

cd trtis_client
./build_trtis_client.sh
cd ..

Run the Triton server

To run the server locally, use the script run_trtis_server.sh:

./run_trtis_server.sh

To set which GPUs the Triton server sees, use the environment variable NVIDIA_VISIBLE_DEVICES.

Run the Triton client

Leave the server running. In another terminal window, type:

cd trtis_client/
./run_trtis_client.sh phrases.txt

This generates one WAV file per line in the file phrases.txt, named after the line number (for example, 1.wav through 8.wav for an 8-line file) in the audio/ directory. It is important that each line in the file ends with a period, or Tacotron2 may fail to detect the end of the phrase.