This post is an updated version of an earlier post from May 2020.
Speech AI is used in a variety of applications, including empowering human agents in call centers, speech interfaces for virtual assistants, and live captioning in video conferencing. Speech AI includes automatic speech recognition (ASR) and text-to-speech (TTS). The ASR pipeline takes raw audio and converts it to text, and the TTS pipeline takes text and converts it to audio.
Developing and running real-time speech AI services is complex and difficult. Building speech AI applications requires hundreds of thousands of hours of audio data, tools to build and customize models for your specific use case, and scalable deployment support. It also means running in real time, with latency well under 300 milliseconds, to interact naturally with users. NVIDIA Riva streamlines the end-to-end process of developing speech AI services and provides real-time performance for human-like interactions.
NVIDIA Riva is a GPU-accelerated SDK for developing speech AI applications. Riva is designed to help you access speech AI functionalities easily and quickly. With a few commands, you can access the high-performance services through API operations and try demos.
The Riva SDK includes pretrained speech AI models, the NVIDIA TAO Toolkit for fine-tuning these models on a custom dataset, and optimized end-to-end skills for automatic speech recognition and speech synthesis.
Using Riva, you can easily fine-tune state-of-the-art models on your data to achieve a deeper understanding of their specific contexts. You can then optimize the models for inference to offer real-time services that run in 150 milliseconds (ms), compared to the 25 seconds required on CPU-only platforms.
Task-specific AI services and gRPC endpoints provide out-of-the-box, high-performance ASR and TTS. These AI services are trained with thousands of hours of public and internal datasets to reach high accuracy. You can start using the pre-trained models or fine-tune them with your own dataset to further improve model performance.
Riva uses NVIDIA Triton Inference Server to serve multiple models for efficient and robust resource allocation and to achieve high performance in terms of high throughput, low latency, and high accuracy.
Overview of Riva skills
Riva provides highly optimized automatic speech recognition and speech synthesis services for use cases like real-time transcription and virtual assistants. The automatic speech recognition skill is available in English, Spanish, German, and Russian. It is trained and evaluated on a wide variety of real-world, domain-specific datasets. With telecommunications, podcasting, and healthcare vocabulary, it delivers world-class production accuracy.
The Riva text-to-speech, or speech synthesis, skill generates human-like speech. It uses non-autoregressive models to deliver 12x higher performance on NVIDIA A100 GPUs compared to Tacotron 2 and WaveGlow models on NVIDIA V100 GPUs. Furthermore, with TTS you can create a natural custom voice for every brand and virtual assistant with only 30 minutes of an actor's voice data.
To take full advantage of the computational power of the GPUs, Riva skills use NVIDIA Triton Inference Server to serve neural networks and ensemble pipelines, which run efficiently with NVIDIA TensorRT.
Riva services are exposed through API operations accessible by gRPC endpoints that hide all the complexity. Figure 3 shows the server side of the system. The gRPC API operations are exposed by the API server running in a Docker container. The server is responsible for processing all incoming and outgoing speech data.
The API server sends inference requests to NVIDIA Triton and receives the results.
NVIDIA Triton is the backend server that simultaneously processes multiple inference requests on multiple GPUs for many neural networks or ensemble pipelines.
It is crucial for speech AI applications to keep latency below a given threshold, which means executing inference requests as soon as they arrive. At the same time, making the best use of the GPUs means increasing the batch size by delaying inference execution until more requests have arrived, forming a bigger batch. NVIDIA Triton balances these two conflicting goals.
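In Triton, this latency/throughput trade-off is controlled per model through the dynamic batcher in the model's config.pbtxt. A minimal sketch (the model name and values here are illustrative, not the defaults Riva ships):

```text
# config.pbtxt -- illustrative values only
name: "asr_acoustic_model"
platform: "tensorrt_plan"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 1000  # wait at most 1 ms to form a larger batch
}
```

Raising `max_queue_delay_microseconds` increases throughput at the cost of latency, which is why the delay is kept small for interactive speech services.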
NVIDIA Triton is also responsible for context switching between requests for networks that maintain state from one request to the next.
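For such stateful models, Triton provides a sequence batcher that routes every request in a stream to the same model instance and signals stream boundaries to the model. A hedged config.pbtxt fragment with illustrative values:

```text
sequence_batching {
  max_sequence_idle_microseconds: 60000000  # release a stream's state after 60 s idle
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START, fp32_false_true: [ 0, 1 ] } ]
    }
  ]
}
```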
Riva can be installed directly on bare-metal through simple scripts that download the appropriate models and containers from NGC, or it can be deployed on Kubernetes through a Helm chart, which is also provided.
Here’s a quick look at how you can interact with Riva. A Python interface makes communication with a Riva server easier on the client side through simple Python API operations. For example, here’s how a request for an existing TTS Riva service is created in three steps.
First, import the Riva API:
```python
import grpc
import numpy as np

import src.riva_proto.riva_tts_pb2 as rtts
import src.riva_proto.riva_tts_pb2_grpc as rtts_srv
import src.riva_proto.riva_audio_pb2 as ri
```
Next, create a gRPC channel to the Riva endpoint:
```python
channel = grpc.insecure_channel('localhost:50051')
riva_tts = rtts_srv.RivaSpeechSynthesisStub(channel)
```
Then, create a TTS request:
```python
req = rtts.SynthesizeSpeechRequest()
req.text = "We know what we are, but not what we may be?"
req.language_code = "en-US"
req.encoding = ri.AudioEncoding.LINEAR_PCM
req.sample_rate_hz = 22050
req.voice_name = "ljspeech"

resp = riva_tts.Synthesize(req)
audio_samples = np.frombuffer(resp.audio, dtype=np.float32)
```
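The response contains raw audio samples; to play or store them, you typically write them to a WAV file. Here is a minimal sketch using only the standard library and NumPy, assuming float32 samples in [-1, 1] as in the snippet above (the helper name and file name are arbitrary):

```python
import wave

import numpy as np


def write_wav(path, samples, sample_rate_hz=22050):
    """Convert float32 samples in [-1, 1] to 16-bit PCM and write a mono WAV file."""
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)               # mono
        wav.setsampwidth(2)               # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm16.tobytes())


# Example with one second of synthetic audio (a 440 Hz tone);
# with Riva, pass audio_samples from the response instead.
tone = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050).astype(np.float32)
write_wav("output.wav", tone)
```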
Customizing a model with your data
Using the NVIDIA TAO Toolkit, you can bring a custom-trained model into Riva (Figure 4). The NVIDIA TAO Toolkit is a low-code tool for fine-tuning models on domain-specific datasets.
For instance, to further improve the legibility and accuracy of ASR transcripts, you can add a custom punctuation and capitalization model to the ASR system, which otherwise generates text without those features.
Starting from a pre-trained BERT model, the first step is to prepare the dataset. For every word in the training dataset, the goal is to predict the following:
- The punctuation mark that should follow the word
- Whether the word should be capitalized
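The training data can therefore be seen as one punctuation label and one capitalization flag per word. A minimal sketch of how such labels might be derived from plain text (the label scheme here is illustrative, not the TAO Toolkit's exact format):

```python
import re


def make_labels(sentence):
    """Return (word, punctuation_label, capitalize_flag) triples for each word.

    "O" means no punctuation mark follows the word.
    """
    labels = []
    for token in sentence.split():
        match = re.match(r"^(\w[\w']*)([.,?]?)$", token)
        if not match:
            continue
        word, punct = match.groups()
        labels.append((word.lower(), punct or "O", word[0].isupper()))
    return labels


print(make_labels("We know what we are, but not what we may be."))
# Each word maps to its trailing punctuation and a capitalization flag.
```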
After the dataset is ready, the next step is training by running a previously provided script. When the training is completed and the desired final accuracy is reached, create the model repository for NVIDIA Triton by using an included script.
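The resulting model repository follows Triton's standard layout: one directory per model containing a config.pbtxt and numbered version subdirectories. The names below are illustrative:

```text
model_repository/
    punctuation_capitalization/
        config.pbtxt
        1/
            model.plan        # TensorRT engine for version 1
```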
The NVIDIA Riva Speech Skills documentation contains more details about how to train or fine-tune other models. This post showed only one of the many customization possibilities using the TAO Toolkit.
Deploying a model in Riva
Riva is designed for speech AI at scale. To help you efficiently serve models across different servers robustly, NVIDIA provides push-button model deployment using Helm charts (Figure 5).
The Helm chart configuration, available from the NGC catalog, can be modified for custom use cases. You can change settings related to which models to deploy, where to store them, and how to expose the services.
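With Helm, such changes are typically made by overriding values at install time. A hypothetical sketch of an override file (the key names below are illustrative, not the chart's actual schema; check the values.yaml shipped with the chart on NGC):

```text
# custom-values.yaml -- illustrative keys only
riva:
  speechServices:
    asr: true                    # deploy the ASR service
    tts: true                    # deploy the TTS service
persistentVolumeClaim:
  storageClassName: standard     # where downloaded models are stored
service:
  type: LoadBalancer             # how the gRPC endpoint is exposed
```

You would then pass the file to `helm install` with `-f custom-values.yaml`.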
Riva skills are in general availability for NVIDIA Developer Program members developing real-time transcription, virtual assistants, or custom voice implementations. For large-scale deployments, try Riva Enterprise, which also includes support from AI experts.
Check out Riva Getting Started to deploy Riva speech AI skills and boost your conversation-based solution use case.