This post was updated to include information on the NVIDIA Riva open beta.
Real-time conversational AI is a complex and challenging task. To allow real-time, natural interaction with an end user, the models need to complete computation in under 300 milliseconds. Natural interactions are challenging because they require multimodal sensory integration. Model pipelines are also complex and require coordination across multiple services:
- Automatic speech recognition (ASR)
- Natural language processing (NLP)
- Domain-specific fulfillment services
- Text-to-speech (TTS)
NVIDIA Riva is an end-to-end framework for building conversational AI applications. It includes GPU-optimized services for ASR, NLP, TTS, and computer vision that use state-of-the-art deep learning models.
You can fuse these skills to form multimodal skills in your applications. You can develop novel SOTA model architectures using NVIDIA NeMo and use the NVIDIA Transfer Learning Toolkit to fine-tune models on your custom datasets to get the highest accuracy possible. And you can use other tools in Riva to optimize these models for inference, deploy them, and run them as services at scale.
Riva is designed to help you access conversational AI functionality easily and quickly. With a few commands, you can access the high-performance services through API operations and try multimodal demos.
Riva is a fully accelerated application framework for building multimodal conversational AI services that use an end-to-end deep learning pipeline (Figure 1).
The Riva framework includes pretrained conversational AI models, the NVIDIA Transfer Learning Toolkit for fine-tuning these models on a custom dataset, and optimized end-to-end services for speech, vision, and NLP tasks. In addition to AI services, Riva enables you to fuse vision, audio, and other sensor inputs simultaneously to deliver skills such as multi-user, multi-context conversations in applications such as virtual assistants, multi-user diarization, and call center assistants.
Using Riva, you can easily fine-tune state-of-the-art models on your data to achieve a deeper understanding of your specific context. You can then optimize the models for inference to offer real-time services that run in 150 ms, compared to the 25 seconds required on CPU-only platforms.
Task-specific AI services and gRPC endpoints provide out-of-the-box, high-performance ASR, NLP, TTS, and a wide range of computer vision AI services. All these AI services are trained on thousands of hours of data from public and internal datasets to reach high accuracy. You can start using the pretrained models or fine-tune them with your own dataset to further improve model performance.
Another main component is Riva Core, which is designed to enable you to create sophisticated, multimodal, conversational AI applications. It includes Riva Dialog Manager, which is responsible for the following tasks:
- Context switching in conversations with multiple users
- Dialog state tracking
- Addressing user requests with a fulfillment engine
Domains, intents, and entities returned by the Riva NLP service are used as the input to the dialog manager, which outputs the next action to take together with a text response. The dialog manager closely works with the fulfillment engine, which is responsible for retrieving domain-specific information to satisfy the user query and executing user-requested commands.
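To make that data flow concrete, here is a minimal sketch of a dialog manager step: it takes a domain, intent, and entities and returns the next action together with a text response. It is not the Riva Dialog Manager API. The `NLPResult` class, the `FULFILLMENT` table, the action names, and the weather handler are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class NLPResult:
    """Hypothetical container for what an NLP service returns."""
    domain: str
    intent: str
    entities: dict = field(default_factory=dict)


def fulfill_weather(entities: dict) -> str:
    """Stand-in fulfillment engine; a real one would query a weather service."""
    city = entities.get("location", "your area")
    return f"It is sunny in {city}."


# Maps (domain, intent) to a fulfillment handler. All entries are illustrative.
FULFILLMENT = {("weather", "get_forecast"): fulfill_weather}


def next_action(result: NLPResult) -> Tuple[str, str]:
    """Return (action, text response) for one NLP result."""
    handler = FULFILLMENT.get((result.domain, result.intent))
    if handler is None:
        # No handler is registered, so ask the user to rephrase.
        return "ask_clarification", "Sorry, I did not understand that."
    return "speak", handler(result.entities)
```

For example, `next_action(NLPResult("weather", "get_forecast", {"location": "Paris"}))` yields the action `"speak"` and a response that a TTS service could then synthesize.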
Figure 2 shows the three major building blocks that are available.
Riva Core also provides building blocks such as Sensor Management and Multimodal Fusion to help you manage the complex challenges of synchronizing different sensory input streams as well as the different timing of launching AI services.
The Multimodal Skills component combines task-specific services to form complex multimodal services. All multimodal applications can be expressed as a computational graph, with each node being an AI service.
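The computational-graph idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Riva defines or executes its graphs. The node names (`asr`, `gaze`, `nlp`, `fusion`) and the lambda stand-ins for services are assumptions, and strings stand in for real audio and video data.

```python
from graphlib import TopologicalSorter


def run_graph(nodes, deps, source):
    """Run every node after its dependencies and collect all outputs.

    nodes: name -> callable taking a dict of {dependency name: output}
    deps:  name -> set of dependency names ("input" is the raw source)
    """
    outputs = {"input": source}
    for name in TopologicalSorter(deps).static_order():
        if name in outputs:
            continue  # the raw input is already available
        upstream = {dep: outputs[dep] for dep in deps[name]}
        outputs[name] = nodes[name](upstream)
    return outputs


# Stand-in services; real nodes would call ASR, NLP, or vision endpoints.
nodes = {
    "asr": lambda up: up["input"]["audio"],
    "gaze": lambda up: up["input"]["video"],
    "nlp": lambda up: "lights_on" if "light" in up["asr"] else "unknown",
    "fusion": lambda up: f'{up["gaze"]}:{up["nlp"]}',
}
deps = {
    "asr": {"input"},
    "gaze": {"input"},
    "nlp": {"asr"},
    "fusion": {"gaze", "nlp"},
}
result = run_graph(nodes, deps, {"audio": "turn on the light", "video": "user_2"})
```

Here the `fusion` node only runs once both the vision branch and the speech branch have produced their outputs, which is the kind of ordering a multimodal skill needs.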
Riva uses NVIDIA Triton to serve multiple models for efficient and robust resource allocation, as well as to achieve high throughput, low latency, and high accuracy.
To take full advantage of the computational power of GPUs, Riva relies on NVIDIA Triton to serve neural networks and ensemble pipelines that run efficiently with TensorRT.
The services that Riva provides are exposed through API operations accessible using gRPC endpoints, which also hide the complexity from application developers.
Figure 3 shows what the system looks like on the server side. The gRPC API operations are exposed by the API server (running in a Docker container), which is responsible for processing all incoming and outgoing computer vision, speech, and natural language processing data.
The API server sends inference requests to NVIDIA Triton and receives the results.
NVIDIA Triton is the backend server that processes multiple inference requests on multiple GPUs for many neural networks or ensemble pipelines at the same time.
For conversational AI applications, it is important to keep the latency below a given threshold. This usually translates into executing inference requests as soon as they arrive. However, saturating the GPUs and increasing throughput requires a larger batch size, which means delaying inference execution until more requests are received and a bigger batch can be formed. There is therefore a trade-off between latency and throughput.
NVIDIA Triton is also responsible for context switching of stateful networks between one request and the next.
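The tension between immediate execution and batching described above can be illustrated with a small simulation. The sketch below is not Triton's batching implementation. It is a simplified model in which a batch is dispatched either when it is full or when its oldest request has waited for a maximum delay. The function name and the parameters `max_batch_size` and `max_delay_ms` are assumptions made for this example.

```python
def form_batches(arrivals_ms, max_batch_size, max_delay_ms):
    """Group request arrival times (in ms) into batches.

    A batch is dispatched when it is full or when its oldest request has
    waited max_delay_ms, whichever comes first. Returns a list of
    (dispatch_time_ms, [arrival times]) tuples.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # Flush a pending batch whose deadline passed before this arrival.
        if current and t > current[0] + max_delay_ms:
            batches.append((current[0] + max_delay_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # The last partial batch goes out when its deadline expires.
        batches.append((current[0] + max_delay_ms, current))
    return batches


arrivals = [0, 2, 3, 20, 21, 22, 23]
batched = form_batches(arrivals, max_batch_size=4, max_delay_ms=5)
immediate = form_batches(arrivals, max_batch_size=4, max_delay_ms=0)
```

With a 5 ms delay budget, the seven requests are served in two batches, and no request waits longer than 5 ms before dispatch. With a delay of 0, every request runs alone, which minimizes queueing latency but leaves batch capacity unused.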
Riva can be installed directly on bare metal through simple scripts that download the appropriate models and containers from NVIDIA NGC, or it can be deployed on Kubernetes through a Helm chart, which is also provided.
I mentioned before how the gRPC endpoints hide the complexity of the system from application developers. Here’s a quick look at how you can interact with Riva.
On the client side, a Python interface makes the communication with a Riva server easier through simple Python API operations.
As an example, here’s how a request for an existing TTS Riva service is created.
First, import the Riva API:
# grpc and numpy are used in the steps that follow
import grpc
import numpy as np
import src.riva_proto.riva_tts_pb2 as rtts
import src.riva_proto.riva_tts_pb2_grpc as rtts_srv
import src.riva_proto.audio_pb2 as ri
Next, create a gRPC channel to the Riva endpoint:
channel = grpc.insecure_channel('localhost:50051')
riva_tts = rtts_srv.RivaSpeechSynthesisStub(channel)
Then, create a TTS request:
# Describe the text to synthesize and the audio format to return
req = rtts.SynthesizeSpeechRequest()
req.text = "We know what we are, but not what we may be?"
req.language_code = "en-US"
req.encoding = ri.AudioEncoding.LINEAR_PCM
req.sample_rate_hz = 22050
req.voice_name = "ljspeech"
# Send the request and decode the returned audio bytes into samples
resp = riva_tts.Synthesize(req)
audio_samples = np.frombuffer(resp.audio, dtype=np.float32)
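Once the response arrives, you may want to save the audio for playback. The following sketch uses only the Python standard library to write the raw bytes to a 16-bit mono WAV file. It assumes the payload holds little-endian float32 samples, as the `np.frombuffer` call above implies, and that sample values lie in the range [-1, 1]. Both assumptions should be checked against the Riva documentation. The function name `float32_pcm_to_wav` is hypothetical.

```python
import array
import sys
import wave


def float32_pcm_to_wav(raw: bytes, path: str, sample_rate_hz: int = 22050) -> int:
    """Write mono float32 samples as a 16-bit WAV file; return the sample count."""
    samples = array.array("f")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # assumption: the payload is little-endian
    # Clip to [-1, 1] and scale to the signed 16-bit range.
    pcm16 = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm16.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm16.tobytes())
    return len(samples)
```

With the request above, a call such as `float32_pcm_to_wav(resp.audio, "output.wav", req.sample_rate_hz)` would produce a file that most audio players can open.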
Training a model with your data
By using NVIDIA NeMo or the NVIDIA Transfer Learning Toolkit, you can use a custom-trained model in Riva (Figure 4).
For instance, to further improve the readability and accuracy of ASR-transcribed text, you may want to add a custom punctuation and capitalization model to the ASR system, which usually generates text without those features.
Starting from a pretrained BERT model, the first step is to prepare the dataset. For every word in the training dataset, the goal is to predict the following:
- The punctuation mark that should follow the word.
- Whether the word should be capitalized.
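The two prediction targets listed above can be derived automatically from ordinary punctuated text. The sketch below shows one way to do so. The label scheme is an assumption chosen for illustration (`"O"` for no punctuation or no capitalization, `"U"` for an uppercase first letter), and the exact format expected by the NeMo training script may differ, so refer to the documentation before preparing a real dataset.

```python
MARKS = ",.?"  # punctuation marks the model should learn to restore


def label_words(sentence: str):
    """Return (lowercased word, punctuation label, capitalization label) per word.

    Punctuation label: the mark that follows the word, or "O" for none.
    Capitalization label: "U" if the word starts uppercase, else "O".
    """
    labeled = []
    for token in sentence.split():
        punct = token[-1] if token[-1] in MARKS else "O"
        word = token.rstrip(MARKS)
        if not word:
            continue  # skip tokens that consist only of punctuation
        cap = "U" if word[0].isupper() else "O"
        labeled.append((word.lower(), punct, cap))
    return labeled


examples = label_words("Hello, how are you?")
```

The lowercased, unpunctuated words mimic raw ASR output, and the two labels per word are what the model learns to predict.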
After the dataset is ready, the next step is training by running a script that is already provided. When the training is completed and the desired final accuracy is reached, create the model repository for NVIDIA Triton by using an included script.
The Riva documentation contains more details about how to train or fine-tune other models. This post shows only one of the many customization possibilities using NeMo.
Deploying a model in Riva
Lastly, Riva is designed for conversational AI at scale. To help you serve models efficiently and robustly across different servers, NVIDIA provides push-button model deployment using Helm charts (Figure 5).
The Helm chart performs several functions:
- Pulls Docker images from NGC for the Riva Services API server, Triton Inference Server, and utility containers for downloading and converting models.
- Generates the Triton Inference Server model repository.
- Starts the Triton Inference Server with the appropriate configuration.
- Exposes the Triton Inference Server and Riva servers as Kubernetes services.
The Helm chart configuration can be modified for custom use cases. You can change settings related to which models to deploy, where to store them, and how to expose the services.
Riva is available as an open beta to members of the NVIDIA Developer Program. If you have use cases such as virtual assistants, digital avatars, or any standalone ASR, NLP, or TTS use case such as transcription, Riva is here to enable your development. To get access to cutting-edge conversational AI technology and NVIDIA experts, apply to the Riva Beta Program.
For more information, see Riva Getting Started.