Create Speech AI Applications in Multiple Languages and Customize Text-to-Speech with Riva

This month, NVIDIA released world-class speech-to-text models for Spanish, German, and Russian in Riva, powering enterprises to deploy speech AI applications globally. In addition, enterprises can now create expressive speech interfaces using Riva’s customizable text-to-speech pipeline.

NVIDIA Riva is a GPU-accelerated speech AI SDK for developing real-time applications like live captioning, adding voice to text-based chatbots, and generating real-time transcription in call centers. For easy implementation, Riva offers highly accurate pretrained models in the NGC catalog.

With the TAO Toolkit, these models can be customized for any industry including telecommunications, finance, unified communications as a service, and healthcare. Developers can use Riva to deploy these models out-of-the-box. They are optimized to run in real time in less than 300 ms in the cloud, data center, and at the edge.

Riva release highlights include

World-class speech recognition skills in Spanish, German, and Russian.
Customizable text-to-speech pipeline for expressive interactions.
Low-code fine-tuning workflow with TAO Toolkit.

Automatic speech recognition in multiple languages

Every conversational AI application, from call centers to virtual assistants, relies heavily on automatic speech recognition. Enterprises can extend these apps globally with Riva automatic speech recognition in English, Spanish, German, and Russian.

The non-English automatic speech recognition models are trained on a variety of open-source datasets, such as Mozilla Common Voice, as well as private datasets. Riva automatic speech recognition models are developed to provide out-of-the-box accuracy and serve as a great starting point for adapting to industry, jargon, dialect, or even noisy surroundings. On popular evaluation datasets, these models deliver world-class accuracy on several industry applications.

Customizable text-to-speech pipelines

For customers to enjoy lifelike dialogues, speech applications must offer human-like expressions. Using Fastpitch, a new model created by the NVIDIA speech AI research team, Riva helps developers customize the text-to-speech pipeline and create expressive speech interfaces. For example, during inference time, developers can vary voice pitch and speed using SSML tags.

The latest state-of-the-art models, such as Fastpitch in Riva, help text-to-speech pipelines run several times faster than other competing options in the market.