NVIDIA recently released NVIDIA Riva, with world-class speech recognition capability that enterprises can use to generate highly accurate transcriptions, and NVIDIA NeMo 1.0, which includes new state-of-the-art speech and language models that democratize and accelerate conversational AI research.
World-class speech recognition
NVIDIA Riva world-class speech recognition is an out-of-the-box speech service that can be easily deployed in any cloud or data center. Enterprises can use the Transfer Learning Toolkit (TLT) to customize the speech service for a variety of industries and use cases. With TLT, developers can accelerate development of custom speech and language models by 10x.
The speech recognition model is highly accurate and was trained on vocabulary drawn from multiple domains, including telecommunications, finance, healthcare, and education, as well as on various proprietary and open-source datasets. It was also trained on noisy data, multiple sampling rates (including 8 kHz for call centers), a variety of accents, and dialogue, all of which contribute to the model's accuracy.
With the Riva speech service, you can generate a transcription in under 10 milliseconds. The service has been evaluated on multiple proprietary datasets, achieving over 90% accuracy, and can be adapted to a wide variety of use cases and domains. It can power applications such as transcription in call centers, video conferencing, and virtual assistants.
T-Mobile, one of the largest telecommunications operators in the United States, used Riva to offer exceptional customer service.
“With NVIDIA Riva services, fine-tuned using T-Mobile data, we’re building products to help us resolve customer issues in real time,” said Matthew Davis, vice president of product and technology at T-Mobile. “After evaluating several automatic speech recognition solutions, T-Mobile has found Riva to deliver a quality model at extremely low latency, enabling experiences our customers love.”
You can download the Riva speech service from the NGC Catalog to start building your own transcription application today.
State-of-the-art conversational AI research
NVIDIA NeMo is an open-source toolkit for researchers developing state-of-the-art (SOTA) conversational AI models. It includes collections for automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS), which enable researchers to quickly experiment with new SOTA neural networks, create new models, or build on top of existing ones.
NeMo is tightly coupled with the PyTorch, PyTorch Lightning, and Hydra frameworks. These integrations enable researchers to develop NeMo models and use them in conjunction with PyTorch and PyTorch Lightning modules. In addition, the Hydra integration makes it easy for researchers to customize complex conversational AI models from configuration files.
Highlights of this version include:
- Speech recognition support for multiple languages, plus new Citrinet and Conformer-CTC ASR models
- Bidirectional neural machine translation models for five language pairs: English to and from Spanish, Russian, Mandarin, German, and French
- New speech synthesis models such as FastPitch, TalkNet, and FastSpeech 2, as well as end-to-end pipelines like FastPitch + HiFi-GAN and FastSpeech 2 + HiFi-GAN
- Tools for automatic text normalization and inverse text normalization, for creating datasets based on CTC segmentation, and for exploring speech datasets
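To make the inverse text normalization (ITN) feature concrete: ITN rewrites spoken-form ASR output ("twenty three") into written form ("23"). NeMo implements this with weighted finite-state transducer (WFST) grammars; the dictionary-based toy below only sketches the idea and is not NeMo's API.

```python
# Toy sketch of inverse text normalization (ITN): rewriting spoken-form ASR
# output into written form. NeMo's real implementation uses WFST grammars;
# this lookup-based version only illustrates the concept.

ONES = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
        "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def inverse_normalize(text: str) -> str:
    """Replace simple spoken numbers (e.g. 'twenty three') with digits."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        w = words[i]
        if w in TENS:
            # 'twenty three' -> 23; bare 'twenty' -> 20
            if i + 1 < len(words) and words[i + 1] in ONES:
                out.append(str(TENS[w] + ONES[words[i + 1]]))
                i += 2
                continue
            out.append(str(TENS[w]))
        elif w in ONES:
            out.append(str(ONES[w]))
        else:
            out.append(w)
        i += 1
    return " ".join(out)

print(inverse_normalize("the call lasted twenty three minutes"))
# -> "the call lasted 23 minutes"
```

Production ITN also has to handle dates, currency, ordinals, and ambiguity ("may fifth" vs. "may" the verb), which is why grammar-based approaches are used in practice.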
In addition, most NeMo models can be exported to NVIDIA Riva for high-performance inference in production deployments.
Learn more about what is included in NeMo 1.0 on the NVIDIA Developer Blog. NeMo is open source and available for download from the NGC Catalog and GitHub.