Developer Blog

NVIDIA Accelerates Conversational AI from Research to Production with Latest Updates in NeMo and Jarvis

NVIDIA recently released a world-class speech recognition capability that enterprises can use to generate highly accurate transcriptions, along with NeMo 1.0, which includes new state-of-the-art speech and language models for democratizing and accelerating conversational AI research.

World-Class Speech Recognition

Jarvis world-class speech recognition is an out-of-the-box speech service that can be easily deployed in any cloud or datacenter. Enterprises can use the Transfer Learning Toolkit (TLT) to customize the speech service for a variety of industries and use cases. With TLT, developers can accelerate development of custom speech and language models by 10x.

The speech recognition model is highly accurate and was trained on domain-agnostic vocabulary from telecommunications, finance, healthcare and education, as well as various proprietary and open-source datasets. It was also trained on noisy data, multiple sampling rates (including 8 kHz for call centers), a variety of accents, and dialogue, all of which contribute to the model’s accuracy.

With the Jarvis speech service, you can generate a transcription in under 10 milliseconds. The service has been evaluated on multiple proprietary datasets, achieving over 90 percent accuracy, and can be adapted to a wide variety of use cases and domains. Typical applications include transcribing audio in call centers, video conferencing, and virtual assistants.
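The post does not say how the accuracy figure is measured. A common metric for speech recognition is word error rate (WER), with word accuracy taken as 1 minus WER. The sketch below assumes that metric purely for illustration. The function name `word_error_rate` and the sample sentences are hypothetical and are not part of Jarvis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: accuracy is reported as 1 - WER. The post does not confirm this.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "please reset my voicemail password"
    hypothesis = "please reset my voice mail password"
    wer = word_error_rate(reference, hypothesis)
    print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
```

In this sample the transcript splits one word into two, which counts as one substitution plus one insertion against five reference words. Under this metric, a transcript with over 90 percent accuracy would have a WER below 0.10.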

T-Mobile, one of the largest telecommunication operators in the United States, used Jarvis to offer exceptional customer service.

“With NVIDIA Jarvis services, fine-tuned using T-Mobile data, we’re building products to help us resolve customer issues in real time,” said Matthew Davis, vice president of product and technology at T-Mobile. 

“After evaluating several automatic speech recognition solutions, T-Mobile has found Jarvis to deliver a quality model at extremely low latency, enabling experiences our customers love.”

You can download Jarvis speech service from the NGC Catalog to start building your own transcription application today. 

NeMo 1.0

NVIDIA NeMo is an open-source toolkit for researchers developing state-of-the-art (SOTA) conversational AI models. It includes collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP) and Text-to-Speech (TTS), which enable researchers to quickly experiment with new SOTA neural networks in order to create new models or build on top of existing ones.

NeMo is tightly coupled with the PyTorch, PyTorch Lightning and Hydra frameworks. These integrations let researchers develop and use NeMo models and modules in conjunction with PyTorch and PyTorch Lightning modules. With the Hydra framework, researchers can also easily customize complex conversational AI models.

Highlights of this version include:

  • Speech recognition support for multiple languages, plus new CitriNet and Conformer-CTC ASR models
  • Support for bidirectional neural machine translation models between English and five languages: Spanish, Russian, Mandarin, German and French
  • New speech synthesis models such as FastPitch, TalkNet and FastSpeech2, as well as end-to-end models like FastPitch + HiFiGAN and FastSpeech2 + HiFiGAN
  • Features for automatically performing inverse text normalization and denormalization, for creating datasets based on CTC-Segmentation, and for exploring speech datasets
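To make the text-processing feature in the list above concrete: inverse text normalization turns the spoken form of a transcript ("twenty three") into its written form ("23"). The toy sketch below handles only the numbers 0 to 99. It is not NeMo's implementation, and `inverse_normalize` and its lookup tables are hypothetical names used only for illustration.

```python
# Toy inverse text normalization: spoken-form numbers 0-99 to digits.
# Illustrative only; this is not how NeMo implements the feature.
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def inverse_normalize(text: str) -> str:
    """Replace spoken-form numbers below 100 with digits, word by word."""
    tokens = text.split()
    output = []
    i = 0
    while i < len(tokens):
        word = tokens[i].lower()
        if word in TENS:
            value = TENS[word]
            # Merge a following unit word (1-9), as in "twenty three".
            if i + 1 < len(tokens):
                unit = SMALL.get(tokens[i + 1].lower(), 0)
                if 1 <= unit <= 9:
                    value += unit
                    i += 1
            output.append(str(value))
        elif word in SMALL:
            output.append(str(SMALL[word]))
        else:
            output.append(tokens[i])  # leave non-number words untouched
        i += 1
    return " ".join(output)


if __name__ == "__main__":
    print(inverse_normalize("the meeting is at twenty three main street"))
```

A word-by-word lookup like this cannot tell a number from an ordinary use of the same word (it would rewrite "no one" as "no 1"), which is one reason a production pipeline needs context-aware rules rather than a simple table.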

Also, most NeMo models can be exported to NVIDIA Jarvis for production deployment and high-performance inference. 

Learn more about what is included in NeMo 1.0 from the NVIDIA Developer Blog. NeMo is open source and available for download from the NGC Catalog and GitHub.