Announcing NVIDIA NeMo: Fast Development of Speech and Language Models

This is an updated version of Neural Modules for Fast Development of Speech and Language Models. This update adds sections on pretrained models in NGC and on fine-tuning models on custom datasets, upgrades the NeMo diagram with the text-to-speech (TTS) collection, and replaces the AN4 dataset in the example with the LibriSpeech dataset.

As a researcher building state-of-the-art speech and language models, you must be able to quickly experiment with novel network architectures. This experimentation may focus on modifying existing network architectures to improve performance, or it may be higher-level experimentation in which speech and language models are combined to build end-to-end applications.

A typical starting point for this type of work involves looking at example code and pretrained models from model zoos. However, reusing code and pretrained models this way can be tricky.

The inputs and outputs, coding style, and data processing layers in these models may not be compatible with each other. Worse still, you may be able to wire up these models in your code in such a way that it technically “works” but is in fact semantically wrong. A lot of time, effort, and duplicated code goes into making sure that you are reusing models safely.

As model complexity and model reuse increase, this approach becomes unsustainable.

Now compare this with how you build complex software applications. Here, the history is one of increasing levels of abstraction: from machine code to assembly languages, on to structured programming with compilers and type systems, and finally to object-oriented programming. With these higher-level tools come better guardrails and better code reuse.

Deep learning libraries have been going through a similar evolution. Low-level tools such as CUDA and cuDNN provide great performance, and TensorFlow provides great flexibility at the cost of human effort. However, tensors and simple operations and layers are still the central objects of high-level libraries, such as Keras and PyTorch.

Introducing NVIDIA NeMo

NVIDIA NeMo is an open source toolkit with a PyTorch backend that pushes the abstractions one step further. NeMo makes it possible for you to easily compose complex neural network architectures using reusable components. A semantic compatibility check is done automatically between these components using neural types. 

NeMo uses mixed-precision compute to get the highest performance possible using Tensor Cores on NVIDIA GPUs. It includes capabilities to scale training to multi-GPU systems and multi-node clusters. 

Figure 1. Application stack for NeMo.

At the heart of the toolkit is the concept of a neural module. A neural module takes a set of inputs and computes a set of outputs. It can be thought of as an abstraction that’s somewhere between a layer and a full neural network. Typically, a module corresponds to a conceptual piece of a neural network, such as an encoder, decoder, or language model. 

A neural module’s inputs and outputs each have a neural type, which describes the semantics, axis order, and dimensions of the input or output tensor. This typing allows neural modules to be safely chained together to build applications, as in the automatic speech recognition (ASR) example later in this post.
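To make the idea concrete, here is a minimal sketch of a neural-type compatibility check in plain Python. It is not NeMo's implementation or API: ToyNeuralType, TypeMismatchError, and check_connection are hypothetical names chosen for illustration, and tensor dimensions are left out for brevity. It only shows how two tensors of the same rank can be rejected when their semantics or axis order differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyNeuralType:
    """Hypothetical stand-in for a neural type (not NeMo's class)."""
    semantics: str          # what the tensor represents, e.g. "spectrogram"
    axes: Tuple[str, ...]   # axis order, e.g. ("batch", "feature", "time")


class TypeMismatchError(TypeError):
    """Raised when an output is wired to an incompatible input."""


def check_connection(output_type: ToyNeuralType, input_type: ToyNeuralType) -> None:
    """Raise if a producer's output type does not match a consumer's input type."""
    if output_type.semantics != input_type.semantics:
        raise TypeMismatchError(
            f"semantics differ: {output_type.semantics!r} vs {input_type.semantics!r}"
        )
    if output_type.axes != input_type.axes:
        raise TypeMismatchError(
            f"axis order differs: {output_type.axes} vs {input_type.axes}"
        )


# Same content and rank, but the last two axes are swapped.
spectrogram_bft = ToyNeuralType("spectrogram", ("batch", "feature", "time"))
spectrogram_btf = ToyNeuralType("spectrogram", ("batch", "time", "feature"))

check_connection(spectrogram_bft, spectrogram_bft)  # compatible: returns silently

try:
    check_connection(spectrogram_bft, spectrogram_btf)
except TypeMismatchError as err:
    print(f"Rejected: {err}")
```

Without a check like this, the second connection could still run whenever the feature and time sizes happen to line up, which is the "works but is semantically wrong" case described earlier.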

NeMo also comes with an extendable collection of modules for ASR, NLP, and TTS. These collections provide API actions for data loading, preprocessing, and training different network architectures, including Jasper, BERT, Tacotron 2, and WaveGlow. You can also fine-tune models on a custom dataset using pretrained models in NVIDIA NGC.

This toolkit arose out of the challenges faced by the Applied Research team. By open-sourcing this work, NVIDIA hopes to share its benefits with the broader community of speech, NLP, and TTS researchers and encourage collaboration. 

Building a new model

To see how to use NeMo, build a simple ASR model. Along the way, you see how neural types provide semantic safety checks and how the tool can scale out to multiple GPUs with minimal effort.

Getting started

The NVIDIA/NeMo GitHub repo outlines the general requirements and installation instructions. The repo describes several ways to install NeMo:

  • Using an NGC container
  • Using the package manager (pip) command: pip install nemo_toolkit[all]. The package manager also allows you to install individual NeMo packages based on the collection that you specify in the brackets, such as asr, nlp, or tts, in place of all.
  • Using a Dockerfile that can be used to build a Docker image with a ready-to-run NeMo installation.

Use the Introduction to End-To-End Automatic Speech Recognition Jupyter notebook to see how to set up the environment using the pip command and train the Jasper ASR model in NeMo.

For this ASR example, you use a network called Jasper. Jasper is an end-to-end ASR model, which means that it can transcribe speech samples without any additional alignment information. For more information, see Jasper: An End-to-End Convolutional Neural Acoustic Model.

The training pipeline for this model consists of the following blocks. Each of these logical blocks corresponds to a neural module.

Figure 2. Jasper speech recognition pipeline.

Jupyter notebook

The steps of model building in the Jupyter notebook are as follows:

  1. Instantiate the necessary neural modules.
  2. Describe the directed acyclic graph (DAG) of the neural modules.
  3. Invoke an action, for example, train.

The steps are few because you’re working with higher levels of abstraction.
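The three steps above can be sketched with a toy pipeline in plain Python. This is an illustration only and does not use NeMo: the four functions, the layout of the dag dictionary, and the run action are all hypothetical, and the action here is a single forward pass rather than training. The sketch assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Step 1: instantiate the modules (plain functions stand in for neural modules).
def data_layer():
    return [0.1, 0.4, 0.3, 0.9]  # made-up audio samples


def preprocessor(signal):
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # peak-normalize the signal


def encoder(features):
    return [round(f * 2, 3) for f in features]  # toy "encoding"


def decoder(encoded):
    return ["a" if e > 1.0 else "_" for e in encoded]  # toy "transcript"


# Step 2: describe the DAG as node -> (module, names of upstream nodes).
dag = {
    "audio": (data_layer, []),
    "features": (preprocessor, ["audio"]),
    "encoded": (encoder, ["features"]),
    "tokens": (decoder, ["encoded"]),
}


# Step 3: invoke an action. This one evaluates the graph once, in dependency order.
def run(graph, target):
    dependencies = {node: set(deps) for node, (_, deps) in graph.items()}
    results = {}
    for node in TopologicalSorter(dependencies).static_order():
        module, deps = graph[node]
        results[node] = module(*(results[d] for d in deps))
    return results[target]


print(run(dag, "tokens"))
```

Separating the graph description from the action is the design point: the same dag could be handed to a different action, such as training or evaluation, without rewiring the modules.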

For an ASR introduction and to explore step-by-step procedures, see the Introduction to End-To-End Automatic Speech Recognition Jupyter notebook.

Fine-tuning models on custom datasets

Fine-tuning plays an important role in building highly accurate models on custom data using pretrained models. It is a form of transfer learning, which applies the knowledge gained in one task to another, similar task.
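As a rough illustration of the idea, the following plain-Python sketch keeps a "pretrained" encoder frozen and trains only a small new head on a custom dataset. Everything in it is an assumption made for illustration: the weights in PRETRAINED_W, the four-sample DATA set, and the fine_tune_head function are made up, and freezing the encoder is only one of several ways to fine-tune. It does not use NeMo or any pretrained NGC model.

```python
# Weights of a "pretrained" encoder. They are made up and never updated below.
PRETRAINED_W = (0.5, -0.25)


def encoder(x):
    """Frozen feature extractor: maps a 2-D input to a single feature."""
    return PRETRAINED_W[0] * x[0] + PRETRAINED_W[1] * x[1]


def fine_tune_head(data, lr=0.1, epochs=200):
    """Fit a new head y = a * encoder(x) + b with SGD on squared error.

    Only the head parameters a and b are trained; the encoder stays frozen.
    """
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            h = encoder(x)           # reuse the pretrained features
            err = (a * h + b) - y    # prediction error on the new task
            a -= lr * err * h        # gradient step for the head weight
            b -= lr * err            # gradient step for the head bias
    return a, b


# Tiny custom dataset whose targets follow y = 3 * encoder(x) + 1.
DATA = [
    ((2.0, 0.0), 4.0),
    ((0.0, 4.0), -2.0),
    ((4.0, 0.0), 7.0),
    ((0.0, 0.0), 1.0),
]

a, b = fine_tune_head(DATA)
print(f"head weight = {a:.3f}, head bias = {b:.3f}")
```

Because the encoder already produces a useful feature, only two parameters have to be learned from the new data, which is why fine-tuning typically needs far less data and compute than training from scratch.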

Several pretrained models are available in NGC as a starting point for fine-tuning on your use case with NeMo. I discuss them further in the next section of this post. To learn how to fine-tune models through an example, see Jump-start Training for Speech Recognition Models in Different Languages with NeMo.

NeMo exports models to NVIDIA Jarvis with one line of code, in one of three formats: PyTorch, TorchScript, or ONNX. Jarvis imports the exported model, generates a TensorRT engine, and prepares a high-performance, production-ready inference model. Quantizing models to INT8 with TensorRT in Jarvis maximizes throughput during inference.

Pretrained models in NGC

Several pretrained models are available in NGC for ASR, NLP, and TTS, such as Jasper, QuartzNet, BERT, Tacotron 2, and WaveGlow. These models are trained on thousands of hours of open source and proprietary data to get high accuracy, and required over 100k hours on DGX systems to train.

Multiple open source and commercial datasets are used to build these pretrained models, such as LibriSpeech, Mozilla Common Voice, AI-shell2 Mandarin Chinese, Wikipedia, BookCorpus, and LJSpeech. The datasets help models to gain a deep understanding of the context so they can perform effectively in real-time use cases.

All the pretrained models are downloadable, and the model code is open source, so you can train models on your own dataset or even build new models from the base architecture. Model scripts and containers are readily available in NGC to fine-tune models for your use case.

For more information, see the following resources: