
Speech Recognition: Customizing Models to Your Domain Using Transfer Learning

This post is part of a series about generating accurate speech transcription. For part 1, see Speech Recognition: Generating Accurate Transcriptions Using NVIDIA Riva. For part 3, see Speech Recognition: Deploying Models to Production.

Creating a new AI deep learning model from scratch is an extremely time- and resource-intensive process. A common solution to this problem is to employ transfer learning. To make this process even easier, NVIDIA offers the TAO Toolkit, which can cut an engineering time frame of 80 weeks down to 8 weeks. The TAO Toolkit supports both computer vision and conversational AI (ASR and NLP) use cases.

In this post, we cover the following topics:

  • Installing the TAO Toolkit and getting access to pretrained models
  • Fine-tuning a pretrained speech transcription model
  • Exporting the fine-tuned model to NVIDIA Riva

To follow along, download the Jupyter notebook.

Installing the TAO Toolkit and downloading pretrained models

Before installing the TAO Toolkit, make sure you have the following installed on your system:

  • python >= 3.6.9
  • docker-ce > 19.03.5
  • nvidia-docker2 3.4.0-1
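
As a rough sanity check before installing, the following sketch verifies the Python version and looks for the Docker executables on your PATH. It is only an illustration: the executable names docker and nvidia-docker are assumptions about a typical installation, and the script does not verify the Docker versions listed above.

```python
import shutil
import sys

MIN_PYTHON = (3, 6, 9)  # minimum Python version from the prerequisites list


def python_is_recent_enough(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, micro) version meets the minimum."""
    if version is None:
        version = sys.version_info[:3]
    return tuple(version) >= tuple(minimum)


def missing_executables(names=("docker", "nvidia-docker")):
    """Return the executables from `names` that are not found on PATH.

    The default names are assumptions about a typical installation.
    """
    return [name for name in names if shutil.which(name) is None]


print("Python version OK:", python_is_recent_enough())
print("Executables not found on PATH:", missing_executables())
```

An empty list on the second line only means the executables were found. Check their versions separately with the tools themselves.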

For more information about installing nvidia-docker and docker, see Prerequisites. You can install the TAO Toolkit with pip. We recommend using a virtual environment to avoid version conflicts.

pip3 install nvidia-pyindex
pip3 install nvidia-tao

With installation out of the way, the next step is to get some pretrained models. NVIDIA has made many AI and machine learning models available on NGC (NVIDIA GPU Cloud), not just for conversational AI but across a wide range of domains. The NGC Catalog is a curated set of GPU-optimized software for AI, HPC, and visualization.

To download resources from NGC, log in to the registry with your NGC API key. You can create and use one for free.

Screenshot of the API generation page.
Figure 1. Getting the NGC API Key

CitriNet is a state-of-the-art automatic speech recognition (ASR) model built by NVIDIA, which enables you to generate speech transcriptions. You can download this model from the Speech to Text English Citrinet model card.


To offer a fluid and streamlined experience, the toolkit downloads and runs Docker containers in the background that make use of specification files, which are covered in the fine-tuning section. All the details are hidden by the TAO launcher. You specify the local directories to mount into the Docker container by defining a JSON file: ~/.tao_mounts.json. You can find the mount file in the Jupyter notebook. In this file, /data stores the dataset, /specs stores the specification files, and /results stores the results:

{
    "Mounts": [
        {
            "source": "~/tao/data",
            "destination": "/data"
        },
        {
            "source": "~/tao/specs",
            "destination": "/specs"
        },
        {
            "source": "~/tao/results",
            "destination": "/results"
        },
        {
            "source": "~/.cache",
            "destination": "/root/.cache"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}

With this, you have TAO Toolkit installed, have downloaded a pretrained ASR model, and specified the mounting point for the TAO Toolkit launcher. In the next section, we discuss how to use TAO Toolkit to fine-tune this model on a dataset of your choice.
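
If you prefer to generate the mount file rather than write it by hand, a minimal sketch could look like the following. The helper names build_mounts_config and write_mounts_file are hypothetical, and the top-level keys Mounts and DockerOptions are assumed from the launcher's mount-file format. The demonstration writes to a temporary directory so that it does not overwrite an existing ~/.tao_mounts.json.

```python
import json
import tempfile
from pathlib import Path


def build_mounts_config(base_dir="~/tao"):
    """Build the launcher mount configuration as a plain dictionary."""
    mounts = [
        {"source": f"{base_dir}/{name}", "destination": f"/{name}"}
        for name in ("data", "specs", "results")
    ]
    mounts.append({"source": "~/.cache", "destination": "/root/.cache"})
    return {
        "Mounts": mounts,  # assumed top-level key name
        "DockerOptions": {  # assumed top-level key name
            "shm_size": "16G",
            "ulimits": {"memlock": -1, "stack": 67108864},
        },
    }


def write_mounts_file(config, path):
    """Write the configuration as indented JSON and return the resolved path."""
    path = Path(path).expanduser()
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return path


# Demonstration: write to a temporary directory instead of the home folder.
with tempfile.TemporaryDirectory() as tmp:
    written = write_mounts_file(build_mounts_config(), Path(tmp) / ".tao_mounts.json")
    print(written.read_text(encoding="utf-8"))
```

To create the real file, pass "~/.tao_mounts.json" as the path. Generating the file from code keeps the three working directories consistent if you later move them.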

Fine-tuning the model

Fine-tuning a model with TAO Toolkit is a three-step process:

  1. Download the spec files.
  2. Preprocess the dataset.
  3. Fine-tune with hyperparameters.

Figure 2 shows the steps needed to fine-tune the model.

For the ASR use case, there are three key steps: download specs, run preprocessing, and then fine-tune. Download the AN4 dataset and a .tlt model from NGC.
Figure 2. TAO Toolkit workflow

Step 1: Download spec files

NVIDIA TAO Toolkit is a low-code or no-code solution that simplifies training and fine-tuning models through specification files. These files enable you to customize model-specific parameters, trainer parameters, the optimizer, and parameters for the dataset being used. These specification files can be downloaded to the folder mounted earlier:

tao speech_to_text_citrinet download_specs \
    -r <path to results dir>/speech_to_text_citrinet \
    -o <path to specs dir>/speech_to_text_citrinet

Here are the YAML files that come with the TAO Toolkit. For more information, see Downloading Sample Spec Files.

  • create_tokenizer.yaml
  • dataset_convert_an4.yaml
  • dataset_convert_en.yaml
  • dataset_convert_ru.yaml
  • evaluate.yaml
  • export.yaml
  • finetune.yaml
  • infer_onnx.yaml
  • infer.yaml
  • train_citrinet_256.yaml
  • train_citrinet_bpe.yaml

These specification files are available for customization and use. There is functionality for everything from preprocessing and model evaluation to inference and exporting the model. This takes you through the whole journey of developing or customizing models without the need to build elaborate code bases. With the spec files downloaded, you can now proceed to preprocessing the data.

Step 2: Preprocess the dataset

For this walkthrough, you use CMU’s AN4 Dataset, a small census dataset that contains recordings of addresses, numbers, and other personal information. This is similar to the type of transcription that is required in the initial steps of customer support conversations. A larger custom dataset with similar content can be used for a real-world application.

After downloading the AN4 archive (an4_sphere.tar.gz), extract it with the following command:

tar -xvf an4_sphere.tar.gz

The TAO Toolkit training and fine-tuning modules expect data to be present in a specific format. This preprocessing can be done using the dataset_convert command. We package specification files for AN4 and Mozilla’s Common Voice dataset along with the TAO launcher. You can find these specification files in the directory that you defined in Step 1.

The conversion produces manifest files (Figure 3) that contain the following information, which is used in the later steps:

  • Path to audio files
  • Duration of each file
  • Content of each file in words

The manifest files specify three things: the path to the audio file, the duration of that file, and the exact transcription of the audio.
Figure 3. Structure of the processed manifest files

Run the conversion with the following command:

tao speech_to_text_citrinet dataset_convert \
    -e <path to specs dir>/speech_to_text_citrinet/dataset_convert_an4.yaml \
    -r <path to result dir>/citrinet/dataset_convert \
    source_data_dir=<path to data dir>/an4

This command converts the audio files to WAV files and generates train and test manifest files. For more information, see Preparing the Dataset.
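
Before moving on, it can be useful to sanity-check a generated manifest. The sketch below assumes that each line of the manifest is a JSON object with the keys audio_filepath, duration, and text. Those key names are an assumption; adjust them if your manifest differs. The helper summarize_manifest and the sample entries are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def summarize_manifest(manifest_path):
    """Return (utterance count, total hours of audio, number of distinct words)."""
    count, seconds, words = 0, 0.0, set()
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            count += 1
            seconds += float(entry["duration"])  # assumed key name
            words.update(entry["text"].split())  # assumed key name
    return count, seconds / 3600.0, len(words)


# Demonstration with two made-up entries written to a temporary manifest.
sample = [
    {"audio_filepath": "/data/an4_converted/wavs/a.wav", "duration": 1.5, "text": "one two"},
    {"audio_filepath": "/data/an4_converted/wavs/b.wav", "duration": 2.5, "text": "two three"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_manifest.json"
    path.write_text("\n".join(json.dumps(e) for e in sample) + "\n", encoding="utf-8")
    print(summarize_manifest(path))
```

A very small total duration or an unexpectedly large vocabulary is often the first hint that the conversion picked up the wrong directory.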

For most models, preprocessing would end here, but the CitriNet model is a special case. It requires further processing in the form of subword tokenization, which creates a subword vocabulary for the text. This is different from Jasper or QuartzNet, where only single characters are regarded as elements of the vocabulary. In CitriNet, a subword can be one or multiple characters. You can create the tokenizer using the following command:

tao speech_to_text_citrinet create_tokenizer \
    -e <path to specs dir>/speech_to_text_citrinet/create_tokenizer.yaml \
    -r <path to result dir>/citrinet/create_tokenizer \
    manifests=<path to data dir>/an4_converted/train_manifest.json \
    output_root=<path to data dir>/an4
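
To build intuition for how a subword vocabulary differs from a character vocabulary, here is a toy byte-pair encoding (BPE) learner. It is a simplified illustration of the general technique, not the tokenizer that create_tokenizer builds. The function names and the tiny corpus are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    # Start from single characters, as a character-based model would.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["seven", "seven", "seventy", "eleven"], num_merges=3)
print(merges)
print(tokenize("seven", merges), tokenize("eleven", merges))
```

With zero merges, every token is a single character. Each additional merge adds one multi-character subword to the vocabulary.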

Up to this point, you’ve set up a tool that provides a low-code or no-code solution for a complex problem like transfer learning. You’ve downloaded a pretrained model, processed audio files into the necessary format, and performed tokenization. You did all this with fewer than 10 commands. Now that all the necessary details have been hashed out, you can proceed to fine-tuning the model.

Step 3: Fine-tuning with hyperparameters

As in the previous steps, you interact with a specification file. For more information, see Creating an Experiment Spec File. You can specify almost everything, from training-specific parameters like the optimizer, to dataset-specific parameters, to the model configuration itself (for example, if you want to adjust the window size for the FFT).

Do you want to change the learning rate and the scheduler, and maybe add a new character in the vocabulary? There’s no need to open your code base and scan through it to make the changes. All these customizations are easily available and shareable across your team. This reduces friction around trying new ideas and sharing the results, as well as the configurations of the models that had better accuracy.

Here are the trainer settings:

trainer:
  max_epochs: 3   # This is low for demo purposes

tlt_checkpoint_interval: 1

change_vocabulary: true

Here are the tokenizer settings:

tokenizer:
  dir: /path/to/subword/vocabulary
  type: "bpe"   # Can be either bpe or wpe

Here are the optimizer settings:

optim:
  name: novograd
  lr: 0.01
  betas: [0.8, 0.5]
  weight_decay: 0.001

  sched:
    name: CosineAnnealing
    warmup_steps: null
    warmup_ratio: null
    min_lr: 0.0
    last_epoch: -1

Here are the dataset settings:

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: /path/to/manifest/file/
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  shuffle: false

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

Finally, to proceed, modify the specification file as required and run the following command. This command fine-tunes the downloaded model using the dataset downloaded earlier. For more information, see Fine-Tuning the Model.

tao speech_to_text_citrinet finetune \
    -e $SPECS_DIR/speech_to_text_citrinet/finetune.yaml \
    -g 1 \
    -k <key used to encode the model> \
    -m <path to downloaded model>/speechtotext_english_citrinet_1024.tlt \
    -r $RESULTS_DIR/citrinet/finetune \
    finetuning_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
    validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
    trainer.max_epochs=1 \
    finetuning_ds.num_workers=1 \
    validation_ds.num_workers=1 \
    trainer.gpus=1
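
The dotted arguments at the end of the command, such as trainer.max_epochs=1, override values from the specification file. The following sketch shows one way to think about that mapping from dotted keys to nested settings. It is a conceptual illustration only, not how the TAO launcher is implemented, and apply_overrides is a hypothetical helper.

```python
import copy
import json


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            value = json.loads(raw_value)  # numbers, true/false, null
        except json.JSONDecodeError:
            value = raw_value  # fall back to a plain string, such as a path
        node = result
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections
        node[leaf] = value
    return result


# A small stand-in for values loaded from a specification file.
spec = {
    "trainer": {"max_epochs": 3},
    "finetuning_ds": {"manifest_filepath": None, "batch_size": 32},
}
overrides = [
    "trainer.max_epochs=1",
    "finetuning_ds.manifest_filepath=/data/an4_converted/train_manifest.json",
    "finetuning_ds.num_workers=1",
]
print(json.dumps(apply_overrides(spec, overrides), indent=2))
```

Values in the specification file act as defaults, and anything passed on the command line takes precedence for that run.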

After fine-tuning or training your model, it is natural to evaluate the model and assess if further fine-tuning is required. To that end, NVIDIA provides capabilities to evaluate your model and run inference.
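
One common way to quantify transcription quality is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a plain-Python illustration of that metric, not the implementation used by the toolkit.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Dynamic-programming edit distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("one two three", "one too three"))
```

Comparing this value on the test manifest before and after fine-tuning gives a simple signal of whether further fine-tuning is worthwhile.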

Exporting the fine-tuned model to Riva

Deploying a model in a production environment presents its own set of challenges. To that end, you can use NVIDIA Riva, a GPU-accelerated AI speech SDK for developing applications like real-time transcription and virtual assistants.

Riva makes use of other NVIDIA products:

  • NVIDIA Triton Inference Server is used to simplify the deployment of models at scale in production.
  • NVIDIA TensorRT is used to accelerate the models and provide better inference performance by optimizing the models for NVIDIA GPUs.

If you are interested in using the model fine-tuned in this walkthrough, you can export it to Riva using the following command. For more information, see Model Export.

tao speech_to_text_citrinet export \
    -e <path to spec dir>/speech_to_text_citrinet/export.yaml \
    -g 1 \
    -k <key used to encode the model> \
    -m <path to results dir>/citrinet/train/checkpoints/trained-model.tlt \
    -r <path to results dir>/citrinet/riva \
    export_format=RIVA

What’s next?

Citrinet for Speech Transcription isn’t the only model or use case that NVIDIA provides. There are multiple use cases and pretrained models in conversational AI and computer vision. For more information, see the NVIDIA TAO Toolkit product page.

In the next post, we cover how to install NVIDIA Riva, deploy these models in a production environment, and use one of the many models in the NGC Catalog. For more information, see Speech Recognition: Deploying Models to Production.
