Speech Recognition: Deploying Models to Production

This post is part of a series about generating accurate speech transcription. For part 1, see Speech Recognition: Generating Accurate Domain-Specific Audio Transcriptions Using NVIDIA Riva. For part 2, see Speech Recognition: Customizing Models to Your Domain Using Transfer Learning.

NVIDIA Riva is an AI speech SDK for developing real-time applications like transcription, virtual assistants, and chatbots. It includes pretrained state-of-the-art models in NGC, the TAO toolkit for fine-tuning models on your domain, and optimized skills for high-performance inference. Riva makes it simpler to deploy the models using Riva containers in NGC or on Kubernetes using Helm charts. Riva skills are powered by NVIDIA TensorRT and served through NVIDIA Triton Inference Server.

In this post, we discuss the following topics:

  • Setting up Riva
  • Configuring Riva and deploying your model
  • Inferencing with your model
  • Key takeaways

The contents of this post are available in a Jupyter notebook, which you can download to follow along. For more information about Riva, see Introducing NVIDIA Riva: A GPU-Accelerated SDK for Developing Speech AI Applications.

Setting up Riva

Before setting up NVIDIA Riva, make sure you have the following installed on your system:

If you followed along in part 2, you should have all the prerequisites already installed.

The first step in setting up Riva is to install the NGC Command Line Interface Tool.

Figure 1. Installing the NGC CLI

To log in to the registry, you need an NGC API key.

Figure 2. Getting the NGC API Key

With the tools set up, you can download Riva from the Riva Skills Quick Start resource on NGC. To download the package, use the following command. Check the Riva Skills Quick Start resource for the latest version.

ngc registry resource download-version "nvidia/riva/riva_quickstart:1.6.0-beta"

The downloaded package has the following assets to help you get started:

  • asr_lm_tools: These tools can be used to fine-tune language models.
  • nb_demo_speech_api.ipynb: Getting started notebook for Riva.
  • riva_api-1.6.0b0-py3-none-any.whl and nemo2riva-1.6.0b0-py3-none-any.whl: Wheel files to install Riva and a tool to convert a NeMo model to a Riva model. For more information, see the Inferencing with your model section later in this post.
  • Quick start scripts (riva_*.sh): Scripts to initialize and run Triton Inference Server to serve Riva AI services. For more information, see the Configuring Riva and deploying your model section.
  • Examples: Sample gRPC-based client code.

Configuring Riva and deploying your model

You might be wondering where to start. To streamline the experience, NVIDIA provides a config file for customizing a Riva deployment, where you can adjust the settings for the Riva AI services. For this walkthrough, you rely on the task-specific Riva ASR AI service.

We discuss only a few tweaks here. Because you are working with just ASR, you can safely disable NLP and TTS.

# Enable or Disable Riva Services

If you are following along from part 2, you can set the use_existing_rmirs parameter to true. We discuss this more later in this post.

# Locations to use for storing models artifacts

You can choose the pretrained models to download from the model repository to run without customization.

########## ASR MODELS ##########
### Punctuation model
### Citrinet-1024 Offline w/ CPU decoder,

If you have Riva models from part 2 of this series, first build them into an intermediate format called Riva Model Intermediate Representation (RMIR). You can do this using Riva Service Maker, a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment.

Use the riva-build and riva-deploy commands for this step. For more information, see Deploying Your Custom Model into Riva.

docker run --rm --gpus 0 -v <path to model>:/data <name and version of the container> -- \
    riva-build speech_recognition \
    /data/asr.rmir:<key used to encode the model> \
    /data/<name of the model file>:<key used to encode the model> \
    --offline \
    --chunk_size=1.6 \
    --padding_size=1.6 \
    --ms_per_timestep=80 \
    --greedy_decoder.asr_model_delay=-1 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False

docker run --rm --gpus 0 -v <path to model>:/data <name and version of the container> -- \
    riva-deploy -f /data/asr.rmir:<key used to encode the model> /data/models/
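
If you rebuild models often, the two Service Maker commands can be scripted. The following is a minimal sketch, not an official Riva tool. build_riva_commands is a hypothetical helper that only assembles the argument lists from the flags shown above and runs nothing. The model directory, container image, model file name, and key in the demo call are placeholders that you must replace.

```python
import shlex


def build_riva_commands(model_dir, image, model_file, key):
    """Assemble the riva-build and riva-deploy docker commands.

    Hypothetical helper: it returns argument lists and does not run them.
    """
    docker = ["docker", "run", "--rm", "--gpus", "0",
              "-v", f"{model_dir}:/data", image, "--"]
    build = docker + [
        "riva-build", "speech_recognition",
        f"/data/asr.rmir:{key}",
        f"/data/{model_file}:{key}",
        "--offline",
        "--chunk_size=1.6",
        "--padding_size=1.6",
        "--ms_per_timestep=80",
        "--greedy_decoder.asr_model_delay=-1",
        "--featurizer.use_utterance_norm_params=False",
        "--featurizer.precalc_norm_time_steps=0",
        "--featurizer.precalc_norm_params=False",
    ]
    deploy = docker + ["riva-deploy", "-f",
                       f"/data/asr.rmir:{key}", "/data/models/"]
    return build, deploy


# Placeholder values for illustration only.
build_cmd, deploy_cmd = build_riva_commands(
    "/path/to/model", "<container>:<tag>", "<model file>", "<key>")
print(shlex.join(build_cmd))
print(shlex.join(deploy_cmd))
```

Printing the commands first lets you review them. To execute them, you could pass each list to subprocess.run with check=True.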

Now that you have the model repository set up, the next step is to deploy the model. Although you can do this manually, we recommend using the prepackaged scripts for your first experience. The Quick Start package includes two riva_*.sh scripts, one to initialize and one to start the server, that deploy the models using the exact configuration in the config file.

When you run the initialization script, the following happens:

  • The RMIR files for the models that you selected in the config file are downloaded from NGC to the directory that you specified.
  • For each of the RMIR model files, a corresponding Triton Inference Server model repository is generated. This process may take some time, depending on the number of services and the model that you selected.

To use your custom model, copy the RMIR files to the directory that you specified in the config file (for $riva_model_loc). To deploy the model, run the start script. A riva-speech container is spun up with the models loaded from your selected repository. Now, you can start sending inference requests.
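
The container can take a while to load models, so it may help to wait until the server port accepts connections before sending requests. The following is a minimal sketch that uses only the Python standard library. wait_for_server is a hypothetical helper, and the default of localhost:50051 is taken from the client example later in this post. A successful TCP connection only shows that something is listening, not that every model has finished loading.

```python
import socket
import time


def wait_for_server(host="localhost", port=50051, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection succeeds before the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, calling wait_for_server("localhost", 50051, timeout=120) before creating the gRPC channel avoids a failed first request while the container is still starting.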

Inferencing with your model

To make the most of NVIDIA GPUs, Riva takes advantage of NVIDIA Triton Inference Server and NVIDIA TensorRT. In a conversational setting, applications optimize for the lowest possible latency. However, making fuller use of compute resources requires increasing the batch size (the number of requests processed together), which naturally increases latency. NVIDIA Triton can serve multiple inference requests across multiple models on multiple GPUs, which eases this trade-off.
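
To make the batching trade-off concrete, here is a toy model. It is not a measurement of Riva or Triton. It assumes, purely for illustration, that a batch costs a fixed overhead plus a per-request cost, and that requests arrive at a steady interval, so the first request in a batch waits for the rest to arrive. The function name and all numbers are made up.

```python
def batch_tradeoff(batch_size, arrival_interval_ms=10.0,
                   fixed_cost_ms=20.0, per_request_cost_ms=2.0):
    """Toy model: return (worst_case_latency_ms, peak_throughput_req_per_s)."""
    # The first request waits for the remaining requests to fill the batch.
    wait_ms = (batch_size - 1) * arrival_interval_ms
    # Assumed cost model: fixed overhead plus a linear per-request cost.
    compute_ms = fixed_cost_ms + per_request_cost_ms * batch_size
    latency_ms = wait_ms + compute_ms
    # Requests completed per second if batches run back to back.
    throughput = batch_size / compute_ms * 1000.0
    return latency_ms, throughput


for b in (1, 4, 16):
    latency, throughput = batch_tradeoff(b)
    print(f"batch={b:2d}  latency={latency:6.1f} ms  "
          f"throughput={throughput:6.1f} req/s")
```

Under these assumptions, larger batches amortize the fixed overhead and raise throughput, while the time spent waiting for a batch to fill raises latency for the earliest request.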

You can query these models using a gRPC API in three broad steps: import the libraries, set up the gRPC channel, and get the response.

First, import all dependencies and load the audio. In this case, you are reading audio from a file. We also have a streaming example in the examples folder.

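import argparse
import grpc
import time
import riva_api.audio_pb2 as ra
import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv
import wave

audio_file = "<add path to .wav file>"
server = "localhost:50051"

# Open the file with wave to read its header (for example, the sample rate).
wf = wave.open(audio_file, 'rb')
# Read the raw bytes that are sent to the server.
with open(audio_file, 'rb') as fh:
    data = fh.read()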
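
Before sending audio, it can be useful to check that the file's properties match what you plan to put in the recognition config. The following is a small sketch that uses only the standard-library wave module. describe_wav is a hypothetical helper, and the demo file is generated silence so that the example runs without any real recording.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, sample_width_bytes, duration_s)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()
        duration = wf.getnframes() / rate
    return rate, channels, width, duration


# Demo on a generated file: one second of 16 kHz, mono, 16-bit silence.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

print(describe_wav(demo_path))  # prints (16000, 1, 2, 1.0)
```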

To install all the Riva-specific dependencies, you can use the .whl file provided in the package.

pip3 install riva_api-1.6.0b0-py3-none-any.whl

Next, create a gRPC channel to the Riva endpoint and configure it to use the audio appropriate for your use case.

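channel = grpc.insecure_channel(server)
client = rasr_srv.RivaSpeechRecognitionStub(channel)
# Example settings (assumed: mono PCM WAV, US English); adjust to your audio.
config = rasr.RecognitionConfig(
    encoding=ra.AudioEncoding.LINEAR_PCM, sample_rate_hertz=wf.getframerate(),
    language_code="en-US", max_alternatives=1, audio_channel_count=1)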

Finally, send an inference request to the server and get the response.

request = rasr.RecognizeRequest(config=config, audio=data)

response = client.Recognize(request)
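
To pull plain text out of the response, you can walk its results and take the top alternative from each. The following is a sketch that assumes the response follows the results, alternatives, transcript layout of the Riva ASR API. best_transcripts is a hypothetical helper, and fake_response is a stand-in object with the same shape so that the example runs without a server.

```python
from types import SimpleNamespace


def best_transcripts(response):
    """Collect the top alternative's transcript from each result."""
    return [result.alternatives[0].transcript
            for result in response.results
            if result.alternatives]


# Stand-in with the assumed response shape; replace with the real response.
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript="hello world", confidence=0.9)]),
])
print(best_transcripts(fake_response))  # prints ['hello world']
```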

Key takeaways

This API can be used to build your applications. You can install Riva on a single bare-metal system and get started with this exercise or deploy it at scale using Kubernetes and the provided Helm chart.

Riva has 1-click deployment for speech synthesis, language understanding, and speech recognition.
Figure 3. Typical deployment workflow with NVIDIA Riva

With this Helm chart, you can do the following:

  • Pull the Riva Services API server, Triton Inference Server, and other necessary Docker images from NGC.
  • Generate the Triton Inference Server model repository and start the NVIDIA Triton Server with the selected configuration.
  • Expose the Inference Server and Riva server endpoints to be served as Kubernetes services.

For more information, see Deploying Riva ASR Service on Amazon EKS.


Riva is an end-to-end GPU-accelerated SDK for developing speech applications. In this series, we discussed the significance of speech recognition in industries, walked you through customizing speech recognition models on your domain to deliver world-class accuracy, and showed you how to deploy optimized services that can run in real time using Riva.

For more information about other interesting Riva solutions, see all Riva posts on the Technical Blog.
