This post was updated to include information on the NVIDIA Jarvis open beta.

Real-time conversational AI is a complex and challenging task. To allow real-time, natural interaction with an end user, the models need to complete computation in under 300 milliseconds. Natural interactions are challenging because they require multimodal sensory integration. Model pipelines are also complex and require coordination across multiple services:

  • Automatic speech recognition (ASR)
  • Natural language understanding (NLU)
  • Domain-specific fulfillment services
  • Text-to-speech (TTS)
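
To make the 300 ms budget concrete, here is a minimal sketch that sums per-stage latencies for the four services listed above and checks the total against the budget. The stage timings are made-up placeholder values, not Jarvis measurements, and the `within_budget` helper is hypothetical.

```python
# Illustrative only: the per-stage latencies below are placeholder values,
# not measurements of any Jarvis service.
BUDGET_MS = 300  # end-to-end budget for a natural-feeling interaction


def within_budget(stage_latencies_ms, budget_ms=BUDGET_MS):
    """Return (total_ms, ok) for a pipeline whose stages run one after another."""
    total = sum(stage_latencies_ms.values())
    return total, total <= budget_ms


pipeline = {"asr": 80, "nlu": 40, "fulfillment": 60, "tts": 90}
total, ok = within_budget(pipeline)
print(f"total={total} ms, within budget: {ok}")
```

Because the stages run in sequence, a slowdown in any one of them eats directly into the time left for the others.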

NVIDIA Jarvis is an end-to-end framework for building conversational AI applications. It includes GPU-optimized services for ASR, NLU, TTS, and computer vision that use state-of-the-art deep learning models.

You can fuse these skills to form multimodal skills in your applications. You can develop novel SOTA model architectures using NVIDIA NeMo and use the NVIDIA Transfer Learning Toolkit to fine-tune models on your custom datasets to get the highest accuracy possible. And you can use other tools in Jarvis to optimize these models for inference, deploy them, and run them as services at scale.

Jarvis is designed to help you access conversational AI functionality easily and quickly. With a few commands, you can access the high-performance services through API operations and try multimodal demos.

Jarvis framework

Jarvis is a fully accelerated application framework for building multimodal conversational AI services that use an end-to-end deep learning pipeline (Figure 1).

The Jarvis framework includes pretrained conversational AI models, tools, and optimized end-to-end services for speech, vision, and NLU tasks. In addition to AI services, Jarvis enables you to fuse vision, audio, and other sensor inputs simultaneously to deliver skills such as multi-user, multi-context conversations in applications such as virtual assistants, multi-user diarization, and call center assistants.

Using Jarvis, you can easily fine-tune state-of-the-art models on your data to achieve a deeper understanding of your specific context. You can then optimize the models for inference to offer real-time services that run in 150 ms, compared to the 25 seconds required on CPU-only platforms.

Figure 1. Jarvis is a platform for multimodal conversational AI development and deployment at scale.

Task-specific AI services and gRPC endpoints provide out-of-the-box, high-performance ASR, NLU, text-to-speech (TTS), and a wide range of computer vision AI services. All these AI services are trained on thousands of hours of data from public and internal datasets to reach high accuracy. You can start using the pretrained models or fine-tune them with your own dataset to further improve model performance.

Another main component is Jarvis Core, which is designed to enable you to create sophisticated, multimodal, conversational AI applications. It includes Jarvis Dialog Manager, which is responsible for the following tasks:

  • Context switching in conversations with multiple users
  • Dialog state tracking
  • Addressing user requests with a fulfillment engine

Domains, intents, and entities returned by the Jarvis NLP service are used as the input to the dialog manager, which outputs the next action to take together with a text response. The dialog manager works closely with the fulfillment engine, which is responsible for retrieving domain-specific information to satisfy the user query and executing user-requested commands.
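
The flow from NLU output to next action can be sketched in a few lines. This is a toy illustration of the idea, not the Jarvis Dialog Manager API: `NLUResult`, `next_action`, `weather_fulfillment`, and the domain and intent names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NLUResult:
    """Hypothetical container for what an NLU service returns."""
    domain: str
    intent: str
    entities: dict = field(default_factory=dict)


def weather_fulfillment(entities):
    # Stand-in for a fulfillment engine that would query a real weather service.
    city = entities.get("location", "your area")
    return f"It is sunny in {city}."


# Maps (domain, intent) pairs to the fulfillment function that handles them.
FULFILLMENT = {("weather", "get_forecast"): weather_fulfillment}


def next_action(nlu, state):
    """Map an NLU result to (action, text response) and update the dialog state."""
    # Dialog state tracking: remember entities mentioned in earlier turns.
    state.update(nlu.entities)
    handler = FULFILLMENT.get((nlu.domain, nlu.intent))
    if handler is None:
        return "ask_clarification", "Sorry, I did not understand that."
    return "respond", handler(state)


state = {}
first = next_action(NLUResult("weather", "get_forecast", {"location": "Paris"}), state)
second = next_action(NLUResult("weather", "get_forecast"), state)  # reuses tracked location
print(first)
print(second)
```

The second turn carries no location entity, yet it is still answered for Paris because the tracked state fills the gap.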

Figure 2 shows the three major building blocks that are available.

Figure 2. Jarvis components: Multimodal Skills, Core Components, basic services. 

Jarvis Core also provides building blocks such as Sensor Management and Multimodal Fusion to help you manage the complex challenges of synchronizing different sensory input streams as well as the different timing of launching AI services.

The Multimodal Skills component combines task-specific services to form complex multimodal services. All multimodal applications can be written in a computational graph, with each node being an AI service.
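
As a rough illustration of the computational-graph idea, the sketch below runs four stand-in services in dependency order using Python's standard `graphlib` module (Python 3.9 or later). The service functions, node names, and `run_graph` helper are hypothetical and only mimic the shape of such a graph. They are not how Jarvis defines or executes multimodal skills.

```python
from graphlib import TopologicalSorter


# Hypothetical services: each receives the results so far and returns its output.
def asr(results):
    return "what is the weather"


def vision(results):
    return {"speaker": "user_1"}


def nlu(results):
    return {"intent": "get_forecast", "text": results["asr"]}


def fusion(results):
    # Combine language understanding with what the vision service observed.
    return {**results["nlu"], **results["vision"]}


NODES = {"asr": asr, "vision": vision, "nlu": nlu, "fusion": fusion}
# Maps each node to the set of nodes whose outputs it depends on.
EDGES = {"asr": set(), "vision": set(), "nlu": {"asr"}, "fusion": {"nlu", "vision"}}


def run_graph(nodes, edges):
    """Run every node after all of its dependencies have produced a result."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        results[name] = nodes[name](results)
    return results


out = run_graph(NODES, EDGES)
print(out["fusion"])
```

Nodes without mutual dependencies, such as `asr` and `vision` here, are the ones a real scheduler could run in parallel.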

Jarvis leverages Triton to serve multiple models for efficient and robust resource allocation, as well as to achieve high performance in terms of high throughput, low latency, and high accuracy.

For an inspiring demo of what can be built, here’s an introductory video to Jarvis:

Video: Example of using Jarvis.

Jarvis services

To take full advantage of the computational power of the GPUs, Jarvis uses Triton to serve neural networks and ensemble pipelines that run efficiently with TensorRT.

The services that Jarvis provides are exposed through API operations that are accessible through gRPC endpoints, which hide all the complexity from application developers.

Figure 3 shows what the system looks like on the server side. The gRPC API operations are exposed by the API server (running in a Docker container), which is responsible for processing all incoming and outgoing computer vision, speech, and natural language processing data.

Figure 3. Jarvis services include multiple pipelines.

The API server sends inference requests to Triton and receives the results.

Triton is the backend server that processes multiple inference requests on multiple GPUs for many neural networks or ensemble pipelines at the same time.

For conversational AI applications, it is important to keep the latency below a given threshold. This usually translates into executing inference requests as soon as they arrive. To saturate the GPUs and increase throughput, however, you must increase the batch size, which means delaying inference execution until more requests are received and a bigger batch can be formed.
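
The latency versus batch size trade-off can be simulated in plain Python. This is a simplified model of the idea, not Triton's batching implementation: `form_batches`, its parameters, and the arrival times are all illustrative assumptions.

```python
def form_batches(arrival_times_ms, max_batch_size, max_delay_ms):
    """Group request arrival times into batches.

    A batch is dispatched when it is full, or when the oldest queued request
    has waited max_delay_ms. Returns (dispatch_time_ms, [arrival times]) tuples.
    """
    batches, current = [], []
    for t in sorted(arrival_times_ms):
        # Flush first if the oldest queued request hit its deadline before t.
        if current and t > current[0] + max_delay_ms:
            batches.append((current[0] + max_delay_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # Leftover requests are dispatched when the oldest one reaches its deadline.
        batches.append((current[0] + max_delay_ms, current))
    return batches


arrivals = [0, 1, 2, 3, 20, 21]  # made-up request arrival times in ms
for delay in (0, 5):
    batches = form_batches(arrivals, max_batch_size=4, max_delay_ms=delay)
    sizes = [len(reqs) for _, reqs in batches]
    worst_wait = max(dispatch - min(reqs) for dispatch, reqs in batches)
    print(f"max_delay={delay} ms -> batch sizes {sizes}, worst wait {worst_wait} ms")
```

With no allowed delay, every request runs alone. Allowing a few milliseconds of queueing yields larger batches at the cost of a bounded extra wait per request.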

Triton is also responsible for the context switch of networks with state between one request and another.
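
To show what keeping state between requests involves, here is a toy sketch in which each conversation has its own state that must be restored whenever a request for it arrives. The `StatefulRunner` class and the word-counting "model" are hypothetical stand-ins and do not reflect Triton's API.

```python
class StatefulRunner:
    """Keeps per-sequence state between requests (illustrative only)."""

    def __init__(self, step_fn, initial_state):
        self._step_fn = step_fn
        self._initial_state = initial_state
        self._states = {}  # sequence id -> state carried over to the next request

    def infer(self, sequence_id, chunk, end=False):
        # Restore the state that belongs to this sequence, or start fresh.
        state = self._states.get(sequence_id, self._initial_state)
        output, new_state = self._step_fn(state, chunk)
        if end:
            self._states.pop(sequence_id, None)  # free state when the sequence ends
        else:
            self._states[sequence_id] = new_state
        return output


def count_words(state, chunk):
    """Toy stateful model: running word count across chunks of one sequence."""
    total = state + len(chunk.split())
    return total, total


runner = StatefulRunner(count_words, 0)
a1 = runner.infer("user_a", "hello there")
b1 = runner.infer("user_b", "hi")  # interleaved request from another user
a2 = runner.infer("user_a", "how are you", end=True)
a3 = runner.infer("user_a", "new")  # starts over after the sequence ended
print(a1, b1, a2, a3)
```

The interleaved request from `user_b` does not disturb the running count for `user_a`, which is the property a streaming ASR model needs when many users share one server.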

Jarvis can be installed directly on bare metal through simple scripts that download the appropriate models and containers from NVIDIA NGC, or it can be deployed on Kubernetes through a Helm chart, which is also provided.

I mentioned earlier how the gRPC endpoints hide the complexity of the system from application developers. Here's a quick look at how you can interact with Jarvis.

On the client side, a Python interface makes the communication with a Jarvis server easier through simple Python API operations.

As an example, here’s how a request for an existing TTS Jarvis service is created.

First, import the Jarvis API:

import grpc
import numpy as np

import src.jarvis_proto.jarvis_tts_pb2 as jtts
import src.jarvis_proto.jarvis_tts_pb2_grpc as jtts_srv
import src.jarvis_proto.audio_pb2 as ja

Next, create a gRPC channel to the Jarvis endpoint:

channel = grpc.insecure_channel('localhost:50051')
jarvis_tts = jtts_srv.JarvisTTSStub(channel)

Then, create a TTS request:

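req = jtts.SynthesizeSpeechRequest()
req.text = "We know what we are, but not what we may be?"
req.language_code = "en-US"                 # language of the input text
req.encoding = ja.AudioEncoding.LINEAR_PCM  # request uncompressed PCM audio
req.sample_rate_hz = 22050                  # output sample rate in Hz
req.voice_name = "ljspeech"                 # voice used for synthesis
resp = jarvis_tts.Synthesize(req)           # send the request to the TTS service
# Interpret the returned audio bytes as 32-bit float samples
audio_samples = np.frombuffer(resp.audio, dtype=np.float32)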
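
Once you have the samples, a natural next step is to save them for playback. The following sketch uses only the Python standard library to convert raw 32-bit float samples to a 16-bit PCM WAV file. It assumes mono audio with samples in the range [-1, 1], which the request above does not state explicitly. The `float32_bytes_to_wav` helper is hypothetical, and a synthetic tone stands in for the bytes of a real TTS response.

```python
import array
import io
import math
import wave


def float32_bytes_to_wav(raw, sample_rate_hz, out):
    """Write raw float32 mono samples in [-1, 1] to `out` as a 16-bit PCM WAV."""
    samples = array.array("f")
    samples.frombytes(raw)
    # Clip to [-1, 1], then scale to the signed 16-bit integer range.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono audio
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm.tobytes())


# Stand-in for the audio bytes of a TTS response: 0.1 s of a 440 Hz tone.
rate = 22050
tone = array.array(
    "f", (0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10))
)
buffer = io.BytesIO()
float32_bytes_to_wav(tone.tobytes(), rate, buffer)
print(f"wrote {len(buffer.getvalue())} bytes of WAV data")
```

To write to disk instead of memory, pass a file opened in binary write mode as `out`.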

Training a model with your data

By using NVIDIA NeMo or the NVIDIA Transfer Learning Toolkit, you can use a custom trained model in Jarvis (Figure 4).

Figure 4. Jarvis from training to deployment: Using NeMo or TLT, you can train or fine-tune models to improve the accuracy of Jarvis services. Trained models are then exported with TensorRT for optimized inference and deployed in Jarvis.

For instance, to further improve the legibility and accuracy of ASR-transcribed text, you may want to add a custom punctuation and capitalization model to the ASR system, which usually generates text without those features.

Starting from a pretrained BERT model, the first step is to prepare the dataset. For every word in the training dataset, the goal is to predict the following:

  • The punctuation mark that should follow the word.
  • Whether the word should be capitalized.
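
The two per-word targets can be derived from ordinary punctuated text. The sketch below is one simple way to do that and is not the dataset preparation script that ships with NeMo. The `make_labels` helper, the choice of punctuation marks, and the use of "O" for "no mark" are assumptions made for illustration.

```python
import string

PUNCT = ",.?"  # marks to predict in this sketch; other punctuation is dropped


def make_labels(sentence):
    """Return (word, punctuation_label, capitalized) triples for one sentence.

    A punctuation label of "O" means that no mark follows the word.
    """
    examples = []
    for token in sentence.split():
        punct = token[-1] if token[-1] in PUNCT else "O"
        word = token.strip(string.punctuation)
        if not word:
            continue  # the token consisted of punctuation only
        # The model input is the lowercased word; the labels keep what was removed.
        examples.append((word.lower(), punct, word[0].isupper()))
    return examples


labels = make_labels("Hello, how are you?")
print(labels)
```

The lowercased, unpunctuated words imitate raw ASR output, while the labels hold what the model must learn to restore.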

After the dataset is ready, the next step is training by running a script that is already provided.

When the training is completed and the desired final accuracy is reached, create the model repository for Triton by using an included script.

The Jarvis documentation contains more details on how to train or fine-tune other models. This example illustrates only one of the many customization possibilities using NeMo.

Deploying a model in Jarvis

Lastly, Jarvis is designed for conversational AI at scale. To help you serve models efficiently and robustly across different servers, NVIDIA provides push-button model deployment using Helm charts (Figure 5).

Figure 5. Models can be deployed in Jarvis by modifying the available Helm chart.

The Helm chart performs several functions:

  • Pulls Docker images from NGC for the Jarvis Services API server, Triton Inference Server, and utility containers for downloading and converting models.
  • Generates the Triton Inference Server model repository.
  • Starts the Triton Inference Server with the appropriate configuration.
  • Exposes the Triton Inference Server and Jarvis servers as Kubernetes services.

The Helm chart configuration can be modified for custom use cases. You can change settings related to which models to deploy, where to store them, and how to expose the services.


Jarvis is available as an open beta to members of the NVIDIA Developer Program. If you have use cases such as virtual assistants, digital avatars, or any standalone ASR, NLP, or TTS use case such as transcription, Jarvis is here to enable your development. To get access to cutting-edge conversational AI technology and NVIDIA experts, apply to the Jarvis Beta Program.

For more information, join the upcoming webinar, Training and Deploying Conversational AI Applications with NeMo and Jarvis.