Developer Blog

In the past several months, many of us have grown accustomed to seeing our doctors over a video call. It’s certainly convenient, but after the call ends, those important pieces of advice from your doctor start to slip away. What was that new medication I needed to take? Were there any side effects to watch out for?

Conversational AI can help in building an application to transcribe speech as well as highlight important phrases from that transcript. NVIDIA Jarvis is a platform that reduces your time in building and deploying state-of-the-art deep learning models that can be used for these tasks.

Two people hold a video call while Jarvis transcribes the conversation and tags medical terms.
Figure 1. Video chat, with transcription and NER using NVIDIA Jarvis.

In this post, we show you how to build a web app that can transcribe speech from a live video chat and tag key phrases in the transcript. The video chat uses PeerJS, an open-source, peer-to-peer chat framework based on WebRTC. For live transcription, you use automatic speech recognition (ASR) from Jarvis. Tagging the key phrases in the transcript uses named entity recognition (NER), also from Jarvis. We also show you how to train an NER model using data from the medical domain. Though we do include code examples, we omit some technical details in the interest of clarity, so we encourage you to take a look at the Jarvis Samples Docker container.

The starting point for this app is a simple peer-to-peer, video call web app. It contains the following resources:

  • One HTML page
  • One client JavaScript file
  • One server JavaScript file, to host the assets and set up the peer-to-peer connection

We kept the tutorial to a bare minimum, so keep in mind that a real application should be significantly more complex. It would include things like identity and session management, alerts, analytics, and more robust network handling.

Block diagram: web client holds a peer-to-peer video chat and a chat demo server negotiates the call connection
Figure 2. Basic peer-to-peer video chat.

In this post, we focus on how to add ASR and NLP capabilities to the web app and skip some details about the app structure. To summarize the app, it’s a simple server, implemented in Node.js, that uses Express to host web assets, and PeerJS to help clients connect to each other in a peer-to-peer, WebRTC video chat. On the client, the browser loads the web page, then talks to the server to help set up a connection with a peer. After that peer connection is established, the two clients communicate directly with each other. Video is no longer routed through the server.

At this point, users can load the webpage, contact another user, and have a live video chat.

Adding ASR and NLP

NVIDIA Jarvis is an SDK to quickly deploy high-performance, conversational AI services. The Jarvis quick start resources offer an easy-to-follow guide to deployment to the Jarvis inference server. After downloading the resources to your server, this comes down to a few basic steps:

  1. Configure the deployment in
  2. Download, optimize, and prepare the models by running
  3. Launch the Jarvis skills server, with

After the server is launched, it creates several gRPC endpoints to help the app communicate with Jarvis. To ensure everything is working correctly, try launching the client container from the server where you’ve set up Jarvis. Call, then check out the example clients, explore the notebooks, and get a sense for what Jarvis offers.

Block diagram: web client streams audio to the server. Chat demo server connects to NVIDIA Jarvis over gRPC. Jarvis provides speech recognition and named entity recognition.
Figure 3. Peer-to-peer chat, with ASR and NLP from Jarvis.

Figure 3 shows the main components of the app, now that you’ve added Jarvis. The chat demo server (Node.js app) still sets up the video call, and now it also communicates with the Jarvis server.

In the app, you use Jarvis for two features: to get a streaming transcript of the conversation and to tag key phrases (named entities) in that transcript. To do this, extract an audio stream from the client and pass that audio to the Node.js server. The server calls Jarvis, using gRPC, to get transcripts and named entities, and passes the results back to the client. The client can then render the transcripts in the browser and pass the transcripts over the peer-to-peer connection so that both users can see the whole conversation.

Data flow: web client sends audio to the Node.js app. The app sends an ASR request to Jarvis and gets a transcript. The app sends a NER request to Jarvis and gets entities.
Figure 4. Data flow through the app.

Getting audio from the web client

At the client, you can get access to an audio stream by tapping into the local WebRTC stream that is being sent to the peer for the video chat. In the client JavaScript file, initialize the Jarvis connection with the server when the user chooses  Start. You are sending audio data over a socket connection, so first make sure that the socket is active:

socket = socketio.on('connect', function() {
    console.log('Connected to speech server');

WebRTC audio works with the concept of a processing graph. To do some work with audio in the browser, do the following:

  1. Connect to an audio source, in this case from the local video-chat stream.
  2. Create a processing node to do some work with that audio.
  3. Connect that new node back to the original destination, where the audio was going originally, before you got involved.

Every time that you get a full buffer of audio coming into the new processing node, resample it using a web worker and send the resampled buffer over the socket connection to the server. Set up the audio source connection and initialize the resampler:

audio_context = new AudioContext();
sampleRate = audio_context.sampleRate;
let audioInput = audio_context.createMediaStreamSource(localStream);
let bufferSize = 4096;
let recorder = audio_context.createScriptProcessor(bufferSize, 1, 1);
let worker = new Worker(resampleWorker);
    command: 'init',
    config: {
        sampleRate: sampleRate,
        outputSampleRate: 16000

The browser fires an event every time the buffer fills up, so tell the processor node what to do with it. Resample it using the worker thread and then pass it to the server using the socket connection:

recorder.onaudioprocess =  function(audioProcessingEvent) {
    let inputBuffer = audioProcessingEvent.inputBuffer;
        command: 'convert',
        // You only need the first channel
        buffer: inputBuffer.getChannelData(0)
    worker.onmessage = function(msg) {
        if ( == 'newBuffer') {

It isn’t completely necessary to resample audio before sending it to Jarvis. Jarvis can do its own resampling. However, doing it in the browser both reduces bandwidth requirements and simplifies some of the differences from one recording source to another. Now, you can connect the new processor node into the audio graph, both the source audio and the destination:


At this point, the app can pull audio from the user’s microphone, resample the stream, and send the resampled audio to the server using a socket. Next, we show you how to use this audio on the server.

Routing the audio to Jarvis

The main server script, implemented in Node.js, uses an Express server and Socket.IO to handle the incoming connections. When the socket first connects, set up the Jarvis connection.

io.on('connect', (socket) => {
    console.log('Client connected from %s', socket.handshake.address);
    // Initialize Jarvis
    socket.handshake.session.asr = new ASRPipe();
        var nlpResult;
        if (result.transcript == undefined) {
        // Final transcripts also get sent to NLP before returning
        if (result.is_final) {
            .then(function(nerResult) {
                result.annotations = nerResult;
                socket.emit('transcript', result);
            }, function(error) {
                result.annotations = {err: error};
                socket.emit('transcript', result);
        } else {
            socket.emit('transcript', result);

There are a few things happening here. You create a new ASRPipe and attach it to the socket’s handshake.session object so that you have a separate Jarvis stream associated with each client connection. Do some basic Jarvis setup with setupASR, then start the ASR listening loop.

The ASR listening loop is asynchronous. You periodically send it batches of audio data, and it periodically sends results through a callback function. The callback is the function that you pass to mainASR. In streaming mode, Jarvis can send back two kinds of results: an interim hypothesis, which tends to change as more audio comes in (giving more context), and a final transcript. Transcripts tend to come back as “final” whenever there is a short pause in the audio, like when the speaker takes a breath. You send both kinds of results back to the client, but when you do get finalized results, you also send those transcripts to the NLP service for NER. Either way, pass results back to the client over the same socket connection, using a transcript event.

With Socket.IO, you set up listeners for specific events. One of those events was mentioned earlier:  the audio_in event that fires whenever the client sends a bundle of audio data. At the server end, add a listener to the same io.on('connect') scope used to initialize Jarvis.

socket.on('audio_in', (data) => {
    socket.handshake.session.asr.recognizeStream.write({audio_content: data});

This part is simple because it doesn’t need to do much. Having set up the Jarvis stream when you connected the socket, all you have to do with new audio content is pass it on.

Sending speech recognition requests

Now look at the gRPC interface itself, starting with ASR. When connecting to a gRPC service using Node.js, there are three main steps:

  • Import the Jarvis API using protocol buffers.
  • Write convenience functions around the API.
  • Mediate data between the client and the Jarvis functions.

In the asr.js module, where you define the ASRPipe class called earlier, first import the Jarvis API:

const jAudio = require('./protos/src/jarvis_proto/audio_pb');
var asrProto = 'src/jarvis_proto/jarvis_asr.proto';
var protoOptions = {
    keepCase: true,
    longs: String,
    enums: String,
    defaults: true,
    oneofs: true,
    includeDirs: [protoRoot]
var asrPkgDef = protoLoader.loadSync(asrProto, protoOptions);
var jAsr = grpc.loadPackageDefinition(asrPkgDef).nvidia.jarvis.asr;

Then, define the ASRPipe class along with the setup function called earlier:

class ASRPipe {
    setupASR() {
        // the Jarvis ASR client
        this.asrClient = new jAsr.JarvisASR(process.env.JARVIS_API_URL, grpc.credentials.createInsecure());
        this.firstRequest = {
            streaming_config: {
                config: {
                    encoding: jAudio.AudioEncoding.LINEAR_PCM,
                    sample_rate_hertz: 16000,
                    language_code: ‘en-US’,
                    max_alternatives: 1,
                    enable_automatic_punctuation: true
                interim_results: true

Here, you create a Jarvis ASR client and define some configuration parameters, which get sent as the first request to the stream when it’s open. In the same class definition, specify the mainASR function to set up the actual ASR stream:

async mainASR(transcription_cb) {
    this.recognizeStream = this.asrClient.streamingRecognize()
    .on('data', function(data){
        if (data.results == undefined || data.results[0] == undefined) {
        // transcription_cb is the result-handling callback
            transcript: data.results[0].alternatives[0].transcript,
            is_final: data.results[0].is_final
    .on('error', (error) => {
        console.log('Error via streamingRecognize callback');
    .on('end', () => {
        console.log('StreamingRecognize end');
    // First request to the stream is the configuration

The streamingRecognize function is asynchronous. The data event is triggered whenever Jarvis has results to send, so repackage those results and send them to the callback function from earlier.

Sending NER requests

Calling the Jarvis NER service is more straightforward. Load the NLP API as before, then call a ClassifyTokens function with each line of text to process. Each request sends the text along with the Jarvis-deployed model to use. Do some post-processing if needed, in a function called computeSpans, then hand off the results.

function getJarvisNer(text) {
    var entities;
    req = { text: , model: {model_name: process.env.JARVIS_NER_MODEL} };
    return new Promise(function(resolve, reject) {
        nlpClient.ClassifyTokens(req, function(err, resp_ner) {
            if (err) {
            } else {
                entities = computeSpans(text, resp_ner.results[0].results);
                resolve({ner: entities});

At this point, you’re finished with the gRPC calls to Jarvis. You can capture audio at the client, send it in a streaming connection to Jarvis to get transcripts, and tag named entities in the text. Every time that Jarvis sends back results, pass them on to the user’s web client over a socket with a transcript event. Now, complete the circuit by handling these results back in the browser.

Rendering the results in the browser

Now that you have transcripts with annotations coming back to the web client, display them in the browser. Recall that all the client-server communication is happening over a Socket.IO connection, so set up a listener for the transcript event that carries the results.

socket.on('transcript', function(result) {
    document.getElementById('input_field').value = result.transcript;
    if (result.is_final) {
        // Erase input field
        showAnnotatedTranscript(username, result.annotations, result.transcript);
        // Send the transcript to the peer to render
        if (peerConn != undefined && callActive) {
            peerConn.send({from: username, type: 'transcript', annotations: result.annotations, text: result.transcript});

The input_field element is a handy place in the web UI to show interim transcripts, which get updated in real time as you speak. In the full app, you use the same field to send text-only requests. When the transcripts are marked as final, show them in a separate box and send a copy of the transcript to the other person on the call, so that you can see both sides of the conversation.

Rendering the transcript itself is standard HTML and CSS. To make your life easier, use the excellent displaCy-ENT to mark named entities in line with the text.

Fine-tune a model for medical NER

By default, Jarvis offers an NER model that handles entities like Location, Person, Organization, and Time. This is great for many apps, like understanding news articles and building chatbots. Earlier, we discussed how conversational AI might help a telemedicine app. Here’s how you might train an NER model for Jarvis to tag medical entities.

Training models from scratch is generally a time-intensive process. Instead of starting from a fresh new model, you could use an existing trained model and fine-tune it on custom data. The NVIDIA Transfer Learning Toolkit (TLT) is a Python-based AI toolkit specifically designed to reduce the time required for fine-tuning and customizing pretrained models with your data.

Because medical data can be highly sensitive, it’s not always easy to find online. One excellent NER dataset comes from the 2010 i2b2/VA challenge, which contains de-identified physician notes tagged for Problems (like diseases or symptoms), Treatments (including medications), and Tests. You can apply for access to the dataset, which is a standard competitive benchmark used in the medical NLP community.

NER data is usually provided in some form of IOB tagging, where each token in the text is labelled as either Beginning an entity, Inside an entity (not the first word), or Outside. For medical text, it often looks something like the following:


Coronary artery disease , status post coronary artery bypass graft .


B-problem I-problem I-problem O O O B-treatment I-treatment I-treatment I-treatment O

This is the data to use as input for the TLT. For a thorough guide to using TLT to train an NER model, see the training notebook in the TLT-Jarvis NER Collection. In this case, you start with a pretrained language model checkpoint, bert-base-uncased, and tune it for the NER task using the preprocessed i2b2 data.

Training and deploying a custom model requires a few steps. Starting with the pretrained checkpoint, you fine-tune the model in TLT with the data. Export the model, again with TLT, into an optimized form for Jarvis. Providing some basic deployment settings to Jarvis, build an intermediate form that bundles the configuration. Then, deploy that package to create a running Jarvis server. For more information, see NVIDIA Jarvis Speech Skills.

Block diagram: 1) TLT train 2) TLT export 3) Jarvis build 4) Jarvis deploy
Figure 5. Training and deploying a fine-tuned model.

The following command fine-tunes the pretrained model on your custom data using TLT:

!tlt  token_classification train \
      -e $SPECS_DIR/train.yaml \      # Specification file
      -g 1 \
      -k $KEY \
      -r $RESULTS_DIR/medical_ner \
      data_dir={destination_mount}/data/i2b2 \
      model.label_ids={destination_mount}/data/i2b2/label_ids.csv \

After this is complete, TLT saves the model in a file called trained-model.tlt. The next step is to export this model to an ejrvs format that Jarvis can use for deployment:

!tlt  token_classification export \
      -e $SPECS_DIR/export.yaml \      # Specification file
      -g 1 \
      -m $RESULTS_DIR/medical_ner/checkpoints/trained-model.tlt \
      -k $KEY \
      -r $RESULTS_DIR/medical_ner \

The model is now exported as exported-model.ejrvs, which you can use in Jarvis.

Using the Jarvis ServiceMaker Docker image, build and deploy the new model.

docker pull
docker run --gpus all -it --rm
      -v $RESULTS_DIR/medical_ner:/servicemaker-dev
      -v $JARVIS_REPO_DIR:/data
jarvis-build token_classification
jarvis-deploy /data/med-ner.jmir /data/models -f

The --IOB flag tells Jarvis to interpret the model output as an IOB-tagged NER model, which simplifies the model output. $JARVIS_REPO_DIR is the location of the Jarvis repository, created when you ran from the quick start scripts. That repository contains a models subdirectory with all of the deployed models, including the default general-domain NER. When you call jarvis-deploy, Jarvis inserts the new NER model into that location.

With this new NER model in place, you can now get medical-domain tagging in the app, appearing in real time through the conversation.

Transcript with NER tagging for ‘my blood sugar levels (TEST)’, ‘type two diabetes (PROBLEM)’, and ‘an insulin pump (TREATMENT)’.
Figure 6. Medical-domain tagging using an NER model tuned with TLT.

Deploy to production

Jarvis is designed to be highly scalable and apps developed using the Jarvis framework can be deployed either on the cloud or in an on-premises Kubernetes cluster. Jarvis provides a sample Helm chart that can be used to get started:

Install Kubernetes, Helm 3.0, and the NVIDIA GPU Operator for Kubernetes on your cluster. Next, download the Jarvis AI services Helm chart from NGC.

export NGC_API_KEY=<your_api_key>
helm fetch --username='$oauthtoken' --password=<YOUR API KEY>

After untarring the compressed folder, find the files required for deployment under /jarvis-api.

├── Chart.yaml
├── templates
│   ├── deployment.yaml
│   ├── _helpers.tpl
│   └── service.yaml
└── values.yaml

The Chart.yaml file contains information about the Helm deployment, like name, version, and so on. To change the deployment configuration, look at the values.yaml file and change the configuration as required:

  • replicaCount: Number of Jarvis service replicas.
  • speechServices [asr | nlp | tts]: Three Boolean parameters to enable speech services.
  • ngcModelConfigs: Model configurations to download from NGC.
  • service: Load-balancing service to deploy in production. 

The Kubernetes deployment file that reads values from the values.yaml file is in the templates folder. The Jarvis sample deployment on the Kubernetes cluster performs the following operations:

  1. Find the GPU nodes and pull the Jarvis Speech Docker container with pretrained models.
  2. Mount Docker volumes containing model directories.
  3. Pull, set up, and run Triton Inference Server.
  4. Open ports for inbound inference requests and the outbound response.
  5. Set up the Prometheus service for pulling the GPU and inference metrics.

Finally, to deploy the Jarvis server, run the following command:

helm install jarvis_server jarvis-api

Alternatively, use the --set option to install without modifying the values.yaml file. Make sure to set the NGC_API_KEY,, and model_key_string values appropriately. By default, the model_key_string option is set to tlt_encode.

helm install jarvis-api --set ngcCredentials.password=`echo -n $NGC_API_KEY | base64 -w0` --set --set modelRepoGenerator.modelDeployKey=`echo -n model_key_string | base64 -w0`

> NAME: jarvis-api
LAST DEPLOYED: Thu Jan 28 12:05:36 2021
NAMESPACE: default
STATUS: deployed

Check the logs to see whether the Jarvis server is deployed without any errors:

kubectl get pods
kubectl logs <pod name>

To make inference requests to Jarvis server, you must get the IP address of the load balancer:

kubectl get services
> NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                                                                        AGE
jarvis-api   LoadBalancer   8000:30034/TCP,8001:31749/TCP,8002:30708/TCP,50051:30513/TCP,60051:31739/TCP   2m19s
kubernetes   ClusterIP       <none>                                                                   443/TCP                                                                        123m

The EXTERNAL-IP value can then be used in env.txt as the external endpoint:

 JARVIS_API_URL= <external-IP> 

In an ideal microservices deployment architecture, the sample web app should also be included in the Helm deployment. However, for this post, leave the Node.js app out of the cluster environment. Use the cluster IP address from the previous command in the sample app and test the Jarvis ASR and NLP capabilities at scale.


It’s difficult to build a highly performant and scalable conversational AI app that is customized for your use case. In this post, we discussed how you can easily add audio transcription and named entity recognition capabilities to an existing app using NVIDIA Jarvis. We also went over how you can customize the app using Transfer Learning Toolkit and how to use a Helm chart for deploying the application at scale. You can get started for free by downloading Jarvis today.