What is Conversational AI?

As the name suggests, the goal of conversational AI is to allow people to speak naturally with their computing devices, like their phone, HDTV, PC, etc. You speak in your normal voice, and the device understands what you’ve asked, gets an answer, and then speaks it back to you in a voice that sounds natural. Of course, this is much easier said than done: it’s a multi-step process that requires tons of compute power to do well. Amplifying the challenge, all of this work needs to happen in less than a second to deliver a great user experience.

Get Started

How Does Conversational AI Work?

The conversational AI pipeline consists of three major steps: Automatic Speech Recognition (ASR), Natural Language Processing (NLP) or Natural Language Understanding (NLU), and Text-to-Speech (TTS) with voice synthesis. As you talk to a computer, the ASR phase converts the audio signal into text, the NLP stage interprets the question and generates a smart response, and finally the TTS phase converts the text into speech signals to generate audio for the user.
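The three stages above can be sketched as a simple function composition. This is a minimal illustration of the data flow only; the function bodies are hypothetical placeholders, not a real speech API.

```python
# Minimal sketch of the conversational AI pipeline: audio -> text -> text -> audio.
# Each stage is a placeholder standing in for a trained deep learning model.

def asr(audio_signal: bytes) -> str:
    """Automatic Speech Recognition: audio in, transcribed text out."""
    # A real system would run an acoustic model and a decoder here.
    return "what is the weather today"

def nlp(question: str) -> str:
    """Natural Language Understanding: interpret the text, produce a response."""
    # A real system would classify intent, extract entities, and query a backend.
    if "weather" in question:
        return "It is sunny and 72 degrees."
    return "Sorry, I didn't understand that."

def tts(text: str) -> bytes:
    """Text-to-Speech: text in, synthesized audio out."""
    # A real system would generate a spectrogram and vocode it into a waveform.
    return text.encode("utf-8")  # stand-in for audio samples

def conversational_ai(audio_in: bytes) -> bytes:
    """End-to-end: listen, understand, speak."""
    return tts(nlp(asr(audio_in)))
```

Note that each stage's output type is the next stage's input type, which is what lets the three models chain into one pipeline.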

Automatic Speech Recognition (ASR)

ASR is where the pipeline begins. In this critical first step, the system converts speech to text. ASR lets users compose hands-free text messages to friends, and helps individuals who are deaf or hard of hearing interact with spoken-word communications. ASR also provides a foundation for machine understanding: human language becomes searchable and actionable, giving developers the ability to derive advanced analytics like speaker identification or sentiment analysis.
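To see what "searchable and actionable" means in practice, here is a toy example over hypothetical call-center transcripts. The keyword search and the tiny word-list "sentiment" score are illustrative stand-ins for real analytics models.

```python
# After ASR, spoken language is just text, so it can be indexed and analyzed.
# Transcripts and word lists below are invented for illustration.

transcripts = {
    "call_001": "i love the new phone but the battery dies too fast",
    "call_002": "shipping was quick and the support agent was great",
}

def search(transcripts: dict, keyword: str) -> list:
    """Return the IDs of transcripts that mention the keyword."""
    return [cid for cid, text in transcripts.items() if keyword in text]

def naive_sentiment(text: str) -> int:
    """Toy lexicon score: +1 per positive word, -1 per negative word."""
    positive, negative = {"love", "great", "quick"}, {"dies", "slow", "bad"}
    words = text.split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)
```

A production system would replace `naive_sentiment` with a trained classifier, but the shape of the workflow is the same: transcribe once, then query and score the text at will.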

Pretrained ASR models and notebooks available from NGC.

Try Today

Natural Language Processing / Understanding (NLP/NLU)

Once the ASR step has converted the spoken question into text, an AI service needs to understand the language, intent, and context, then perform a smart action and deliver a timely response. Deep learning has been applied to language understanding because of its ability to generalize across contexts and respond accurately to language tasks.

For example, typical language tasks that a conversational AI system needs to perform include question answering (QA), entity recognition, intent recognition, and sentiment analysis. Progress on such tasks is tracked on leaderboards like Stanford’s SQuAD leaderboard.
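To make the intent recognition task concrete, here is a deliberately simple rule-based classifier. Real conversational AI systems use deep models fine-tuned on labeled intents; the keyword sets and intent names below are invented for illustration, and this sketch only shows the shape of the task: utterance in, intent label out.

```python
# Toy intent classifier: maps an utterance to the intent whose keyword set
# overlaps it the most. All intents and keywords are hypothetical.

INTENT_KEYWORDS = {
    "weather.query": {"weather", "rain", "sunny", "forecast"},
    "music.play": {"play", "song", "music"},
    "alarm.set": {"alarm", "wake", "remind"},
}

def classify_intent(utterance: str) -> str:
    words = set(utterance.lower().split())
    # Pick the intent with the largest keyword overlap.
    best = max(INTENT_KEYWORDS, key=lambda i: len(words & INTENT_KEYWORDS[i]))
    # If nothing overlaps at all, fall back to an "unknown" label.
    return best if words & INTENT_KEYWORDS[best] else "unknown"
```

A deep model replaces the keyword overlap with a learned scoring function, which is what lets it generalize to phrasings the rules never anticipated.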

For language understanding specifically, today’s models have surpassed human-level accuracy on several of these tasks. In 2017, the field of language understanding was revolutionized by a new model architecture called the Transformer. A variant of the Transformer, called Bidirectional Encoder Representations from Transformers (BERT), has surpassed human accuracy on several NLP workloads, and developers continue to build on that success. Achieving this level of understanding has also grown both the size and complexity of the model: the large variant of BERT has roughly 340 million parameters that need to be trained to achieve these results.
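A back-of-the-envelope calculation shows where a number of that magnitude comes from. Using the BERT-Large architecture constants (24 layers, hidden size 1024, feed-forward size 4096, a WordPiece vocabulary of about 30,522 tokens), a rough count lands near the figure cited above:

```python
# Approximate parameter count for BERT-Large from its architecture constants.
L, H, FFN, V, MAX_POS, TYPES = 24, 1024, 4096, 30522, 512, 2

# Token, position, and segment embeddings, plus the embedding LayerNorm.
embeddings = (V + MAX_POS + TYPES) * H + 2 * H

per_layer = (
    4 * (H * H + H)      # Q, K, V, and attention output projections (+ biases)
    + 2 * H              # post-attention LayerNorm
    + (H * FFN + FFN)    # feed-forward up-projection
    + (FFN * H + H)      # feed-forward down-projection
    + 2 * H              # post-feed-forward LayerNorm
)

pooler = H * H + H       # pooling head on the [CLS] token

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.0f}M parameters")
```

This counts roughly 335 million parameters; the commonly quoted "340 million" is this figure rounded up. Either way, the embeddings and the 24 Transformer layers dominate the total.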

Pretrained NLP models and notebooks available from NGC.

Try Today

Speech Synthesis / Text-To-Speech (TTS)

The last phase of the conversational AI pipeline is where the service converts text back to speech, so that the system’s reply to your question sounds naturally human. This vocal clarity is achieved using deep neural networks that give the AI human-like intonation and clear articulation of words. The step is broken into two parts: an initial synthesis pass, then a refinement pass. The synthesis pass is the reverse of the ASR process, converting text into spoken words. On its own, however, it yields a voice that sounds unnatural, both in articulation and cadence. The refinement pass takes that “rough” vocal expression and applies algorithms that dramatically improve the voice’s inflection and rhythm, yielding a voice that sounds decidedly more human.
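The two-pass idea can be sketched as follows. Everything here is a placeholder, not a real TTS API: the "audio frames" are just numbers, and the prosody rule is an invented stand-in for a learned model.

```python
# Sketch of the two-pass TTS idea: a flat first rendering, then a pass that
# imposes inflection and rhythm. All values are illustrative placeholders.

def synthesize_rough(text: str) -> list:
    """First pass: map text to a flat, robotic sequence of pitch 'frames'."""
    # Stand-in: one constant pitch value per character, with no variation.
    return [100.0 for _ in text]

def apply_prosody(frames: list) -> list:
    """Second pass: impose a pitch contour on the rough frames."""
    # Stand-in rule: pitch starts high and falls toward the end, the way
    # pitch typically declines over a declarative sentence.
    n = len(frames)
    return [f * (1.2 - 0.4 * i / max(n - 1, 1)) for i, f in enumerate(frames)]

def text_to_speech(text: str) -> list:
    return apply_prosody(synthesize_rough(text))
```

In a real system, both passes are deep networks, but the division of labor is the same: first get the words out, then make them sound human.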

Pretrained TTS models and notebooks available from NGC.

Try Today

How to Build Conversational AI Applications

Deploying a service with conversational AI can seem daunting, but NVIDIA now has tools to make this process easier, including Neural Modules (NeMo for short), TensorRT, and soon, a new technology called NVIDIA Jarvis.

Build and fine-tune models with Neural Modules Toolkit

Neural Modules (NeMo) is a new open source toolkit that makes it possible to easily and safely compose complex neural network architectures for conversational AI using reusable components. Built for speed, NeMo can scale out training to multiple GPUs and multiple nodes.
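The core idea of composing reusable, typed components can be illustrated with a small pure-Python sketch. To be clear, this is not NeMo’s actual API (see the NeMo repository for real usage); the class and intent names are invented, and the sketch only shows the pattern of chaining modules whose input and output types must line up.

```python
# Illustrative sketch of composing typed, reusable modules into a pipeline.
# Not NeMo's API: names and types here are hypothetical.

class NeuralModule:
    """A processing block that declares its input and output types."""
    input_type = output_type = None

    def process(self, data):
        raise NotImplementedError

    def __rshift__(self, other):
        # Chaining checks that the types line up before any data flows.
        assert self.output_type == other.input_type, "type mismatch"
        pipeline = NeuralModule()
        pipeline.input_type, pipeline.output_type = self.input_type, other.output_type
        pipeline.process = lambda data: other.process(self.process(data))
        return pipeline

class AudioToText(NeuralModule):
    input_type, output_type = "audio", "text"
    def process(self, data):
        return "hello world"  # placeholder for an ASR model

class TextToIntent(NeuralModule):
    input_type, output_type = "text", "intent"
    def process(self, data):
        return "greeting" if "hello" in data else "unknown"

# Compose the two modules; the type check passes because text -> text.
pipeline = AudioToText() >> TextToIntent()
```

Checking types at composition time, before training or inference starts, is what makes this style of assembly safe: an incompatible pair of modules fails immediately rather than deep inside a training run.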

Try today from:

NGC | GitHub

Deploy real-time apps in production with NVIDIA TensorRT

Once your model is built and trained, NVIDIA TensorRT can optimize performance and ease deployment. TensorRT includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput at every stage of the conversational AI pipeline.

Try today from:

NGC | GitHub

Create multimodal AI apps with NVIDIA Jarvis

NVIDIA Jarvis is an SDK for easily building and deploying AI applications that fuse vision, speech, and other sensors. Jarvis provides several base modules for speech tasks such as intent and entity classification, sentiment analysis, dialog modeling, and domain and fulfillment mapping.


Apply For Early Access

Industry Applications for Conversational AI

Healthcare

One of the difficulties facing healthcare is making it easily accessible. Calling your doctor’s office and waiting on hold is a common occurrence, and connecting with a claims representative can be equally difficult. Using natural language processing (NLP) to train chatbots is an emerging approach within healthcare to address the shortage of healthcare professionals and open the lines of communication with patients.

Another key healthcare application for NLP is biomedical text mining, an area often referred to as BioNLP. Given the large volume of biological literature and the increasing rate of biomedical publication, natural language processing is a critical tool for extracting information from published studies to advance knowledge in the biomedical field, aiding drug discovery and disease diagnosis.

Financial Services

Natural language processing (NLP) is a critically important part of building better chatbots and AI assistants for financial services firms. Among the many language models used in NLP-based applications, BERT has emerged as a leader. NVIDIA has recently broken records for speed in training BERT, which promises to help the billions of conversational AI services expected to come online in the coming years operate with human-level comprehension. For example, by leveraging NLP, banks can assess the creditworthiness of clients with little or no credit history.

Retail

In addition to healthcare, chatbot technology is commonly used in retail to accurately analyze customer queries and generate responses or recommendations. This streamlines the customer journey and improves efficiencies in store operations. NLP is also used for text mining customer feedback and sentiment analysis.


Ready to start building conversational AI applications?

Get Started