Conversational AI Demystified

Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech. You use conversational AI when your virtual assistant wakes you up in the morning, when you ask for directions on your commute, or when you communicate with a chatbot while shopping online. You speak in your normal voice, and the device understands, finds the best answer, and replies with speech that sounds natural. However, the technology behind conversational AI is complex: it is a multi-step process that requires a massive amount of computing power, and the entire computation must finish in less than 300 milliseconds to deliver a great user experience.


How Does Conversational AI Work?

In the last few years, deep learning has improved the state of the art in conversational AI and offered superhuman accuracy on certain tasks. Deep learning has also reduced the need for deep knowledge of linguistics and rule-based techniques when building language services, which has led to widespread adoption across industries like retail, healthcare, and finance.

Typically, the conversational AI pipeline consists of three stages:

  • Automatic Speech Recognition (ASR)
  • Natural Language Processing (NLP) or Natural Language Understanding (NLU)
  • Text-to-Speech (TTS) with voice synthesis

Figure 1: Overview of a conversational AI pipeline

When you present an application with a question, the audio waveform is converted to text during the ASR stage. The question is then interpreted and the device generates a smart response during the NLP stage. Finally, the text is converted into speech signals to generate audio for the user during the TTS stage. Several deep learning models are connected into a pipeline to build a conversational AI application.
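The three stages can be pictured as simple function composition. The following is a minimal sketch rather than a real speech stack: `fake_asr`, `fake_nlu`, and `fake_tts` are hypothetical stand-ins that return placeholder data, so that only the flow of data between stages is shown.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]  # mono audio samples (placeholder representation)


@dataclass
class Pipeline:
    """Chains the three stages; each stage would wrap a trained model in practice."""
    asr: Callable[[Audio], str]
    nlu: Callable[[str], str]
    tts: Callable[[str], Audio]

    def respond(self, question_audio: Audio) -> Audio:
        text = self.asr(question_audio)   # ASR: waveform -> text
        reply = self.nlu(text)            # NLP/NLU: question text -> response text
        return self.tts(reply)            # TTS: response text -> waveform


# Hypothetical stand-ins for illustration only.
def fake_asr(audio: Audio) -> str:
    return "what time is it"


def fake_nlu(text: str) -> str:
    return "It is noon." if "time" in text else "Sorry, I did not understand."


def fake_tts(text: str) -> Audio:
    return [0.0] * len(text)  # one silent placeholder sample per character


pipeline = Pipeline(asr=fake_asr, nlu=fake_nlu, tts=fake_tts)
reply_audio = pipeline.respond([0.1, -0.2, 0.05])
```

Keeping each stage behind a plain callable makes it easy to swap one model for another without touching the rest of the pipeline.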

There are several freely available datasets for training conversational AI models. Over time, the size of these models and their number of parameters have grown. BERT-Large, a popular language model, has 340 million parameters. Training such models can take weeks of compute time and is usually performed using deep learning frameworks such as PyTorch, TensorFlow, and MXNet. Models trained on public datasets rarely meet the quality and performance expectations of enterprise apps because they lack context for the industry, domain, company, and products.

One approach to address these challenges is to use transfer learning. You can start from a model that was pre-trained on a generic dataset and then fine-tune it for a specific use case using proprietary data. Fine-tuning is far less compute intensive than training the model from scratch.
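As a rough illustration of why this is cheaper, the sketch below freezes a "pre-trained" encoder and trains only a small task head on a handful of domain examples. Everything here is an assumption made for illustration: `pretrained_encoder` is a hypothetical toy stand-in for a large model, the head is plain logistic regression, and the four training sentences are invented.

```python
import math
from typing import List, Tuple


def pretrained_encoder(text: str) -> List[float]:
    """Hypothetical frozen encoder: stands in for a large pre-trained model.
    Nothing inside this function is updated during fine-tuning."""
    positive = {"great", "good", "love"}
    negative = {"bad", "slow", "broken"}
    words = text.lower().split()
    return [float(sum(w in positive for w in words)),
            float(sum(w in negative for w in words))]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_head(data: List[Tuple[str, int]], epochs: int = 200,
                   lr: float = 0.5) -> Tuple[List[float], float]:
    """Train only the small task head (logistic regression) with SGD."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in data:
            feats = pretrained_encoder(text)  # frozen features
            pred = sigmoid(sum(w * x for w, x in zip(weights, feats)) + bias)
            err = pred - label                # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * x for w, x in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


# Invented "proprietary" examples: 1 = satisfied customer, 0 = unsatisfied.
domain_data = [("great router love it", 1), ("good signal", 1),
               ("slow and broken", 0), ("bad firmware", 0)]
weights, bias = fine_tune_head(domain_data)


def predict(text: str) -> float:
    feats = pretrained_encoder(text)
    return sigmoid(sum(w * x for w, x in zip(weights, feats)) + bias)
```

Only three numbers (two weights and a bias) are learned here, which is the point of the sketch: the expensive representation is reused, and the task-specific part stays small.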

During inference, several models need to work together to generate a response for a single query, requiring the latency for a single model to be only a few milliseconds. GPUs are used to train deep learning models and perform inference because they can deliver 10X higher performance than CPU-only platforms. This makes it practical to use the most advanced conversational AI models in production.
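A back-of-the-envelope calculation shows why per-model latency must be so small. The sketch below divides the 300 millisecond end-to-end target among the models in a pipeline; the overhead figure and the number of model invocations are assumed values chosen only for illustration.

```python
def per_model_budget_ms(total_budget_ms: float, overhead_ms: float,
                        n_models: int) -> float:
    """Evenly split what remains of the end-to-end budget across model calls."""
    if n_models <= 0:
        raise ValueError("n_models must be positive")
    return (total_budget_ms - overhead_ms) / n_models


# Assumed values: 100 ms for audio buffering, networking, and pre/post-processing,
# and 20 model invocations per query across the ASR, NLU, and TTS stages.
budget = per_model_budget_ms(total_budget_ms=300.0, overhead_ms=100.0, n_models=20)
# budget == 10.0 ms per model invocation
```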

Conversational AI Technologies

Automatic Speech Recognition

Automatic speech recognition (ASR) takes human voice as input and converts it into readable text. Deep learning has replaced traditional statistical methods, such as Hidden Markov Models and Gaussian Mixture Models, as it offers higher accuracy when identifying phonemes.

Popular deep learning models for ASR include Wav2Letter, DeepSpeech, LAS (Listen, Attend and Spell), and, more recently, Jasper by NVIDIA Research. OpenSeq2Seq is a popular toolkit for developing speech applications using deep learning. Kaldi is a C++ toolkit that, in addition to deep learning modules, supports traditional methods like those mentioned above. GPU-accelerated Kaldi solutions can process audio 3500X faster than real time and 10X faster than CPU-only options.

Figure 2: Example of Automatic Speech Recognition (ASR) pipeline

In a typical ASR application, the first step is to extract useful audio features from the input audio and ignore noise and other irrelevant information. Feature extraction techniques such as Mel Frequency Cepstral Coefficients (MFCC) capture the spectral content of the audio, typically represented as a spectrogram or mel spectrogram.
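To make the mel spectrogram idea concrete, the sketch below builds a bank of triangular filters whose centers are evenly spaced on the mel scale and applies it to one frame of a power spectrum. It assumes the widely used formula mel = 2595 · log10(1 + f / 700) and assumes the power spectrum has already been computed (for example with an FFT); the function names are illustrative, not taken from any particular toolkit.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> List[List[float]]:
    """Triangular filters with centers evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2)
    # n_mels + 2 points: left edge, n_mels centers, right edge
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        left, center, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))  # zero outside the triangle
        bank.append(row)
    return bank


def apply_filterbank(power_spectrum: List[float],
                     bank: List[List[float]]) -> List[float]:
    """Collapse one frame of FFT bin powers into mel band energies."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


bank = mel_filterbank(n_mels=8, n_fft=256, sample_rate=16000)
mel_energies = apply_filterbank([1.0] * 129, bank)  # flat spectrum as dummy input
```

Stacking the mel energies of consecutive frames over time gives a mel spectrogram; taking a log and a discrete cosine transform of each frame would then yield MFCCs.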

Spectrograms are passed to a deep learning-based acoustic model that predicts the probability of each character at each time step. The acoustic model is trained on datasets (such as the LibriSpeech ASR Corpus, Wall Street Journal, TED-LIUM Corpus, and Google AudioSet) consisting of hundreds of hours of audio and transcriptions in the target language. The acoustic model output can contain repeated characters, depending on how a word is pronounced.
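One common way to turn per-frame character probabilities into text is greedy CTC-style decoding: pick the most likely symbol in each frame, collapse consecutive repeats, and drop a special blank symbol. The article does not name the decoding scheme, so treat this as an assumed example; the alphabet and probabilities below are invented.

```python
from typing import List, Sequence

BLANK = "_"  # assumed blank symbol separating genuine repeated letters


def greedy_ctc_decode(frame_probs: Sequence[Sequence[float]], alphabet: str) -> str:
    """Pick the best symbol per frame, merge repeats, then remove blanks."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    decoded: List[str] = []
    previous = None
    for symbol in best:
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = "_helo"
frames = "hhe_ll_loo"  # most likely symbol in each of ten time steps
# Invented probabilities: 0.9 for the intended symbol, 0.025 for the others.
frame_probs = [[0.9 if a == c else 0.025 for a in alphabet] for c in frames]
transcript = greedy_ctc_decode(frame_probs, alphabet)
# transcript == "hello"
```

The blank between the two "l" runs is what allows a genuine double letter to survive the merging of repeats.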

The decoder and language model convert these characters into a sequence of words based on context. These words can be further buffered into phrases and sentences, and punctuated appropriately before sending to the next stage.

To save time, pretrained ASR models, Transfer Learning Toolkit, training scripts, and performance results are available on the NGC software hub.


Natural Language Understanding

Natural language understanding (NLU) takes text as input, understands context and intent, and generates an intelligent response. Deep learning models are applied for NLU because of their ability to accurately generalize over a range of contexts and languages. Transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers), revolutionized progress in NLU by offering accuracy comparable to human baselines on benchmarks like SQuAD for question answering (QA), entity recognition, intent recognition, sentiment analysis, and more.

Figure 3: Example of Natural Language Understanding (NLU) pipeline

In an NLU application, the input text is converted into an encoded vector using techniques such as Word2Vec, TF-IDF vectorization, and word embeddings. These vectors are passed to a deep learning model, such as an RNN, LSTM, or Transformer, to understand context. These models produce an output appropriate for a specific language task, such as next-word prediction or text summarization, which is used to produce an output sequence.
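As an example of turning text into vectors, here is a minimal TF-IDF sketch. It uses the plain textbook definition (term frequency times log(N / document frequency)); libraries typically apply smoothed or normalized variants, so the exact numbers would differ there. The three documents are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(docs: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors


docs = ["play the bass guitar", "catch the bass fish", "play the drums"]
vocab, vectors = tfidf_vectors(docs)
the_column = vocab.index("the")
guitar_column = vocab.index("guitar")
```

Words that occur in every document, such as "the", receive a weight of zero, while words specific to one document, such as "guitar", receive the largest weights.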

However, text encoding mechanisms such as one-hot encoding and word embeddings can make it challenging to capture nuances. For instance, the word "bass" would have the same representation in "bass fish" and "bass player". When encoding a long passage, they can also lose, by the end, the context gained at the beginning. BERT (Bidirectional Encoder Representations from Transformers) is deeply bidirectional and can understand and retain context better than these other text encoding mechanisms. A key challenge with training language models is the lack of labeled data. BERT is pre-trained on unsupervised tasks and generally uses unstructured datasets such as BooksCorpus, English Wikipedia, and more.
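The limitation of static encodings is easy to see in code. In the sketch below, each word maps to one fixed vector in a lookup table, so the vector for "bass" cannot depend on its neighbors. The three-dimensional vectors are invented for illustration.

```python
from typing import Dict, List

# Hypothetical static embeddings: one fixed vector per word, whatever the context.
EMBEDDINGS: Dict[str, List[float]] = {
    "bass": [0.21, -0.73, 0.48],
    "fish": [0.90, 0.10, -0.32],
    "player": [-0.41, 0.66, 0.05],
}


def encode(phrase: str) -> List[List[float]]:
    """Look up each word independently of the words around it."""
    return [EMBEDDINGS[word] for word in phrase.lower().split()]


bass_in_fish = encode("bass fish")[0]
bass_in_player = encode("bass player")[0]
same_vector = bass_in_fish == bass_in_player  # True: context is ignored
```

A contextual model such as BERT instead computes each word's representation from the whole sentence, so the two occurrences of "bass" would receive different vectors.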

Figure 4: Workflow for BERT training and fine-tuning on custom dataset

BERT-Large has 340M parameters, requires a huge corpus, and can take several days of compute time to train from scratch. A common approach is to start from pre-trained BERT, add a couple of layers for your task, and fine-tune on your dataset (as shown in Figure 4). Available open-source datasets for fine-tuning BERT include the Stanford Question Answering Dataset (SQuAD), Multi Domain Sentiment Analysis, Stanford Sentiment Treebank, and WordNet.

GPU-accelerated BERT-base can perform inference 17X faster with NVIDIA T4 than CPU-only solutions. The ability to use unsupervised learning methods, transfer learning with pre-trained models, and GPU acceleration has enabled widespread adoption of BERT in the industry.

NGC provides pretrained NLP models, including BERT, along with the Transfer Learning Toolkit for NLP, training scripts, and performance results.


Text-to-Speech

The last stage of the conversational AI pipeline takes the text response generated by the NLU stage and converts it to natural-sounding speech. Vocal clarity is achieved using deep neural networks that produce human-like intonation and clear articulation of words. This step is accomplished with two networks: a synthesis network generates a spectrogram from text, and a vocoder network generates a waveform from the spectrogram. Popular deep learning models for TTS include WaveNet, Tacotron, Deep Voice 1, and Deep Voice 2.

Open-source datasets for TTS include LJ Speech, Nancy, TWEB, and LibriTTS, each of which pairs audio with a text transcript. Preparing the input text for synthesis requires text analysis, such as splitting text into words and sentences, identifying and expanding abbreviations, and recognizing and analyzing expressions. Expressions include dates, amounts of money, and airport codes.
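The text analysis step can be sketched with a few rewrite rules. The example below expands a small set of abbreviations, spells out dollar amounts below 100, and replaces known airport codes. The lookup tables are tiny hypothetical samples; a production normalizer would need far broader coverage and proper handling of dates, larger numbers, and ambiguous abbreviations.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration.
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately"}
AIRPORT_CODES = {"SFO": "San Francisco airport"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def expand_dollars(match: "re.Match[str]") -> str:
    amount = int(match.group(1))
    unit = "dollar" if amount == 1 else "dollars"
    return f"{number_to_words(amount)} {unit}"


def normalize(text: str) -> str:
    """Rewrite written forms into the words a speaker would say."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Only whole-dollar amounts below 100 are handled; larger ones are left as-is.
    text = re.sub(r"\$(\d{1,2})\b", expand_dollars, text)
    # Replace three-letter uppercase tokens only if they are known airport codes.
    text = re.sub(r"\b[A-Z]{3}\b",
                  lambda m: AIRPORT_CODES.get(m.group(0), m.group(0)), text)
    return text


spoken = normalize("Dr. Lee paid $45 for a cab to SFO.")
# spoken == "Doctor Lee paid forty-five dollars for a cab to San Francisco airport."
```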

Figure 5: Example of Text-to-Speech (TTS) pipeline

The output from text analysis is passed to linguistic analysis, which refines pronunciations, calculates the duration of words, deciphers the prosodic structure of the utterance, and extracts grammatical information.

The output from linguistic analysis is then fed to a speech synthesis neural network model, such as Tacotron2, which converts the text to mel spectrograms. These are then passed to a neural vocoder model, such as WaveGlow, to generate natural-sounding speech.

NGC provides pretrained TTS models, along with training scripts and performance results. GPU-accelerated Tacotron2 and WaveGlow can perform inference 9X faster with an NVIDIA T4 GPU than CPU-only solutions.


How to Build Conversational AI Applications

Deploying a service with conversational AI can seem daunting, but NVIDIA now has tools to make this process easier, including Neural Modules (NeMo for short), TensorRT, and, soon, a new technology called NVIDIA Jarvis.

Build SOTA models with NVIDIA NeMo

Neural Modules (NeMo) is a new open source toolkit that makes it possible to easily and safely compose complex neural network architectures for conversational AI using reusable components. Built for speed, NeMo can scale out training to multiple GPUs and multiple nodes.


Train Smarter with Transfer Learning Toolkit

The Transfer Learning Toolkit (TLT) is a simplified AI toolkit that can speed up development by 10x: you start from NVIDIA's production-quality pre-trained models, apply transfer learning with minimal fine-tuning, and deploy with Jarvis.


Deploy real-time apps in production with NVIDIA TensorRT

Once your model is built and trained, NVIDIA TensorRT can optimize performance and ease deployment. TensorRT includes a deep learning inference optimizer and runtime that delivers low latency and high throughput at every stage of the conversational AI pipeline.


Create multimodal AI apps with NVIDIA Jarvis

NVIDIA Jarvis is an SDK for easily building and deploying AI applications that fuse vision, speech, and other sensors. Jarvis provides several base modules for speech tasks, such as intent and entity classification, sentiment analysis, dialog modeling, and domain and fulfillment mapping.



Industry Applications for Conversational AI


Healthcare

One of the difficulties facing healthcare is making it easily accessible. Calling your doctor’s office and waiting on hold is a common occurrence. Connecting with a claims representative can be equally difficult. Using natural language processing (NLP) to train chatbots is an emerging approach within healthcare to address the shortage of healthcare professionals and open the lines of communication with patients.

Another key healthcare application for NLP is biomedical text mining, an area often referred to as BioNLP. Given the large volume of biological literature and the increasing rate of biomedical publications, natural language processing is a critical tool for extracting information from published studies to advance knowledge in the biomedical field, aiding drug discovery and disease diagnosis.

Financial Services

Natural language processing (NLP) is a critically important part of building better chatbots and AI assistants for financial service firms. Among the numerous language models used in NLP-based applications, BERT has emerged as a leading language model. NVIDIA has recently broken records for BERT training speed, which promises to help the billions of conversational AI services expected to come online in the coming years operate with human-level comprehension. NLP also has uses beyond chatbots: for example, banks can use it to assess the creditworthiness of clients with little or no credit history.


Retail

In addition to healthcare, chatbot technology is also commonly used in retail applications to accurately analyze customer queries and generate responses or recommendations. This streamlines the customer journey and improves the efficiency of store operations. NLP is also used for text mining of customer feedback and for sentiment analysis.
