What Is Conversational AI?

Conversational AI is the application of machine learning to develop speech- and language-based apps that allow humans to interact naturally with devices, machines, and computers using audio. You use conversational AI when your virtual assistant wakes you up in the morning, when you ask for directions on your commute, or when you communicate with a chatbot while shopping online. You speak in your normal voice, and the device understands, finds the best answer, and replies with speech that sounds natural. However, the technology behind conversational AI is complex: it involves a multi-step process that requires a massive amount of computing power, and the computations must finish in less than 300 milliseconds to deliver a great user experience.

How Does Conversational AI Work?

In the last few years, deep learning has improved the state of the art in conversational AI and offered superhuman accuracy on certain tasks. Deep learning has also reduced the need for deep knowledge of linguistics and rule-based techniques for building language services, which has led to widespread adoption across industries like telecommunications, unified communications as a service (UCaaS), retail, healthcare, and finance.

Typically, the conversational AI pipeline consists of three stages:

  1. Automatic Speech Recognition (ASR)
  2. Natural Language Processing (NLP) or Natural Language Understanding (NLU)
  3. Text-to-Speech (TTS) with voice synthesis

Figure 1: Overview of a conversational AI pipeline

When you present an application with a question, the audio waveform is converted to text during the ASR stage. The question is then interpreted and the device generates a smart response during the NLP stage. Finally, the text is converted into speech signals to generate audio for the user during the TTS stage. Several deep learning models are connected into a pipeline to build a conversational AI application.
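The three stages can be pictured as functions chained together. The following is a minimal sketch, not a real implementation: `fake_asr`, `fake_nlu`, and `fake_tts` are hypothetical stand-ins for trained models, and audio is represented as plain bytes purely for illustration.

```python
from typing import Callable

# Hypothetical stand-ins for real models; each stage is just a function here.
def fake_asr(audio: bytes) -> str:
    """Pretend to transcribe audio by decoding it as UTF-8 text."""
    return audio.decode("utf-8")

def fake_nlu(question: str) -> str:
    """Pretend to understand the question with a tiny lookup table."""
    answers = {"what time is it": "It is noon."}
    return answers.get(question.lower().strip("?! ."), "Sorry, I do not know.")

def fake_tts(text: str) -> bytes:
    """Pretend to synthesize speech by encoding the reply as bytes."""
    return text.encode("utf-8")

def run_pipeline(audio: bytes,
                 asr: Callable[[bytes], str] = fake_asr,
                 nlu: Callable[[str], str] = fake_nlu,
                 tts: Callable[[str], bytes] = fake_tts) -> bytes:
    """Chain the three stages: audio -> text -> reply text -> audio."""
    transcript = asr(audio)
    reply = nlu(transcript)
    return tts(reply)

reply_audio = run_pipeline(b"What time is it?")
print(reply_audio)
```

Because each stage is passed in as a function, any one of them could be swapped for a real model without changing the other two.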

There are several freely available datasets for training models for conversational AI tasks. Over time, the size of conversational AI models and their parameter counts have grown. BERT-Large, a popular language model, has 340 million parameters. Training such models can take weeks of compute time and is usually performed using deep learning frameworks, such as PyTorch, TensorFlow, and MXNet. Models trained on public datasets rarely meet the quality and performance expectations of enterprise apps because they lack context for the industry, domain, company, and products.

One approach to address these challenges is transfer learning. You can start from a model that was pretrained on a generic dataset and fine-tune it for your specific use case using proprietary data. Fine-tuning is far less compute intensive than training the model from scratch.

During inference, several models need to work together to generate a response for a single query, so the latency of each individual model must be only a few milliseconds. GPUs are used to train deep learning models and perform inference because they can deliver up to 10X higher performance than CPU-only platforms. This makes it practical to use the most advanced conversational AI models in production.
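To see why per-model latency matters, it helps to add up the stages against the 300 millisecond target. The sketch below is simple arithmetic; the per-stage numbers are hypothetical and chosen only for illustration, not measurements of any real system.

```python
from typing import Dict

BUDGET_MS = 300.0  # end-to-end target quoted earlier in this article

def remaining_budget(stage_latencies_ms: Dict[str, float],
                     budget_ms: float = BUDGET_MS) -> float:
    """Return the headroom left after summing all stage latencies."""
    return budget_ms - sum(stage_latencies_ms.values())

# Hypothetical per-stage latencies in milliseconds, for illustration only.
stages = {"asr": 60.0, "nlu": 25.0, "tts": 90.0, "network": 80.0}
headroom = remaining_budget(stages)
print(f"headroom: {headroom:.1f} ms")  # headroom: 45.0 ms
```

A negative result would mean the pipeline misses the target, and at least one stage would need to get faster.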

Automatic Speech Recognition

Automatic speech recognition (ASR) takes human voice as input and converts it into readable text. Deep learning has replaced traditional statistical methods, such as Hidden Markov Models and Gaussian Mixture Models, as it offers higher accuracy when identifying phonemes.

Popular deep learning models for ASR include Wav2letter, DeepSpeech, LAS, and, more recently, Citrinet by NVIDIA Research. OpenSeq2Seq is a popular toolkit for developing speech applications using deep learning. Kaldi is a C++ toolkit that, in addition to deep learning modules, supports traditional methods like those mentioned above. GPU-accelerated Kaldi solutions can perform 3,500X faster than real-time audio and 10X faster than CPU-only options.

Figure 2: Example of Automatic Speech Recognition (ASR) pipeline

In a typical ASR application, the first step is to extract useful audio features from the input audio and ignore noise and other irrelevant information. Feature extraction techniques, such as Mel Frequency Cepstral Coefficients (MFCC), capture the spectral features of the audio, typically represented as a spectrogram or mel spectrogram.
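As a rough illustration of this feature extraction step, the sketch below computes the power spectrum of a single audio frame and converts a frequency to the mel scale. It is a simplified sketch under stated assumptions: it uses a naive DFT rather than an FFT, the HTK-style mel formula is one of several in use, and the mel filter bank and cosine transform that a full MFCC computation needs are omitted.

```python
import cmath
import math
from typing import List

def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def power_spectrum(frame: List[float]) -> List[float]:
    """Hann-window one frame and return power for bins 0..N/2 via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

# Toy input: a 1 kHz tone sampled at 8 kHz, one 64-sample frame.
sample_rate, n = 8000, 64
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
power = power_spectrum(frame)
peak_bin = max(range(len(power)), key=power.__getitem__)
peak_hz = peak_bin * sample_rate / n
print(peak_hz, round(hz_to_mel(peak_hz), 1))
```

Stacking such spectra for successive overlapping frames gives a spectrogram; mapping the frequency axis through the mel scale gives a mel spectrogram.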

Spectrograms are passed to a deep learning-based acoustic model to predict the probability of characters at each time step. During training, the acoustic model is trained on datasets, such as the LibriSpeech ASR Corpus, Wall Street Journal, TED-LIUM Corpus, and Google AudioSet, consisting of hundreds of hours of audio and transcriptions in the target language. The acoustic model output can contain repeated characters based on how a word is pronounced.
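One common way to turn per-time-step character probabilities into text is greedy decoding in the style of CTC (Connectionist Temporal Classification): pick the most likely symbol at each step, merge consecutive repeats, and drop a special blank symbol. The article does not say which decoding scheme is used, so treat this as an assumed example; the blank symbol `_` and the toy probabilities are hypothetical.

```python
from itertools import groupby
from typing import Dict, List

BLANK = "_"  # hypothetical blank symbol separating genuine repeated letters

def greedy_collapse(frame_probs: List[Dict[str, float]]) -> str:
    """Greedy CTC-style decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(probs, key=probs.get) for probs in frame_probs]
    merged = [symbol for symbol, _ in groupby(best)]
    return "".join(symbol for symbol in merged if symbol != BLANK)

# Toy per-frame character probabilities, for illustration only.
frames = [
    {"h": 0.9, "_": 0.1},
    {"h": 0.8, "_": 0.2},
    {"e": 0.7, "_": 0.3},
    {"l": 0.6, "_": 0.4},
    {"_": 0.9, "l": 0.1},
    {"l": 0.8, "_": 0.2},
    {"o": 0.9, "_": 0.1},
    {"o": 0.6, "_": 0.4},
]
print(greedy_collapse(frames))  # hello
```

The blank between the two "l" frames is what keeps the double letter in "hello" from being merged into one.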

The decoder and language model convert these characters into a sequence of words based on context. These words can be further buffered into phrases and sentences, and punctuated appropriately before sending to the next stage.

There are also other training options, such as word-piece encoding and sentence-piece encoding, that you can use to train neural acoustic models to predict characters, words, and sentences. For language models, you can use either an n-gram language model or a neural rescoring language model to determine the correct output sentence.
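To illustrate how an n-gram language model can pick between candidate sentences, here is a minimal bigram model with add-one smoothing. The tiny corpus and the candidates are made up for illustration. A production decoder would combine this score with the acoustic model score and account for sentence length; this sketch ranks by language model score alone.

```python
import math
from collections import Counter
from typing import List, Tuple

def train_bigram(corpus: List[str]) -> Tuple[Counter, Counter, int]:
    """Count context tokens and bigrams, and return the vocabulary size."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(sentence: str, contexts: Counter, bigrams: Counter,
             vocab_size: int) -> float:
    """Sum of add-one-smoothed bigram log probabilities for the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1)
                          / (contexts[prev] + vocab_size))
    return total

# Toy training text and acoustically similar candidates, for illustration only.
corpus = ["recognize speech", "recognize speech quickly", "a nice beach"]
model = train_bigram(corpus)
candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=lambda s: log_prob(s, *model))
print(best)  # recognize speech
```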

To save time, pretrained ASR models, the NVIDIA TAO Toolkit, NVIDIA Riva, training scripts, and performance results are available in the NVIDIA NGC™ catalog.

Natural Language Understanding

Natural language understanding (NLU) takes text as input, understands context and intent, and generates an intelligent response. Deep learning models are applied for NLU because of their ability to accurately generalize over a range of contexts and languages. Transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers), revolutionized progress in NLU by offering accuracy comparable to human baselines on benchmarks like SQuAD for question answering (QA), and on tasks such as entity recognition, intent recognition, and sentiment analysis.

Figure 3: Example of Natural Language Understanding (NLU) pipeline

In an NLU application, the input text is converted into an encoded vector using techniques such as Word2Vec, TF-IDF vectorization, and word embeddings. These vectors are passed to a deep learning model, such as an RNN, LSTM, or Transformer, to understand context. These models provide an appropriate output for a specific language task, like next-word prediction or text summarization, which is used to produce an output sequence.
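As a small example of turning text into vectors, the sketch below implements TF-IDF by hand. It uses the plain, unsmoothed definition (term frequency times log of inverse document frequency); libraries often apply smoothing and normalization, so their numbers will differ. The example sentences are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple

def tfidf_vectors(docs: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log(len(docs) / doc_freq[tok]) for tok in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, vectors

docs = ["book a flight", "book a hotel", "cancel my flight"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

Words that appear in fewer documents, such as "hotel" here, receive a higher weight than words shared across documents, such as "book".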

However, text encoding mechanisms such as one-hot encoding and static word embeddings can make it challenging to capture nuances. For instance, the word "bass" would have the same representation in "bass fish" and "bass player". When encoding a long passage, these mechanisms can also lose, by the end of the passage, the context gained at its beginning. BERT is deeply bidirectional and can understand and retain context better than these other text encoding mechanisms. The key challenge with training language models is the lack of labeled data. BERT is trained on unsupervised tasks and generally uses unstructured datasets, such as BooksCorpus and English Wikipedia.

Figure 4: Workflow for BERT training and fine-tuning on custom dataset

BERT-Large has 340 million parameters, requires a huge corpus, and can take several days of compute time to train from scratch. A common approach is to start from pretrained BERT, add a couple of task-specific layers, and fine-tune on your dataset (as shown in Figure 4). Available open-source datasets for fine-tuning BERT include the Stanford Question Answering Dataset (SQuAD), Multi-Domain Sentiment Analysis, Stanford Sentiment Treebank, and WordNet.
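The idea of reusing a pretrained model and training only a small task-specific layer can be sketched without any deep learning framework. In the example below, `frozen_encoder` is a hypothetical stand-in for a pretrained model (it is a fixed, hand-written feature function, not BERT), and the only trained part is a logistic regression head. Real BERT fine-tuning typically also updates the encoder weights; keeping the encoder frozen here is a simplifying assumption.

```python
import math
from typing import List, Tuple

def frozen_encoder(text: str) -> List[float]:
    """Hypothetical stand-in for a pretrained encoder: fixed, never updated."""
    words = text.lower().split()
    positive = {"great", "good", "love"}
    negative = {"bad", "poor", "hate"}
    return [float(sum(w in positive for w in words)),
            float(sum(w in negative for w in words)),
            1.0]  # constant feature acting as a bias term

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(text: str, weights: List[float]) -> float:
    """Probability of the positive class from the task-specific head."""
    feats = frozen_encoder(text)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)))

def fine_tune_head(examples: List[Tuple[str, int]],
                   epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Train only the head weights with stochastic gradient descent on log loss."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, label in examples:
            feats = frozen_encoder(text)
            error = predict(text, weights) - label  # gradient factor of log loss
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
    return weights

# Tiny made-up labeled dataset (1 = positive, 0 = negative).
train = [("great service", 1), ("love this product", 1),
         ("bad support", 0), ("poor quality hate it", 0)]
weights = fine_tune_head(train)
print(predict("good phone", weights) > 0.5)  # True
```

Because the encoder already maps "good" and "great" to the same feature, the head generalizes to a word it never saw during fine-tuning, which is the benefit transfer learning aims for.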

GPU-accelerated BERT-base can perform inference 17X faster with NVIDIA T4 Tensor Core GPUs than CPU-only solutions. The ability to use unsupervised learning methods, transfer learning with pretrained models, and GPU acceleration has enabled widespread adoption of BERT in the industry.

NGC provides several pretrained NLP models, including BERT, as well as the TAO Toolkit and Riva, along with training scripts and performance results.

Text-to-Speech

The last stage of the conversational AI pipeline involves taking the text response generated by the NLU stage and changing it to natural-sounding speech. This vocal clarity is achieved using deep neural networks that produce human-like intonation and a clear articulation of words. This step is accomplished with two networks: a synthesis network that generates a spectrogram from text, and a vocoder network that generates a waveform from the spectrogram. Popular deep learning models for TTS include RadTTS, FastPitch, HiFiGAN, WaveNet, Tacotron, Deep Voice 1, and Deep Voice 2.

Open-source datasets for TTS include LJ Speech, Nancy, TWEB, and LibriTTS, each of which has a text file associated with the audio. Preparing the input text for synthesis requires text analysis, such as converting text into words and sentences, identifying and expanding abbreviations, and recognizing and analyzing expressions. Expressions include dates, amounts of money, and airport codes.
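The text analysis steps above can be sketched with a few string operations. This is a deliberately small example: the abbreviation table is hypothetical, only whole-dollar amounts are handled, and numbers are not spelled out as words. A production system would also need context to decide, for example, whether "St." means "Street" or "Saint".

```python
import re
from typing import List

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "approx.": "approximately"}

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its full form."""
    for abbreviation, full_form in ABBREVIATIONS.items():
        text = text.replace(abbreviation, full_form)
    return text

def expand_money(text: str) -> str:
    """Rewrite whole-dollar amounts such as "$5" as "5 dollars"."""
    def spoken(match: "re.Match[str]") -> str:
        amount = match.group(1)
        return f"{amount} dollar" + ("" if amount == "1" else "s")
    return re.sub(r"\$(\d+)", spoken, text)

def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def normalize(text: str) -> List[str]:
    """Expand abbreviations and money first, then split into sentences."""
    return split_sentences(expand_money(expand_abbreviations(text)))

for sentence in normalize("Dr. Lee paid $5 for parking. The meter is on Main St. today."):
    print(sentence)
```

Expanding abbreviations before splitting matters: otherwise the period in "Dr." would be mistaken for the end of a sentence.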

Figure 5: Example of Text-to-Speech (TTS) pipeline

The output from text analysis is passed into linguistic analysis for refining pronunciations, calculating the duration of words, deciphering the prosodic structure of the utterance, and understanding grammatical information.

Output from linguistic analysis is then fed to a speech synthesis neural network model, such as FastPitch, which converts the text to mel spectrograms. A neural vocoder model, like HiFiGAN, then converts the spectrograms into natural-sounding speech.

NGC provides pretrained TTS models, along with training scripts and performance results. GPU-accelerated FastPitch and HiFiGAN can perform inference 12X faster on NVIDIA A100 Tensor Core GPUs than Tacotron 2 and WaveGlow on NVIDIA V100 Tensor Core GPUs.

How to Build Conversational AI Applications

Deploying a service with conversational AI can seem daunting, but NVIDIA has pretrained models and tools to make the process easier, including the TAO Toolkit, NVIDIA Riva, and NVIDIA NeMo.

Get Started With Pretrained Models

NVIDIA NGC offers a set of GPU-optimized software for AI, HPC, and visualization. It also includes highly accurate, state-of-the-art pretrained conversational AI models so that developers can get started quickly.

These models are trained over thousands of GPU hours on a variety of open and proprietary datasets to deliver high accuracy on any custom domain.

Train Smarter with TAO Toolkit

The TAO Toolkit is a simplified AI toolkit that can speed up development time by 10X using NVIDIA's production-quality, pretrained models. You can also fine-tune these models on your dataset and deploy them using Riva for high-performance inference.

Build Real-Time Speech AI Applications with NVIDIA Riva

NVIDIA Riva is a GPU-accelerated SDK for developing speech AI applications that delivers real-time performance on GPUs. It includes highly accurate, state-of-the-art pretrained models, tools to customize these models for your domain, and optimized pipelines that run in real time. Riva provides skills for several tasks, such as generating transcriptions, translating, and creating custom voices for every brand and virtual assistant.

Create SOTA Models with NVIDIA NeMo

NVIDIA NeMo is an open-source toolkit for developers that makes it possible to easily and safely compose complex neural network architectures for conversational AI with reusable components. You can plug and play different components to create new state-of-the-art models, or you can fine-tune readily available pretrained models, such as Jasper, QuartzNet, Citrinet, BERT, and FastPitch. Built for speed, NeMo models can be trained on multiple GPUs and multiple nodes.

Industry Applications for Conversational AI


Telecommunications

Call centers are one area where conversational AI can be beneficial in the telecom business. They are the industry’s backbone, handling an average of 2B hours of phone calls daily. Supporting the agents at these call centers saves both time and money. Businesses that integrate conversational AI can assist these agents with real-time recommendations and insights. For instance, by using automatic speech recognition (ASR), one can transcribe customer phone calls in real time, analyze them, and route them to the appropriate person to resolve the query. Additionally, organizations can use the generated transcriptions to understand customer sentiment.

Financial Services

Conversational AI applications are enhancing customer service functions at financial institutions by helping users autonomously manage simple tasks, such as making payments and managing refunds and transactions. They also aid in fraud detection by identifying anomalies from past experiences, activities, and behaviors. In the insurance sector, AI assistants accelerate claims by engaging customers with dynamic conversations.

Unified Communications as a Service

Unified Communications as a Service (UCaaS) offers a wide range of applications and services in the cloud for communication and collaboration. One of the key areas in which UCaaS solutions are used is audio and video conferencing. Speech recognition and neural machine translation can be used in video conferencing apps to generate meeting notes and translations in real time, allowing for smoother conversations with regional speakers. Companies can also incorporate virtual assistants into their web conferencing applications to help with scheduling and facilitating meetings.


Healthcare

Conversational AI is being applied in the healthcare industry both to make healthcare more accessible and to improve the patient care experience. ASR models are being used for physician-dictated notes, for capturing physician and patient consultations, and for automatically converting speech to text for clinical documentation. NLU is being utilized for chatbots that assist patients with selecting the right health insurance plan, onboarding, and appointment scheduling. NLU is also used to extract relevant medical information from a large volume of unstructured data to help with medical diagnoses. Additionally, text-to-speech models help the elderly and people with reduced vision or learning disabilities by reading aloud medical information from websites, medication leaflets, and other digital content.


Retail

Chatbots are also commonly used in retail applications to accurately understand customer queries and to generate responses and recommendations. AI virtual assistants allow customers to shop online using only their voice, bridging the gap between physical and virtual shopping and improving efficiencies in store operations. NLP is also used for mining customer feedback and for sentiment analysis, leading to higher customer retention rates.
