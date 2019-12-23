In the last few years, deep learning has improved the state of the art in conversational AI and offered superhuman accuracy on certain tasks. Deep learning has also reduced the need for deep knowledge of linguistics and rule-based techniques for building language services, which has led to widespread adoption across industries like telecommunications, unified communications as a service (UCaaS), retail, healthcare, and finance.

When you present an application with a question, the audio waveform is converted to text during the automatic speech recognition (ASR) stage. The question is then interpreted, and the device generates a smart response during the natural language processing (NLP) stage. Finally, the text is converted into speech signals to generate audio for the user during the text-to-speech (TTS) stage. Several deep learning models are connected into a pipeline to build a conversational AI application.
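The three stages above can be sketched as a chain of functions, where each stage consumes the previous stage's output. This is only an illustrative sketch: `asr_stage`, `nlp_stage`, `tts_stage`, `run_pipeline`, and the canned `ANSWERS` table are hypothetical stand-ins, and a real application would call a trained deep learning model at each step instead of the trivial logic shown here.

```python
# Hypothetical stand-ins for the ASR, NLP, and TTS stages.
# Each function marks the place where a trained model would run.

ANSWERS = {
    "what are your opening hours": "We are open from 9 to 5.",
}
FALLBACK = "Sorry, I did not understand that."


def asr_stage(audio: bytes) -> str:
    """Placeholder ASR: treats the 'audio' bytes as UTF-8 encoded text."""
    return audio.decode("utf-8")


def nlp_stage(question: str) -> str:
    """Placeholder NLP: normalizes the question and looks up a canned reply."""
    key = question.strip().lower().rstrip("?")
    return ANSWERS.get(key, FALLBACK)


def tts_stage(text: str) -> bytes:
    """Placeholder TTS: encodes the reply as bytes instead of synthesizing audio."""
    return text.encode("utf-8")


def run_pipeline(audio: bytes) -> bytes:
    """Connects the stages in order: audio -> text -> reply text -> audio."""
    text = asr_stage(audio)
    reply = nlp_stage(text)
    return tts_stage(reply)


if __name__ == "__main__":
    print(run_pipeline(b"What are your opening hours?"))
```

Keeping each stage behind its own function means one model can be swapped or retrained without touching the others, which is one reason a pipeline layout is convenient.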

Over time, conversational AI models have grown in size and parameter count. The large variant of BERT (Bidirectional Encoder Representations from Transformers), a popular language model, has 340 million parameters. Training such models can take weeks of compute time and is usually performed using deep learning frameworks, such as PyTorch, TensorFlow, and MXNet. Models trained on public datasets rarely meet the quality and performance expectations of enterprise apps, because they lack context for the industry, domain, company, and products.

One approach to address these challenges is to use transfer learning. You can start from a model that was pretrained on a generic dataset and apply transfer learning to fine-tune it with proprietary data for specific use cases. Fine-tuning is far less compute intensive than training the model from scratch.
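As a rough illustration of why adapting a pretrained model is cheaper than training from scratch, the sketch below freezes a tiny "pretrained" layer and trains only a small output head on a handful of domain examples. Note that this is the feature-extraction form of transfer learning, not full fine-tuning, which would also update the pretrained weights. Everything here is an assumption made for illustration: the weights in `PRETRAINED_W`, the four-example `DATA` set, and the helper names are hypothetical, and a real workflow would use a framework such as PyTorch or TensorFlow with an actual pretrained model.

```python
import math

# Frozen "pretrained" layer (hypothetical values). It is never updated below.
PRETRAINED_W = [[1.0, -1.0], [0.5, 0.5]]

# Tiny stand-in for a proprietary, domain-specific dataset: (input, label).
DATA = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]


def features(x):
    """Frozen base: tanh of a fixed linear map of the input."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_W]


def predict(head, x):
    """Trainable head: logistic regression on the frozen features."""
    f = features(x) + [1.0]  # append a constant 1.0 for the bias term
    z = sum(h * fi for h, fi in zip(head, f))
    return 1.0 / (1.0 + math.exp(-z))


def loss(head, data):
    """Mean binary cross-entropy of the head on the dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(head, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train_head(data, epochs=200, lr=0.5):
    """Gradient descent over the three head parameters only."""
    head = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            f = features(x) + [1.0]
            err = predict(head, x) - y
            for i in range(len(head)):
                grad[i] += err * f[i] / len(data)
        head = [h - lr * g for h, g in zip(head, grad)]
    return head


if __name__ == "__main__":
    head = train_head(DATA)
    print("loss before:", loss([0.0, 0.0, 0.0], DATA))
    print("loss after: ", loss(head, DATA))
```

Only three numbers are learned here while the base layer stays fixed, which mirrors, at toy scale, how reusing a pretrained model reduces the amount of computation needed for a new use case.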

During inference, several models need to work together to generate a response for a single query, and they must do so in only a few milliseconds. GPUs are used both to train deep learning models and to perform inference, because they can deliver 10X higher performance than CPU-only platforms. This makes it practical to use the most advanced conversational AI models in production.