At the core of understanding people correctly and having natural conversations is automatic speech recognition (ASR). To make customer-led voice assistants and automate customer service interactions over the phone, companies must solve the unique challenge of gaining a caller’s trust through qualities such as understanding, empathy, and clarity.
Telephony-bound voice is inherently challenging from a speech recognition perspective. Background noise, poor call quality, and various dialects and accents make understanding a caller’s words difficult. Traditional language understanding systems have limited support for voice in general, and how a person speaks differs fundamentally from how they type or text.
In this post, we discuss PolyAI’s exploration of third-party, out-of-the-box, and in-house customized NVIDIA Riva ASR solutions. The goal is to deliver voice experiences that let callers speak however they like, providing helpful and natural responses at every turn of the conversation. The in-house fine-tuned Riva ASR models delivered a notable accuracy improvement across a variety of real-world customer call validation datasets.
Out-of-the-box ASR challenges for effective customer interactions
Out-of-the-box ASR tools are typically designed for quiet environments and for speakers who enunciate clearly and have expected accents. These systems can’t predict what a caller will say, how they might say it, or their speaking tempo. While out-of-the-box solutions can be useful, they can’t be tailored to specific business needs and objectives.
To achieve accurate voice assistants that handle customer interactions efficiently, organizations require an ASR system that can be fine-tuned to significantly reduce word error rate (WER).
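For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not PolyAI’s evaluation code. The function names and the lowercase, whitespace-based normalization are assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage of the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, WER can exceed 100% on short utterances: a one-word reference such as "yes" transcribed as "yes please" already scores 100%.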
Advantages and challenges of building an in-house ASR solution
To truly understand people from different places, with different accents, and in noisy environments, conversational systems can use multiple ASR systems, phoneme matching, biasing keywords, and post-processing tools.
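One simple way to picture keyword biasing as a post-processing step is to rescore an ASR n-best list so that hypotheses containing domain keywords are preferred. The sketch below is a hypothetical illustration of that idea, not PolyAI’s or Riva’s implementation. The function name, the `(text, score)` tuple format (higher score is better), and the additive `bonus` are all assumptions.

```python
def pick_hypothesis(
    nbest: list[tuple[str, float]],
    keywords: set[str],
    bonus: float = 1.0,
) -> str:
    """Return the hypothesis with the best score after keyword biasing."""

    def biased_score(item: tuple[str, float]) -> float:
        text, score = item
        # Count how many words in the hypothesis are biased keywords.
        hits = sum(1 for word in text.lower().split() if word in keywords)
        return score + bonus * hits

    return max(nbest, key=biased_score)[0]


# Hypothetical n-best list for a hotel-booking call (scores are made up).
nbest = [
    ("i want to book a sweet", -4.0),
    ("i want to book a suite", -4.5),
]
print(pick_hypothesis(nbest, keywords={"suite", "reservation"}))
```

Without any keywords, the function falls back to the highest-scoring hypothesis, so the bias only changes the outcome when a domain term is actually present.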
The machine learning team at PolyAI rigorously tested numerous ASR systems, often on multiple models, and applied spoken language understanding (SLU) principles to improve transcription accuracy (Figure 1). This work significantly improved the accuracy of speech recognition in real customer phone calls.
Optimizing the caller experience further required the development of an in-house solution.
The PolyAI tech stack enables voice assistants to accurately understand alphanumeric inputs and people from different places, with different accents, and in noisy environments.
Developing an in-house solution offers the following advantages:
- Better accuracy and performance with flexible fine-tuning of model parameters on extensive data and voice activity detector (VAD) adaptation for the specific ways in which people talk with the system.
- Full compliance with a bring-your-own-cloud (BYOC) approach that delivers the model and the whole conversational system to clients with zero data transfers to third-party providers.
With great benefits comes a unique set of challenges. Building an in-house solution requires heavy investment in the following areas:
- Expensive pretraining data: Most models require large quantities of good quality, annotated, pretraining data.
- Latency optimization: This area is often overlooked in the research process. Unlike chat, a voice conversation operates on a scale of milliseconds, and every millisecond counts. Latency added at the start of the conversation leaves even less time for calling large language models (LLMs) or text-to-speech (TTS) models.
Choosing and fine-tuning ASR models for an in-house solution
After an extensive search for an ASR solution that addresses the challenges of building in-house, PolyAI decided to use NVIDIA Riva for the following reasons:
- Cutting-edge accuracy of pretrained models trained on a substantial volume of conversational speech data.
- Enhanced accuracy with full model customization, including acoustic model customization for different accents, noisy environments, or poor audio quality.
- High inference performance based on tight coupling with NVIDIA Triton Inference Server, which is battle-tested for machine learning serving.
Initial trials with an in-house ASR model provided valuable insights into the fine-tuning process. This led to the development of a robust and flexible fine-tuning methodology, incorporating diverse validation sets to ensure optimal performance.
Conversational system for testing out-of-the-box and in-house ASR solutions
Typical conversational systems use public switched telephone networks (PSTN) or session initiation protocol (SIP) connections to transfer calls into the tech stack.
Call information from these systems is then sent to third-party ASR cloud service providers or in-house ASR solutions. For PolyAI’s testing of ASR solutions (Figure 2), after a call is transcribed, it is sent to a PolyAI voice assistant, where natural language models generate a response. The response is then converted back into audio through in-house TTS or third-party providers.
Creating a real-world ASR testing dataset
PolyAI identified 20 hours of its most challenging conversations, split equally between UK and US calls, to test the accuracy of third-party, out-of-the-box, and in-house ASR solutions. These were calls from noisy environments and calls where other ASR models, whether in-house or from third-party providers, had previously failed.
These failure calls varied from single-word utterances, such as ‘yes’ or ‘no’ answers, to much longer responses. PolyAI manually annotated them, holding the annotations themselves to a word error rate (WER) below 1%, which is essential when fine-tuning ASR models.
Notable accuracy improvement of an in-house customized ASR solution
Fine-tuning two in-house ASR models on only 20 hours of data already produced a notable mean WER improvement for the US English model, reducing it by ~8.4% compared to the best model from a cloud service provider (CSP) (Table 1). Choosing the right model matters: the worst-performing out-of-the-box CSP model had a mean WER of 44.51%.
Even more remarkable is that the median WER of the in-house US English ASR solution reached 0%. This result was validated across various datasets, confirming that the fine-tuning was not overfitting to a specific use case. This versatility allows the model to perform well across different projects where people use specific keywords, enabling the accurate understanding of particular phrases and enhancing overall median performance.
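A median WER of 0% alongside a much higher mean is what one would expect when WER is computed per utterance and then aggregated: more than half of the utterances are transcribed perfectly, while a smaller number of hard utterances pull the mean up. The sketch below illustrates the effect with made-up numbers. The per-utterance aggregation and the values are assumptions, not PolyAI data.

```python
from statistics import mean, median

# Hypothetical per-utterance WER values in percent (not PolyAI data).
# Four of seven utterances are transcribed perfectly.
per_utterance_wer = [0.0, 0.0, 0.0, 0.0, 25.0, 50.0, 100.0]

mean_wer = mean(per_utterance_wer)      # pulled up by the hard utterances
median_wer = median(per_utterance_wer)  # the middle value after sorting

print(f"mean WER:   {mean_wer:.2f}%")
print(f"median WER: {median_wer:.2f}%")
```

Short utterances amplify this gap, because a single wrong word in a one-word answer already yields a WER of 100% for that utterance.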
US English | Provider | Model | Language | WER Mean [%] | WER Median [%] |
0 | PolyAI | Fine-Tuned | En-US | 20.32 | 0.00 |
1 | PolyAI | Fine-Tuned | En-All | 22.19 | 7.14 |
2 | CSP | Best | En-US | 22.22 | 7.69 |
9 | CSP | Worst | En-US | 44.51 | 33.33 |
A similar pattern is observed with the UK English ASR solution (Table 2).
UK English | Provider | Model | Language | WER Mean [%] | WER Median [%] |
0 | PolyAI | Fine-Tuned | En-UK | 20.99 | 8.33 |
1 | PolyAI | Fine-Tuned | En-All | 22.77 | 10.00 |
2 | CSP | Best | En-UK | 25.15 | 14.29 |
9 | CSP | Worst | En-UK | 33.46 | 25.00 |
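The gap between two rows of these tables can be expressed either as an absolute reduction in WER points or as a relative reduction against the baseline. The sketch below applies both to the mean WER values of the region-specific fine-tuned model and the best CSP model. The helper name is hypothetical, and the result depends on which baseline row and rounding are used, so it may not exactly match a rounded figure quoted in the text.

```python
def wer_reduction(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Mean WER values copied from Table 1 (US English) and Table 2 (UK English):
# baseline is the best CSP model, improved is the region-specific fine-tuned model.
us_abs, us_rel = wer_reduction(baseline=22.22, improved=20.32)
uk_abs, uk_rel = wer_reduction(baseline=25.15, improved=20.99)

print(f"US English: {us_abs:.2f} points absolute, {us_rel:.2f}% relative")
print(f"UK English: {uk_abs:.2f} points absolute, {uk_rel:.2f}% relative")
```

Reporting both numbers avoids ambiguity, because a small change in points can be a large relative change when the baseline WER is low.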
Achieving these gains with only 20 hours of fine-tuning data demonstrates the potential for further fine-tuning. More importantly, the in-house fine-tuned ASR model scored the same on a variety of other validation datasets as it did in its original pretrained state.
Summary
To automate customer interactions over the phone effectively, fully customized ASR models play a pivotal role in solving the challenges of the voice channel, including background noise, poor call quality, and various dialects and accents. Dive deeper into PolyAI’s transformative ASR journey and explore the possibilities of speech AI and NVIDIA Riva by checking out Speech AI Day sessions.
PolyAI, part of NVIDIA Inception, provides a customer-led conversational platform for enterprises. To reimagine customer service with a best-in-class voice experience, see PolyAI’s product and sign up for a free trial. Join the conversation on Speech AI in the NVIDIA Riva forum.