Generative AI / LLMs

Amdocs Accelerates Generative AI Performance and Lowers Costs with NVIDIA NIM

Amdocs NVIDIA NIMs

Telecommunications companies (telcos) are leveraging generative AI to increase employee productivity by automating processes, improving customer experiences, and optimizing network operations. 

Amdocs, a leading provider of software and services for communications and media providers, built amAIz, a domain-specific generative AI platform for telcos as an open, secure, cost-effective, and large language model (LLM)-agnostic framework. Amdocs is using NVIDIA DGX Cloud and NVIDIA AI Enterprise software to provide solutions based on commercially available LLMs, as well as domain-adapted models, enabling service providers to build and deploy enterprise-grade generative AI applications. 

Amdocs is also using NVIDIA NIM, a set of easy-to-use inference microservices designed to accelerate the deployment of generative AI across enterprises. The versatile microservice supports open community models and NVIDIA AI Foundation models from the NVIDIA API catalog, as well as custom AI models. NIM is engineered to facilitate seamless AI inferencing with the highest throughput and lowest latency while preserving the accuracy of predictions. 

Customer billing use case 

In telco contact centers, billing inquiries represent a significant volume of customer calls. They seek explanations due to various operations that could affect their bill, including a customer’s mobile plan, the conclusion of a promotional period, or an unexpected charge. 

Amdocs is developing an LLM-based solution tailored to assist customers by providing immediate and accurate explanations for billing questions. The solution aims to reduce customer service agents’ workloads, enabling them to focus on more complex tasks. 

Figure 1 shows the overall process conducted from data collection and preparation to LLM-finetuning through parameter-efficient techniques to evaluation.

Diagram flows from anonymized billing care and anonymized customer agent transcripts through scenario and intent classification and filtering, bill care q&a generation, and annotations to create the billing care q&a dataset.
Figure 1. Overall process flow from data collection and preparation to LLM finetuning and evaluation

Data collection and preparation 

To tackle the problem, they created a new dataset from anonymized call transcripts and bills, labeled by telco customer service experts. The dataset contains a few hundred annotated questions and answers categorized into relevant scenarios. Most of the data was used for finetuning and the performance is reported on a small test set (a few dozen samples). 

Table 1 shows an example of the data collected. The question is related to billing changes, and the annotated answer is based on historical customer bills.

IDRelevant billsAccount IDAnnotated QuestionFinal Answer (‘label’)Annotated Scenario
id_1[‘id_12345.2310’, ‘id_12345.2311’]id_12345I noticed that my bill has increased recently. Can you explain why this happened?Your bill has increased from $100.02 in October to $115.02 in November primarily due to the expiration of promotional credits on your Internet services. Here are the details:
  – Your internet credit was reduced from -$75.00 in October to -$60.00 in November
Promotion expired
Table 1. Example of collected data for a mobile plan promotion expiration scenario

During the process, Amdocs used the OpenAI GPT-4 LLM as a tool for filtering the transcripts and categorizing transcripts into scenarios. Then, an LLM was used to generate potential question-answer pairs that were revisited and labeled by domain experts.

Data format and prompt engineering 

As a baseline, Amdocs used Llama2-7b-chat, Llama2-13b-chat, and Mixtral-8x7b LLMs to enhance a customer service chatbot with intent classification and bill Q&A capabilities. Amdocs designed prompts with instructions that include the targeted bills (one or two consecutive billing months in raw XML format) followed by the question. 

Initial experiments with baseline LLMs and zero-shot or few-shot inference underperformed mainly due to the complexity of extracting the relevant information from customer bills. In addition, the raw XML format required detailed instructions describing the billing format to the LLM. Consequently, Amdocs faced challenges in incorporating the bills and instructions in the prompts due to the limitations in the maximum context length of some LLMs (for example, 4K tokens for Llama2). 

To fit the context window, Amdocs’ first effort was dedicated to reducing the billing format instructions in the prompt. Figure 2 shows the average token reduction with the reformatted bill going from 3,909 tokens to 1,153 with the Llama2 tokenizer. 

Histogram shows the number of tokens per bill for a test data set consisting of “Initial Format” in Raw XML and “Reformatted” in “JSON/Markdown”.
Figure 2. Number of tokens reduced with the new billing format

LLM fine-tuning on NVIDIA DGX Cloud

Due to the limited volume of annotated data, Amdocs explored parameter-efficient finetuning (PEFT) methods, such as Low-Rank Adaptation (LoRA). They conducted several finetuning experiments with two foundation LLM architectures (Llama2 and Mixtral) exploring several LoRA hyperparameters for one to two epochs. 

Amdocs’ experiments were performed on NVIDIA DGX Cloud, an end-to-end AI platform for developers, offering scalable capacity built on the latest NVIDIA architecture and co-engineered with the world’s leading cloud service providers. Amdocs used NVIDIA DGX Cloud instances with the following components:

  • 8x NVIDIA A100 80GB Tensor Core GPUs
  • 88 CPU cores
  • 1 TB system memory

Finetuning cycles were performed on a multi-GPU setting, leading to less than an hour per cycle. 

LLM deployment with NVIDIA NIM

NVIDIA NIM builds on NVIDIA Triton Inference Server and uses TensorRT-LLM for optimized LLM inference on NVIDIA GPUs. NIM facilitates seamless AI inferencing with pre-optimized inference containers that operate out of the box, with the best possible latency and throughput on accelerated infrastructure while preserving the accuracy of predictions. Whether on-premises or in the cloud, NIM offers the following benefits:

  • Streamlines AI application development
  • Preconfigured containers for the latest generative AI models
  • Enterprise support with service-level agreements, and regular security updates for CVE
  • Support for latest community state-of-the-art LLMs
  • Cost efficiency and performance 

For this application, Amdocs used a self-hosted NVIDIA NIM instance to deploy finetuned LLMs. They exposed OpenAI-like API endpoints that enabled a uniform solution for their client application, which uses the LangChain ChatOpenAI client.

During finetuning exploration, Amdocs created a process that automates the deployment of the LoRA finetuned checkpoints with NIM. This process took about 20 minutes for the finetuned Mixtral-8x7B model. 

Results

Amdocs has seen multiple efficiencies with this process.

Accuracy improvements: The engagement with NVIDIA delivered noteworthy increases in the accuracy of AI-generated responses, improving the accuracy of responses by up to 30%. This type of improvement is pivotal for achieving widespread telco industry adoption and meeting the demands of direct-to-consumer generative AI services.

Using NVIDIA NIM, Amdocs achieved performance improvements in cost and latency.

Decreased costs to operate: Amdocs’ telecom retrieval-augmented generation (RAG) on NVIDIA infrastructure has enabled the reduction of tokens consumed for deployed use cases by as much as 60% in data preprocessing and 40% in inferencing, offering the same level of accuracy with a significantly lower cost per token, depending on various influences and volumes used.

Latency enhancements: The collaboration has successfully reduced query latency by approximately 80%, ensuring that end users experience near real-time responses. This acceleration enhances user experiences across commerce, care, operations, and beyond.

LLM accuracy evaluation

To evaluate performance across models and prompts on the test dataset during the finetuning phase, Amdocs used the high-level process in Figure 3.

Diagram shows the LoRA Finetuned LLM generating predictions, the LLM-as-a-Judge evaluating the predictions, and domain experts doing manual evaluation.
Figure 3. Evaluation process for LLMs including LLM-as-a-Judge and human experts

For each experiment, Amdocs first generated the LLM output predictions on the test dataset. 

Then, an external LLM-as-a-Judge was used to assess the predictions, providing metrics on accuracy and relevancy. Experiments meeting predefined criteria were subjected to automated regression tests to verify the accuracy of prediction details. The resulting score was a mix of metrics including the following:

  • F1 score
  • No Hallucinations Indicator
  • Accurate Conclusion Indicator
  • Answer Relevance
  • Dialog Coherence
  • No Fallback Indicator
  • Completeness
  • Toxicity

Finally, the best-performing models were evaluated manually to confirm overall accuracy. This process ensured that the finetuned LLMs were both effective and reliable. 

Figure 4 shows the overall accuracy score for different LLMs. Amdocs observed a 20–30% improvement in accuracy for the LoRA finetuned versions of Mixtral-8x7B and Llama2-13b-chat, respectively, compared to their base versions. When compared with a managed LLM service, the results also showed a 6% improvement in accuracy. 

Bar graph shows that Mixtral-8x7B-v01-LoRA achieved the highest score with 0.90 and Llama-2-13b-chat-base scored the lowest with 0.58.
Figure 4. Improvements for average score per model for top three performers

Token consumption

Reformatting the billing data resulted in a 60% reduction in input tokens. While the finetuned LLMs produced comparable or better performance, the models also led to approximately 40% additional savings in input tokens. This is attributed to the domain customization that minimized prompt instructions.

Figure 5 shows the comparison between token consumption of the Mixtral-8x7B, Llama2-13b, and managed LLM service. The difference in the number of input tokens is mainly due to the detailed instructions the managed LLM service required to perform well on the task.  For the domain-customized Llama2 and Mixtral-8x7B models, the reduction results from the continual context format improvement.

Bar graph shows that Mixtral-8x7B-v01-LoRA used the least average tokens with 2,217.28 and the managed LLM service used the most average tokens with 3807.91.
Figure 5. Token consumption for Mixtral-8x7B, Llama2, and a managed LLM service

LLM latency 

During the evaluation of the deployed models on A100 80GB GPUs using NVIDIA NIM, Amdocs observed on average 4-6x faster inference, or approximately 80%, better than the leading state-of-the-art managed LLM service. 

Figure 6 shows latency experiments performed using single-LLM calls and calculates the average latency for full generation cycles. All NIMs were deployed remotely, on DGX Cloud A100-powered instances. The Llama2-13b model was deployed on a single GPU, while Mixtral-8x7B was deployed on two GPUs. Response latency is more consistent when using self-hosted endpoints, as shown by the 0.95 Confidence Interval line shown in Figure 6. 

Bar graph shows that Mixtral-8x7B-v01-LoRA achieved the lowest average latency with 4.70 and the managed LLM service had the highest average latency with 33.34.
Figure 6. Average latency (in seconds) per model

Conclusion and next steps

NVIDIA NIM inference microservices improved latency, enabling faster processing within Amdocs’ applications. By optimizing data format and fine-tuning LLMs, Amdocs enhanced the accuracy of its billing Q&A system while significantly reducing its costs. Throughout this journey, Amdocs faced different challenges that required creative data reformatting, prompt engineering, and model-specific customizations. Defining a clear strategy for model evaluation and rigorous testing were key to their success.

Amdocs is taking the next step to create model customizations for different applications by using Multi-LoRA, a technique that enables dynamic loading of multiple model adaptations during inference. This approach optimizes memory usage, as only the base model is consistently loaded while model layer adaptations are dynamically loaded as needed.

By collaborating with NVIDIA, Amdocs kickstarted its strategy to integrate generative AI into its core portfolio, starting with identifying application areas, making generative AI capabilities more user-friendly through UX redesign, and prioritizing fast engineering. Amdocs will continue to use NVIDIA DGX Cloud and NVIDIA AI Enterprise software to customize LLMs with telco taxonomy to further increase accuracy and optimize the costs of generative AI training and inference.

Amdocs plans to continue the integration of generative AI into the amAIz platform in multiple strategic directions.

  • Enhancing customer query routing using AI-driven analysis of language and sentiment.
  • Enhancing the reasoning capabilities of their AI solutions to provide suggestions tailored to the specific needs of customers.
  • Address complex scenarios that demand extensive domain knowledge, multimodal, and multi-step solutions, such as network diagnostics and optimization.

These strategies will enable more efficient and effective operations and innovation.

For more information, watch the on-demand The Power of ‘What If?’: Delivering Business Value with Generative AI GTC session. 

Learn more about the amAIz platform and Amdocs generative AI solutions.

Get started with NVIDIA NIM to run and deploy the latest community-built generative AI models with APIs optimized and accelerated by NVIDIA.

Discuss (0)

Tags