Transforming Telco Network Operations Centers with NVIDIA NeMo Retriever and NVIDIA NIM

Telecom companies face the ongoing challenge of meeting the service level agreements (SLAs) that guarantee network quality of service for end customers. This includes quickly troubleshooting complex network device issues, identifying root causes, and resolving them efficiently at their network operations centers (NOCs).

Current network troubleshooting and repair processes are often time-consuming and error-prone, leading to prolonged network downtime that degrades operational efficiency and customer experience.

To address these issues, Infosys built a generative AI solution using NVIDIA NIM inference microservices and retrieval-augmented generation (RAG) for automated network troubleshooting. The solution streamlines NOC processes, minimizes network downtime, and optimizes network performance.

Building smart network operations centers with generative AI

Infosys is a global leader in next-generation digital services and consulting with over 300K employees around the world. The Infosys team built a smart NOC, a generative AI customer engagement platform designed for NOC operators, chief network officers (CNOs), network administrators, and IT support staff. 

The RAG-based solution uses an intelligent chatbot to support NOC staff with digitized product information for network equipment and assists with troubleshooting network issues by quickly providing essential, vendor-agnostic router commands for diagnostics and monitoring. This reduces mean time to resolution and enhances customer service.

Challenges with vector embeddings and document retrieval

Infosys faced several challenges when building the chatbot for the smart NOC. The first was balancing high accuracy against low latency in the underlying generative AI model: reaching the highest accuracy requires an extra pass to rerank retrieved vector embeddings, which adds latency to every user query.

In addition, network-specific taxonomy, constantly changing network device types and endpoints, and complex device documentation made it difficult to create a reliable, user-friendly solution.

Generating vector embeddings on CPUs is also time-consuming, especially for long-running ingestion jobs, which can introduce delays that degrade the user experience.

Finally, serving LLM inference through an external API added noticeable latency, inflating overall processing time and making inference a key target for optimization.

Data collection and preparation

To solve these challenges, Infosys built a vector database of network device-specific manuals and knowledge artifacts—such as training documents and troubleshooting guides—to ground contextual responses to user queries. The initial focus was on Cisco and Juniper Networks devices. Embeddings were created using embedding models, customized chunk sizes, and other fine-tuned parameters to populate the vector database.

Figure 1. Data preprocessing pipeline for a basic retrieval-augmented generation workflow
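
The following is a minimal sketch of this ingestion step using LangChain with a self-hosted NeMo Retriever embedding NIM. The endpoint URL, model identifier, and file path are illustrative assumptions, not values from the Infosys deployment:

```python
# Ingestion sketch: chunk a vendor manual, embed the chunks with a
# self-hosted NeMo Retriever embedding NIM, and index them in FAISS.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Chunking parameters matching the configuration in Table 2 (512/100).
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=100)

with open("router_troubleshooting_guide.txt") as f:  # hypothetical manual
    docs = splitter.create_documents([f.read()])

embedder = NVIDIAEmbeddings(
    base_url="http://localhost:8001/v1",      # assumed embedding NIM endpoint
    model="nvidia/nv-embedqa-mistral-7b-v2",  # assumed model identifier
)

vectorstore = FAISS.from_documents(docs, embedder)
vectorstore.save_local("noc_manuals_index")
```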

Solution architecture

Infosys balanced the following considerations and goals for their solution architecture:

  • User interface and chatbot: Develop an intuitive React interface for building customized chatbots tailored to NOC workflows, with advanced query scripting options, and display responses generated by NVIDIA NIM running a Llama 3 70B model.
  • Data configuration management: Provide flexible settings for chunking and embedding using NVIDIA NeMo Retriever embedding NIM (NV-Embed-QA-Mistral-7B). This enables users to define parameters like chunk size and overlap and select from various embedding models for optimal performance and control data ingestion.
  • Vector database options: Support a choice of vector databases, such as FAISS for high-speed retrieval, ensuring flexibility, efficiency, and consistent responsiveness.
  • Backend services and integration: Create robust backend services for chatbot management and configuration, including a RESTful API for integration with external systems, and ensure secure authentication and authorization.
  • Integration with NIM: Integrate NIM microservices to improve the accuracy and performance of inference while reducing its cost.
  • Configuration:
    • 10 NVIDIA A100 80-GB GPUs: eight running NIM and two running NeMo Retriever microservices
    • 128 CPU cores
    • 1 TB storage
  • Guardrails: Use NVIDIA NeMo Guardrails, an open-source toolkit for easily adding programmable guardrails to LLM-based conversational applications and protecting against vulnerabilities.
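
As a rough illustration of the guardrails item above, the following minimal sketch wraps an LLM with NeMo Guardrails; the configuration directory (holding the Colang flows and model settings) is a placeholder assumption:

```python
# Guardrails sketch: load a rails configuration and route user messages
# through it so prompts and responses are checked before reaching the user.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")  # assumed config dir
rails = LLMRails(config)

response = rails.generate(messages=[
    {"role": "user", "content": "How do I clear an interface error counter?"}
])
print(response["content"])
```
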
Figure 2. Workflow for a user prompting a generative AI chatbot and the backend RAG pipeline to provide a fast and accurate response

AI workflow with NVIDIA NIM and NeMo Guardrails

To build the smart NOC, Infosys used a self-hosted instance of NVIDIA NIM and NVIDIA NeMo to fine-tune and deploy foundation LLMs. The team used NIM to expose OpenAI-compatible API endpoints, providing a uniform interface for their client application.
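
Because the endpoints are OpenAI-compatible, the client application can talk to a self-hosted NIM with the standard OpenAI Python SDK. In this minimal sketch, the base URL, API key, and model identifier are illustrative assumptions:

```python
# Query a self-hosted NIM through its OpenAI-compatible chat endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",  # assumed NIM endpoint
                api_key="not-needed")                 # self-hosted: key unused

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",  # assumed NIM model identifier
    messages=[
        {"role": "system", "content": "You are a NOC troubleshooting assistant."},
        {"role": "user", "content": "Which command shows interface errors on a router?"},
    ],
    temperature=0.2,
)
print(completion.choices[0].message.content)
```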

Infosys used NeMo Retriever to power their vector database retrieval and reranking workflows. NeMo Retriever is a collection of microservices that present a single API for indexing and querying user data, enabling enterprises to seamlessly connect custom models to diverse business data and deliver highly accurate responses. For more information, see Translate Your Enterprise Data into Actionable Insights with NVIDIA NeMo Retriever.
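
The query side of that indexing-and-querying API can be sketched by reusing the FAISS index and embedder from the earlier ingestion example; the index name is the same assumed placeholder:

```python
# Query-side sketch: load the saved FAISS index and retrieve the manual
# chunks closest to an operator's question.
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.load_local(
    "noc_manuals_index", embedder, allow_dangerous_deserialization=True
)

hits = vectorstore.similarity_search("Why is the BGP session flapping?", k=4)
for doc in hits:
    print(doc.page_content[:120])
```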

Using NeMo Retriever, powered by the NV-Embed-QA-Mistral-7B NIM, Infosys achieved over 90% accuracy for text embeddings.

NV-Embed-QA-Mistral-7B ranks first on the Massive Text Embedding Benchmark (MTEB), excelling across 56 tasks, including retrieval and classification. This model’s innovative design enables NV-Embed to attend to latent vectors for better pooled embedding outputs and employs a two-stage instruction tuning method to enhance accuracy.

Figure 3. NV-Embed-QA-Mistral-7B embedding model performance

Infosys used the NeMo Retriever reranking NIM (Rerank-QA-Mistral-4B), which refines the retrieved context from the vector database with respect to the query. This step is crucial when retrieved contexts come from various datastores with differing similarity scores. The reranker is a fine-tuned version of the Mistral 7B model that uses 4B parameters, enhancing efficiency without sacrificing performance.
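
A minimal sketch of adding a reranking pass on top of the FAISS retriever from the earlier examples; the reranking NIM endpoint, model identifier, and top_n value are assumptions:

```python
# Reranking sketch: over-fetch candidates from FAISS, then let the
# reranking NIM reorder them by relevance to the query and keep the best.
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_nvidia_ai_endpoints import NVIDIARerank

reranker = NVIDIARerank(
    base_url="http://localhost:8002/v1",       # assumed reranking NIM endpoint
    model="nvidia/nv-rerankqa-mistral-4b-v3",  # assumed model identifier
    top_n=4,                                   # keep the 4 best passages
)

retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)
context_docs = retriever.invoke("BGP neighbor stuck in the Idle state")
```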

Figure 4. The nv-rerank-qa_v1 reranker model improves accuracy

Using the NV-Embed-QA-Mistral-7B model boosted accuracy by 19 percentage points over the baseline model (from 70% to 89%), leading to an overall improvement in response generation. The nv-rerank-qa_v1 reranking model improved accuracy by another 2 percentage points. Adding the NeMo Retriever reranking model to the RAG pipeline improved LLM response accuracy and relevance.

Results

Latency and accuracy are two key factors in evaluating LLM performance. Infosys measured both, comparing baseline models against models deployed with NVIDIA NIM.

LLM latency evaluation

Infosys measured LLM latency to compare results with and without using NVIDIA NIM (Table 1).

Without NIM, the LLM latency for Combo 1 was measured at 2.3 seconds. Using NIM to deploy a Llama 3 70B model with NeMo Retriever embedding and reranking microservices, the LLM latency achieved for Combo 5 was 0.9 seconds—an improvement of nearly 61% compared to the baseline model.

               Without NIM       With NIM
               Combo 1  Combo 2  Combo 3  Combo 4  Combo 5
Latency (sec)  2.3      1.9      1.1      1.3      0.9
Table 1. Latency comparison for LLMs
Figure 5. Latency comparison for five different LLMs
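
The post does not detail Infosys's measurement setup; one plausible harness for end-to-end latency, reusing the OpenAI-compatible client from the earlier sketch, looks like this:

```python
# Hypothetical latency harness: time full (non-streaming) completions
# against the OpenAI-compatible endpoint and average over several runs.
import time

def measure_latency(client, model: str, prompt: str) -> float:
    """Return wall-clock seconds for one complete chat completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

runs = [
    measure_latency(client, "meta/llama3-70b-instruct",
                    "Show the commands to check OSPF adjacency")
    for _ in range(5)
]
print(f"mean latency: {sum(runs) / len(runs):.2f} s")
```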

LLM accuracy evaluation

Infosys measured LLM accuracy for the smart NOC to compare results with and without NIM (Table 2).

Without NIM, Infosys achieved LLM accuracy of up to 85%; with NIM plus the NeMo Retriever embedding and reranking microservices, accuracy reached 92%, an absolute improvement of 22 percentage points over the 70% baseline model. This demonstrates the effectiveness of NVIDIA NIM in optimizing the accuracy of RAG systems, making it a valuable enhancement for achieving more accurate and reliable model outputs.

                           NIM OFF                                          NIM ON
                           Combo 1              Combo 2                     Combo 3                 Combo 4                 Combo 5
Framework                  LangChain            Llama-index                 LangChain               LangChain               LangChain
Chunk size, chunk overlap  512, 100             512, 100                    512, 100                512, 100                512, 100
Embedding model            All-mpnet-base-v     All-MiniLM-L6-v2            NV-Embed-QA-Mistral-7B  NV-Embed-QA-Mistral-7B  NV-Embed-QA-Mistral-7B
Rerank model               No                   No                          No                      nv-rerank-qa_v1         nv-rerank-qa_v1
TRT-LLM                    No                   No                          Yes                     Yes                     Yes
Triton                     No                   No                          Yes                     Yes                     Yes
Vector DB                  Faiss-CPU            Milvus                      Faiss-GPU               Faiss-GPU               Faiss-GPU
LLM                        Ollama (Mistral 7B)  Vertex AI (Cohere-command)  NIM LLM (Mistral-7B)    NIM LLM (Mistral-7B)    NIM LLM (Llama-3 70B)
Accuracy                   70%                  85%                         89%                     91%                     92%
Table 2. Accuracy comparison for generative AI models
Figure 6. Accuracy comparison for five different LLMs
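
The post does not specify how accuracy was scored. The harness below is purely hypothetical: it assumes a labeled set of NOC question-answer pairs, a rag_chain object wrapping the retrieval and generation steps above, and the embedder from the ingestion sketch, counting an answer as correct when its embedding is close to the reference answer:

```python
# Hypothetical accuracy harness: score RAG answers by embedding
# similarity to reference answers over a labeled evaluation set.
import numpy as np

eval_set = [  # illustrative examples, not Infosys's actual test set
    {"question": "Which command displays interface errors on a Cisco router?",
     "reference": "show interfaces"},
]

def cosine(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

correct = 0
for item in eval_set:
    answer = rag_chain.invoke(item["question"])  # assumed RAG pipeline object
    similarity = cosine(embedder.embed_query(answer),
                        embedder.embed_query(item["reference"]))
    correct += similarity >= 0.8                 # assumed match threshold

print(f"accuracy: {correct / len(eval_set):.0%}")
```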

Conclusion

By using NVIDIA NIM and NVIDIA NeMo Retriever microservices to deploy its smart NOC, Infosys lowered LLM latency by nearly 61% and improved accuracy by an absolute 22 percentage points. NeMo Retriever embedding and reranking microservices, deployed with NIM, enabled these gains through optimized model inference.

The integration of NeMo Retriever microservices for embedding and reranking significantly improved RAG relevance, accuracy, and performance. Reranking enhances contextual understanding, while optimized embeddings ensure accurate responses. Together, they improve user experience and operational efficiency in network operations centers, making them a crucial component for system optimization.

Learn how Infosys eliminates network downtime through automated workflows, powered by NVIDIA.

Get started deploying generative AI applications with NVIDIA NIM and NeMo Retriever NIM microservices. Explore more AI solutions for telecom operations.
