
Develop Multilingual and Cross-Lingual Information Retrieval Systems with Efficient Data Storage

Efficient text retrieval is critical for a broad range of information retrieval applications such as search, question answering, semantic textual similarity, summarization, and item recommendation. It also plays a pivotal role in retrieval-augmented generation (RAG), a technique that enables large language models (LLMs) to access external context without modifying underlying parameters.  

While RAG is highly effective at improving the quality of responses generated by LLMs, many embedding models still struggle to retrieve the correct data across multiple languages due to being trained on predominantly English datasets. This limits the generation of accurate and informative text responses in other languages, hindering effective communication with a global audience. 

Multilingual information retrieval enhances the factual accuracy and coherence of generated text and enables localized, context-aware responses that bridge language barriers and make information more accessible worldwide. This capability unlocks diverse applications across industries, from improving clinician-patient communication and troubleshooting technical issues to delivering personalized retail experiences.

However, creating such systems for large-scale data platforms comes with unique challenges, such as managing massive data volumes, ensuring low-latency retrieval, and maintaining high accuracy across diverse and multilingual datasets. 

This post explains how you can address these complexities and build powerful multilingual information retrieval systems using NVIDIA NeMo Retriever embedding and reranking microservices. Built on NVIDIA NIM, NeMo Retriever enables seamless AI application deployment across diverse data environments. It redefines what’s possible for handling large-scale, multilingual retrieval with exceptional accuracy, scalability, and responsiveness, transforming how global organizations interact with information.

NVIDIA NeMo Retriever is a collection of microservices that provide world-class information retrieval with high accuracy and data privacy, enabling enterprises to generate real-time business insights. 

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, simplifies the deployment of generative AI models across platforms, enabling teams to self-host LLMs while offering standard APIs for building applications. For more information, see NVIDIA NIM for Developers.

Multi-stage, multilingual information retrieval system requirements 

 Developing a multilingual information retrieval system involves integrating robust retrieval components capable of fetching data from a multilingual knowledge base. This retrieved data is then used to augment the generation process, ensuring accurate, context-aware responses. 

At the heart of information retrieval systems are embedding or dense retrieval models, which semantically encode queries and content (that is, passages or documents) into vector representations that capture their meaning.

In recent years, numerous dense embedding models of varying sizes and capabilities have been introduced (see the MTEB retrieval leaderboard). However, most of these models are limited in their ability to perform multilingual retrieval effectively. 

To build a multilingual RAG system, embedding models must support a wide range of languages, ensuring that queries and context from diverse linguistic sources are accurately embedded into a shared semantic space. 

For more advanced multilingual retrieval systems, a multi-stage multilingual retrieval pipeline may be necessary. This includes not only the dense retriever but also a reranking model that refines the results by ranking retrieved documents with greater accuracy across languages.
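To make the two stages concrete, here is a minimal sketch of a retrieve-then-rerank pipeline. The `embed` and `rerank_score` callables are placeholders for calls to an embedding and a reranking model (such as the NeMo Retriever microservices introduced below), not actual client code:

```python
# Minimal sketch of a two-stage retrieval pipeline: a dense retriever
# narrows the corpus to top-k candidates, then a reranker reorders them.
from typing import Callable
import numpy as np

def dense_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int) -> list[int]:
    # Cosine similarity reduces to a dot product when vectors are L2-normalized.
    scores = doc_vecs @ query_vec
    return list(np.argsort(-scores)[:k])

def retrieve_and_rerank(
    query: str,
    docs: list[str],
    embed: Callable[[str], np.ndarray],
    rerank_score: Callable[[str, str], float],
    k: int = 20,
    top_n: int = 5,
) -> list[str]:
    doc_vecs = np.stack([embed(d) for d in docs])  # index the corpus (offline in practice)
    candidates = dense_retrieve(embed(query), doc_vecs, k)  # stage 1: fast dense recall
    # Stage 2: the slower, more accurate reranker scores each query-candidate pair.
    ranked = sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in ranked[:top_n]]
```

The dense stage is cheap enough to scan the whole index, while the reranker, which sees the query and passage together, is applied only to the short candidate list.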

Revolutionizing data platforms with NVIDIA NeMo Retriever  

Recognizing the challenges and requirements of building these pipelines, NVIDIA introduced two new community-based NeMo Retriever microservices for world-class multilingual and cross-lingual text retrieval that are built on NVIDIA NIM.  

In addition to enabling multilingual and cross-lingual question-answering retrieval, the new multilingual models also address critical challenges in storage, performance, and adaptability for data platforms with efficiency and scale.

The following techniques enable more data to be stored in the vector database, enhancing real-time retrieval and generation capabilities:

  • Long-context support: Processes and understands extensive documents with support for contexts of up to 8192 tokens, improving data handling.
  • Dynamic embedding sizing: Offers flexible embedding sizes to optimize storage and retrieval processes, reducing dimensions while maintaining accuracy (see the truncation sketch after this list).
  • Storage efficiency: Reduces embedding dimensions to 384 and extends context length, cutting storage volume by 35x, enabling larger knowledge bases to fit on a single server.
  • Performance optimization: Combines long-context support with reduced embedding dimensions to deliver high accuracy while maintaining exceptional storage efficiency.
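As referenced in the list, dynamic embedding sizing is typically implemented (as in Matryoshka representation learning) by truncating an embedding to its leading dimensions and renormalizing. The following is a minimal sketch; the 2048-float full size is an assumption for illustration, while the 384-dimension target comes from the list above:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 384) -> np.ndarray:
    """Keep the leading `dim` components and re-normalize so that
    cosine similarity stays meaningful at the reduced size."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(0).standard_normal(2048)  # assumed full-size embedding
small = truncate_embedding(full)                       # 384 floats per chunk
print(small.shape)  # (384,)
```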
Figure 1. Impact of llama-3.2-nv-embedqa-1b-v2 on vector storage volume: 4096-token document chunks with a reduced embedding dimension versus 300-token chunks, combining long-context support, dynamic embeddings, and efficient storage for high-performance, scalable data processing.

Figure 1 shows a 35x reduction in storage footprint, achieved through dynamic embedding sizing and support for longer token lengths, making it feasible to handle large-scale datasets efficiently. This advancement is particularly beneficial for on-premises customers who cannot rely on cloud autoscaling, enabling them to store and retrieve more data accurately and efficiently.
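The 35x figure can be roughly reproduced with back-of-the-envelope arithmetic. Assume a baseline index of 1024-dimensional embeddings over 300-token chunks (an illustrative assumption; the baseline configuration is not stated here) versus 384-dimensional embeddings over 4096-token chunks:

```python
# Back-of-the-envelope check of the ~35x storage reduction in Figure 1.
# The 1024-dim / 300-token baseline is assumed for illustration only.
corpus_tokens = 1_000_000_000  # hypothetical corpus size; it cancels out

def index_floats(chunk_tokens: int, dim: int) -> float:
    """Floats stored in the vector index for the whole corpus."""
    return (corpus_tokens / chunk_tokens) * dim

baseline = index_floats(chunk_tokens=300, dim=1024)
optimized = index_floats(chunk_tokens=4096, dim=384)
print(f"{baseline / optimized:.1f}x")  # ~36x, in line with Figure 1
```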

Multilingual, cross-lingual text retrieval benchmarks with optimized embedding and reranking models 

So how did we optimize these embedding and reranking models for multilingual and cross-lingual text question-answering retrieval tasks? 

  • Adapted meta-llama/Llama-3.2-1B, a decoder-only model, as the base model and converted it into an encoder model. The base Llama-3.2-1B model officially supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, and was trained on a broader collection of languages beyond these eight. 
  • Modified its self-attention mechanism from unidirectional (causal) to bidirectional so that each token can attend to tokens on both its left and right (illustrated in the mask sketch after this list). 
  • Improved the base Llama-3.2-1B model’s existing multilingual capability by fine-tuning it with an internally curated blend of publicly available English and multilingual datasets. 
  • Fine-tuned both the embedding and reranking models with contrastive learning, using hard negatives mined with positive-aware hard-negative mining methods (sketched after the next paragraph). For more information, see NV-Retriever: Improving text embedding models with effective hard-negative mining.
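
The causal-to-bidirectional change can be pictured with attention masks. The following PyTorch sketch is purely illustrative of the masking difference; it is not the actual model surgery applied to Llama-3.2-1B:

```python
import torch

seq_len = 6
logits = torch.randn(seq_len, seq_len)  # stand-in for scaled Q·Kᵀ attention logits

# Causal (decoder-style) mask: token i may attend only to tokens j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_attn = logits.masked_fill(~causal_mask, float("-inf")).softmax(dim=-1)

# Bidirectional (encoder-style) attention: every token attends to all tokens,
# so each token's embedding can draw on both left and right context.
bidirectional_attn = logits.softmax(dim=-1)
```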

With the introduction of two new 1B-parameter retriever models, NVIDIA NeMo delivers a balance between high accuracy in multilingual retrieval and efficient indexing throughput with low serving latency.
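
The contrastive objective from the list above can be written as an InfoNCE-style loss over one positive passage and a set of mined hard negatives. This is a standard formulation, sketched here for illustration; the actual training recipe is detailed in the NV-Retriever paper:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_negs, temperature=0.05):
    """InfoNCE-style loss: pull the query embedding toward its positive
    passage and away from mined hard negatives.
    Shapes: q and pos are (d,), hard_negs is (n, d)."""
    candidates = torch.cat([pos.unsqueeze(0), hard_negs], dim=0)     # (1 + n, d)
    logits = F.cosine_similarity(q.unsqueeze(0), candidates) / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))  # positive at index 0

d = 384
q, pos = torch.randn(d), torch.randn(d)
hard_negs = torch.randn(4, d)  # passages mined as similar-but-incorrect
print(contrastive_loss(q, pos, hard_negs))
```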

We evaluated our 1B-parameter retriever models on 18 MIRACL dev sets, 11 translated language datasets, and 49 cross-lingual MLQA datasets. All the models presented in the bar charts are evaluated on the same infrastructure and datasets. We subsampled MIRACL dev datasets for faster evaluation. Figure 2 shows that the NVIDIA Llama 3.2 embedding and reranking models excel in retrieval accuracy (measured by Recall@5), and even more so when they are combined into a multi-stage retrieval system.
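Recall@5, the metric used throughout these benchmarks, measures the fraction of relevant passages that appear in a query's top five results. A minimal sketch:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant passages found in the top-k retrieved results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# One query with two relevant passages, one of which is retrieved in the top 5.
print(recall_at_k(["d7", "d2", "d9", "d4", "d1"], {"d2", "d8"}))  # 0.5
```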

Figure 2. Accuracy performance comparison of NeMo Retriever embedding microservices versus alternative embedders on 18 MIRACL dev sets and 11 translated datasets (measured by Recall@5). The far right bar is generated from a multi-stage llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 retrieval system.

Figure 3 shows that both the NVIDIA Llama 3.2 1B embedding and Llama 3.2 1B reranking models demonstrate superior accuracy, setting new state-of-the-art results on multilingual and cross-lingual text retrieval benchmarks.

Figure 3. Accuracy performance comparison of NeMo Retriever embedding microservices versus alternative embedders on the 42 MLQA cross-lingual test datasets (measured by Recall@5). The far right bar is generated from a multi-stage llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 retrieval system.

In addition to the multilingual and cross-lingual capabilities of the NVIDIA Llama 3.2 1B embedding and reranking models, Figure 4 shows that all the NVIDIA models also generate more accurate retrieval results than alternatives on English-only text question-answering benchmark datasets. The models were evaluated against open and commercial retriever models on academic question-answering benchmarks: NQ, HotpotQA, and FiQA (Finance Q&A) from the BEIR benchmark, as well as the TechQA dataset.

Figure 4. Accuracy performance comparison of NeMo Retriever embedding microservices versus alternative embedders on question-answering datasets FiQA, NQ, and HotpotQA from BEIR and TechQA (measured by Recall@5). The far right bar is generated from a multi-stage llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 retrieval system.

To access performance benchmarks of all the microservices, see the Benchmarks section in the NVIDIA NeMo Retriever documentation.

Get started developing world-class information retrieval pipelines

To build a scalable, world-class information retrieval system using the NeMo Retriever microservices, visit the NVIDIA API Catalog, our hosted environment. There, you can access a collection of retrieval microservices that enable organizations to seamlessly connect custom models to diverse business data and deliver highly accurate responses. The collection includes llama-3.2-nv-embedqa-1b-v2 and llama-3.2-nv-rerankqa-1b-v2.
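As a starting point, the hosted embedding model can be called through an OpenAI-compatible API. The endpoint, model ID, and the input_type and truncate fields below follow the pattern used by other NeMo Retriever embedding NIMs on the API Catalog; verify the exact snippet on the model page before use:

```python
# Hedged sketch of calling the hosted embedding NIM through its
# OpenAI-compatible API; check the API Catalog model page for the
# authoritative endpoint and request fields.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # key issued by the NVIDIA API Catalog
)

response = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input=["¿Cuál es la capital de Francia?"],  # queries in any supported language
    extra_body={"input_type": "query", "truncate": "NONE"},
)
print(len(response.data[0].embedding))
```

Passages indexed into the vector database would use input_type "passage" instead of "query", so that queries and documents are embedded consistently into the shared semantic space.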

NVIDIA Developer Program members can access NIM for free for research, development, and testing on their preferred infrastructure. You'll be prompted to enter a personal or business email address to access different options for building with NIM.

You can also explore the NVIDIA generative AI examples on GitHub to learn how to integrate these microservices and write sample applications. Get a free hands-on NVIDIA LaunchPad lab for NeMo Retriever to try out the microservices and unlock enterprise data, or a RAG lab to build AI chatbots.
