
Develop Multilingual and Cross-Lingual Information Retrieval Systems with Efficient Data Storage

Efficient text retrieval is critical for a broad range of information retrieval applications such as search, question answering, semantic textual similarity, summarization, and item recommendation. It also plays a pivotal role in retrieval-augmented generation (RAG), a technique that enables large language models (LLMs) to access external context without modifying underlying parameters.  

While RAG is highly effective at improving the quality of responses generated by LLMs, many embedding models still struggle to retrieve the correct data across multiple languages due to being trained on predominantly English datasets. This limits the generation of accurate and informative text responses in other languages, hindering effective communication with a global audience. 

Multilingual information retrieval enhances the factual accuracy and coherence of generated text and enables localized, context-aware responses that bridge language barriers and make information more accessible worldwide. This capability unlocks diverse applications across industries, from improving clinician-patient communication and troubleshooting technical issues to delivering personalized retail experiences.

However, creating such systems for large-scale data platforms comes with unique challenges, such as managing massive data volumes, ensuring low-latency retrieval, and maintaining high accuracy across diverse and multilingual datasets. 

This post explains how you can address these complexities and build powerful multilingual information retrieval systems using NVIDIA NeMo Retriever embedding and reranking microservices. Built on NVIDIA NIM, NeMo Retriever enables seamless AI application deployment across diverse data environments. It redefines what’s possible for handling large-scale, multilingual retrieval with exceptional accuracy, scalability, and responsiveness, transforming how global organizations interact with information.

NVIDIA NeMo Retriever is a collection of microservices that provide world-class information retrieval with high accuracy and data privacy, enabling enterprises to generate real-time business insights. 

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, simplifies the deployment of generative AI models across platforms, enabling teams to self-host LLMs while offering standard APIs for building applications. For more information, see NVIDIA NIM for Developers.

Multi-stage, multilingual information retrieval system requirements 

 Developing a multilingual information retrieval system involves integrating robust retrieval components capable of fetching data from a multilingual knowledge base. This retrieved data is then used to augment the generation process, ensuring accurate, context-aware responses. 

At the heart of information retrieval systems are embedding or dense retrieval models, which semantically encode queries and content (that is, passages or documents) into vector representations that capture their meaning.

In recent years, numerous dense embedding models of varying sizes and capabilities have been introduced (see the MTEB retrieval leaderboard). However, most of these models are limited in their ability to perform multilingual retrieval effectively. 

To build a multilingual RAG system, embedding models must support a wide range of languages, ensuring that queries and context from diverse linguistic sources are accurately embedded into a shared semantic space. 

For more advanced multilingual retrieval systems, a multi-stage multilingual retrieval pipeline may be necessary. This includes not only the dense retriever but also a reranking model that refines the results by ranking retrieved documents with greater accuracy across languages.
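To make the two stages concrete, here is a minimal sketch of a retrieve-then-rerank pipeline. The `embed` and `rerank_score` callables are placeholders for calls to an embedding and a reranking model (such as the NeMo Retriever microservices introduced below), not actual client code:

```python
# Minimal sketch of a two-stage retrieval pipeline: a dense retriever
# narrows the corpus to top-k candidates, then a reranker reorders them.
from typing import Callable
import numpy as np

def dense_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int) -> list[int]:
    # Cosine similarity reduces to a dot product when vectors are L2-normalized.
    scores = doc_vecs @ query_vec
    return list(np.argsort(-scores)[:k])

def retrieve_and_rerank(
    query: str,
    docs: list[str],
    embed: Callable[[str], np.ndarray],
    rerank_score: Callable[[str, str], float],
    k: int = 20,
    top_n: int = 5,
) -> list[str]:
    doc_vecs = np.stack([embed(d) for d in docs])  # index the corpus (offline in practice)
    candidates = dense_retrieve(embed(query), doc_vecs, k)  # stage 1: fast dense recall
    # Stage 2: the slower, more accurate reranker scores each query-candidate pair.
    ranked = sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in ranked[:top_n]]
```

The dense stage is cheap enough to scan the whole index, while the reranker, which sees the query and passage together, is applied only to the short candidate list.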

Revolutionizing data platforms with NVIDIA NeMo Retriever  

Recognizing the challenges and requirements of building these pipelines, NVIDIA introduced two new community-based NeMo Retriever microservices for world-class multilingual and cross-lingual text retrieval that are built on NVIDIA NIM.  

In addition to enabling multilingual and cross-lingual question-answering retrieval, the new multilingual models also address critical challenges in storage, performance, and adaptability for data platforms with efficiency and scale.

The following techniques enable more data to be stored in the vector database, enhancing real-time retrieval and generation capabilities:

  • Long-context support: Processes and understands extensive documents with support for contexts of up to 8192 tokens, improving data handling.
  • Dynamic embedding sizing: Offers flexible embedding sizes to optimize storage and retrieval processes, reducing dimensions while maintaining accuracy (see the truncation sketch after this list).
  • Storage efficiency: Reduces embedding dimensions to 384 and extends context length, cutting storage volume by 35x, enabling larger knowledge bases to fit on a single server.
  • Performance optimization: Combines long-context support with reduced embedding dimensions to deliver high accuracy while maintaining exceptional storage efficiency.
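As referenced in the list, dynamic embedding sizing is typically implemented (as in Matryoshka representation learning) by truncating an embedding to its leading dimensions and renormalizing. The following is a minimal sketch; the 2048-float full size is an assumption for illustration, while the 384-dimension target comes from the list above:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 384) -> np.ndarray:
    """Keep the leading `dim` components and re-normalize so that
    cosine similarity stays meaningful at the reduced size."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(0).standard_normal(2048)  # assumed full-size embedding
small = truncate_embedding(full)                       # 384 floats per chunk
print(small.shape)  # (384,)
```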
Figure 1. Impact of llama-3.2-nv-embedqa-1b-v2 on vector storage volume: 4096-token document chunks with a reduced embedding dimension versus 300-token chunks, combining long-context support, dynamic embeddings, and efficient storage for high-performance, scalable data processing.

Figure 1 shows a 35x reduction in storage footprint, achieved through dynamic embedding sizing and support for longer token lengths, making it feasible to handle large-scale datasets efficiently. This advancement is particularly beneficial for on-premises customers who cannot rely on cloud autoscaling, enabling them to store and retrieve more data accurately and efficiently.
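The 35x figure can be roughly reproduced with back-of-the-envelope arithmetic. Assume a baseline index of 1024-dimensional embeddings over 300-token chunks (an illustrative assumption; the baseline configuration is not stated here) versus 384-dimensional embeddings over 4096-token chunks:

```python
# Back-of-the-envelope check of the ~35x storage reduction in Figure 1.
# The 1024-dim / 300-token baseline is assumed for illustration only.
corpus_tokens = 1_000_000_000  # hypothetical corpus size; it cancels out

def index_floats(chunk_tokens: int, dim: int) -> float:
    """Floats stored in the vector index for the whole corpus."""
    return (corpus_tokens / chunk_tokens) * dim

baseline = index_floats(chunk_tokens=300, dim=1024)
optimized = index_floats(chunk_tokens=4096, dim=384)
print(f"{baseline / optimized:.1f}x")  # ~36x, in line with Figure 1
```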

Multilingual, cross-lingual text retrieval benchmarks with optimized embedding and reranking models 

So how did we optimize these embedding and reranking models for multilingual and cross-lingual text question-answering retrieval tasks? 

  • Adapted meta-llama/Llama-3.2-1B, a decoder-only model, as the base model and converted it into an encoder model. The base Llama-3.2-1B model officially supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, and was trained on a broader collection of languages beyond these eight. 
  • Modified its self-attention mechanism from unidirectional (causal) to bidirectional so that each token can attend to tokens on both its left and right (illustrated in the mask sketch after this list). 
  • Improved the base Llama-3.2-1B model’s existing multilingual capability by fine-tuning it with an internally curated blend of publicly available English and multilingual datasets. 
  • Fine-tuned both the embedding and reranking models with contrastive learning, using hard negatives mined with positive-aware hard-negative mining methods (sketched after the next paragraph). For more information, see NV-Retriever: Improving text embedding models with effective hard-negative mining.
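
The causal-to-bidirectional change can be pictured with attention masks. The following PyTorch sketch is purely illustrative of the masking difference; it is not the actual model surgery applied to Llama-3.2-1B:

```python
import torch

seq_len = 6
logits = torch.randn(seq_len, seq_len)  # stand-in for scaled Q·Kᵀ attention logits

# Causal (decoder-style) mask: token i may attend only to tokens j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_attn = logits.masked_fill(~causal_mask, float("-inf")).softmax(dim=-1)

# Bidirectional (encoder-style) attention: every token attends to all tokens,
# so each token's embedding can draw on both left and right context.
bidirectional_attn = logits.softmax(dim=-1)
```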

With the introduction of two new 1B-parameter retriever models, NVIDIA NeMo delivers a balance between high accuracy in multilingual retrieval and efficient indexing throughput with low serving latency.
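
The contrastive objective from the list above can be written as an InfoNCE-style loss over one positive passage and a set of mined hard negatives. This is a standard formulation, sketched here for illustration; the actual training recipe is detailed in the NV-Retriever paper:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_negs, temperature=0.05):
    """InfoNCE-style loss: pull the query embedding toward its positive
    passage and away from mined hard negatives.
    Shapes: q and pos are (d,), hard_negs is (n, d)."""
    candidates = torch.cat([pos.unsqueeze(0), hard_negs], dim=0)     # (1 + n, d)
    logits = F.cosine_similarity(q.unsqueeze(0), candidates) / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))  # positive at index 0

d = 384
q, pos = torch.randn(d), torch.randn(d)
hard_negs = torch.randn(4, d)  # passages mined as similar-but-incorrect
print(contrastive_loss(q, pos, hard_negs))
```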

We evaluated our 1B-parameter retriever models on 18 MIRACL dev sets, 11 translated language datasets, and 49 cross-lingual MLQA datasets. All the models presented in the bar charts are evaluated on the same infrastructure and datasets. We subsampled MIRACL dev datasets for faster evaluation. Figure 2 shows that the NVIDIA Llama 3.2 embedding and reranking models excel in retrieval accuracy (measured by Recall@5), and even more so when they are combined into a multi-stage retrieval system.
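Recall@5, the metric used throughout these benchmarks, measures the fraction of relevant passages that appear in a query's top five results. A minimal sketch:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant passages found in the top-k retrieved results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# One query with two relevant passages, one of which is retrieved in the top 5.
print(recall_at_k(["d7", "d2", "d9", "d4", "d1"], {"d2", "d8"}))  # 0.5
```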

Figure 2. Accuracy performance comparison of NeMo Retriever embedding microservices versus alternative embedders on 18 MIRACL dev sets and 11 translated datasets (measured by Recall@5). The far right bar is generated from a multi-stage llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 retrieval system.

Figure 3 shows that both the NVIDIA Llama 3.2 1B embedding and Llama 3.2 1B reranking models demonstrate superior accuracy, setting new state-of-the-art results on multilingual and cross-lingual text retrieval benchmarks.

Figure 3. Accuracy performance comparison of NeMo Retriever embedding microservices versus alternative embedders on the 42 MLQA cross-lingual test datasets (measured by Recall@5). The far right bar is generated from a multi-stage llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 retrieval system.

In addition to the multilingual and cross-lingual capabilities of the NVIDIA Llama 3.2 1B embedding and reranking models, Figure 4 shows that all the NVIDIA models also generate more accurate retrieval results than alternatives on English-only text question-answering benchmark datasets. The models were evaluated against open and commercial retriever models on academic question-answering benchmarks: NQ, HotpotQA, and FiQA (Finance Q&A) from the BEIR benchmark, as well as the TechQA dataset.

Figure 4. Accuracy performance comparison of NeMo Retriever embedding microservices versus alternative embedders on question-answering datasets FiQA, NQ, and HotpotQA from BEIR and TechQA (measured by Recall@5). The far right bar is generated from a multi-stage llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 retrieval system.

To access performance benchmarks of all the microservices, see the Benchmarks section in the NVIDIA NeMo Retriever documentation.

Get started developing world-class information retrieval pipelines

To build a scalable, world-class information retrieval system using the NeMo Retriever microservices, visit the NVIDIA API Catalog, our hosted environment. There, you can access a collection of retrieval microservices that enable organizations to seamlessly connect custom models to diverse business data and deliver highly accurate responses. The collection includes llama-3.2-nv-embedqa-1b-v2 and llama-3.2-nv-rerankqa-1b-v2.
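As a starting point, the hosted embedding model can be called through an OpenAI-compatible API. The endpoint, model ID, and the input_type and truncate fields below follow the pattern used by other NeMo Retriever embedding NIMs on the API Catalog; verify the exact snippet on the model page before use:

```python
# Hedged sketch of calling the hosted embedding NIM through its
# OpenAI-compatible API; check the API Catalog model page for the
# authoritative endpoint and request fields.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # key issued by the NVIDIA API Catalog
)

response = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input=["¿Cuál es la capital de Francia?"],  # queries in any supported language
    extra_body={"input_type": "query", "truncate": "NONE"},
)
print(len(response.data[0].embedding))
```

Passages indexed into the vector database would use input_type "passage" instead of "query", so that queries and documents are embedded consistently into the shared semantic space.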

NVIDIA Developer Program members can access NIM for free for research, development, and testing on their preferred infrastructure. You'll be prompted to enter a personal or business email address to access different options for building with NIM.

You can also explore the NVIDIA generative AI examples on GitHub to learn how to integrate these microservices and write sample applications. Get a free hands-on NVIDIA LaunchPad lab for NeMo Retriever to try out the microservices and unlock enterprise data, or a RAG lab to build AI chatbots.
