
How Using a Reranking Microservice Can Improve Accuracy and Costs of Information Retrieval

Applications requiring high-performance information retrieval span a wide range of domains, including search engines, knowledge management systems, AI agents, and AI assistants. These systems demand retrieval processes that are accurate and computationally efficient to deliver precise insights, enhance user experiences, and maintain scalability. Retrieval-augmented generation (RAG) is used to enrich results, but its effectiveness is fundamentally tied to the precision of the underlying retrieval mechanisms.

The operational costs of RAG-based systems are driven by two primary factors: compute resources and the cost of inaccuracies resulting from suboptimal retrieval precision. Addressing these challenges requires optimizing retrieval pipelines without compromising performance. A reranking model can help improve retrieval accuracy and reduce overall expenses. However, despite the potential of reranking models, they have historically been underutilized due to concerns about added complexity and perceived marginal gains in information retrieval workflows.

In this post, we unveil significant performance advancements in the NVIDIA NeMo Retriever reranking model, demonstrating how it redefines the role of computing relevance scores in modern pipelines. Through detailed benchmarks, we’ll highlight the cost-performance trade-offs and showcase flexible configurations that cater to diverse applications, from lightweight implementations to enterprise-grade deployments.

What is a reranking model?

A reranking model, often referred to as a reranker or cross-encoder, is a model designed to compute a relevance score between two pieces of text. In the context of RAG, a reranking model evaluates the relevance of a passage to a given query. An approach that uses only an embedding model generates an independent semantic representation for each passage, one passage at a time, and then relies on a heuristic similarity metric (cosine similarity, for example) to determine relevance. A reranking model instead processes the query-passage pair together within the same model.

By analyzing the patterns, context, and shared information between the query and passage simultaneously, reranking models provide a more nuanced and accurate assessment of relevance. This makes cross-encoders more accurate at predicting relevance than using a heuristic score with an embedding model, making them a critical component for high-precision retrieval pipelines.

Graphic showing that embedding models generate a semantic representation of text that can then be used to calculate similarity by measuring the distance between two vectors. Reranking models implicitly generate a similarity score.
Figure 1. A high-level conceptual view of how the embedding model and reranking model calculate semantic similarity
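
To make the heuristic metric concrete, here is a minimal sketch of cosine similarity between a query embedding and passage embeddings. The three-dimensional vectors are made-up illustration values, not the output of any real embedding model, and the names `query_vec` and `passage_vecs` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values for illustration only).
query_vec = [1.0, 0.0, 1.0]
passage_vecs = {
    "passage_a": [1.0, 0.0, 1.0],  # points the same way as the query
    "passage_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Each passage was embedded independently of the query; relevance is
# only estimated afterwards by comparing the two vectors.
scores = {name: cosine_similarity(query_vec, vec) for name, vec in passage_vecs.items()}
```

A cross-encoder has no equivalent of this separate comparison step: it takes the query text and the passage text as one input and emits the relevance score directly.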

Generating a relevance score for every query-passage pair across an entire corpus using a cross-encoder is computationally expensive. To address this, cross-encoders are typically employed in a two-step process (Figure 2).

In the first step, an embedding model is used to create a semantic representation of the query, which is then used to narrow down potential candidates from millions to a smaller subset, typically tens of passages. In the second step, the cross-encoder model processes these shortlisted candidates, reranking them to produce a final, highly relevant set, often just five passages. This two-stage workflow balances efficiency and accuracy, making cross-encoders invaluable as reranking models.

Graphic showing use of the embedding model to select candidates from the entire vector database. These candidates are reranked by a reranking model to obtain the most relevant chunks.
Figure 2. General two-step workflow for using an embedding model and reranking model together in a RAG pipeline
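
The two-step workflow can be sketched as follows. This is an illustrative outline under stated assumptions, not the NeMo Retriever API: `toy_embed` and `toy_rerank_score` are hypothetical stand-ins (word counts and word overlap) for a real embedding model and a real cross-encoder, and the corpus is invented.

```python
from typing import Callable, Sequence

Embedder = Callable[[str], list[float]]
PairScorer = Callable[[str, str], float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query: str, corpus: Sequence[str], embed: Embedder,
                         rerank_score: PairScorer, k: int, n: int) -> list[str]:
    """Step 1: shortlist k candidates by embedding similarity.
    Step 2: rerank the shortlist with a pairwise scorer and keep the top n."""
    query_vec = embed(query)
    # A production system would look up precomputed passage vectors in a
    # vector database instead of embedding every passage per query.
    candidates = sorted(corpus, key=lambda p: dot(query_vec, embed(p)), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)
    return reranked[:n]


# --- Hypothetical stand-ins, for illustration only ---
VOCAB = ["gpu", "memory", "reranking", "cost", "weather"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]  # word-count vector


def toy_rerank_score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0  # word-overlap ratio


corpus = [
    "reranking lowers cost",
    "gpu memory and gpu memory bandwidth",
    "weather today",
    "reranking cost cost cost and other unrelated topics here",
]
top = retrieve_then_rerank("reranking cost", corpus, toy_embed, toy_rerank_score, k=2, n=1)
```

In this toy run, the first step ranks the long, repetitive passage highest, while the second step promotes the shorter, more focused passage. The stand-ins are far cruder than real models, but the control flow (wide shortlist of K, narrow final set of N) is the same.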

How can reranking models improve RAG?

The compute cost of running a large language model (LLM) is considerably higher than that of an embedding or reranking model. This cost scales directly with the number of tokens the LLM processes. A RAG system uses a retriever to fetch the top N chunks of relevant information (N typically ranges from 3 to 10), and then employs an LLM to generate an answer based on that information. Choosing the value of N involves a trade-off between cost and accuracy. A higher N improves the likelihood that the retriever includes the most relevant chunk of information, but it also raises the computational expense of the LLM step.

Retrievers typically rely on embedding models, but incorporating a reranking model into the pipeline offers three potential benefits:

  • Maximize accuracy while reducing the cost of running RAG just enough to offset the reranking model.
  • Maintain accuracy while considerably reducing the cost of running RAG.
  • Improve accuracy and reduce the cost of running RAG.

How can a reranking model achieve these outcomes? The key lies in the efficient use of the two-step retrieval process. Increasing the number of candidates reranked in the second step enhances accuracy. This also increases cost, but only marginally compared to the cost of the LLM. To put the magnitude into perspective: a Llama 3.1 8B model costs roughly 75x more to process five chunks and generate an answer than the NeMo Retriever Llama 3.2 reranking model, built with NVIDIA NIM microservices.

Reranking model stats

With the premise understood, this section dives into the performance benchmarks. Three quantities are needed to interpret the results that follow:

  • N_Base: The number of chunks a RAG pipeline passes to the LLM without a reranking model (the base case).
  • N_Reranked: The number of chunks a RAG pipeline passes to the LLM with a reranking model.
  • K: The number of candidates scored by the reranking model in Step 2.

With these three variables, we can formulate three equations that serve as the basis for all three scenarios:

  • Equation 1: N_Reranked <= N_Base
  • Equation 2: RAG_Savings = LLM_Cost(N_Base) - (Reranking_Cost(K) + LLM_Cost(N_Reranked))
  • Equation 3: Accuracy_Improvement = Reranking_Accuracy_Boost(K) + Accuracy(N_Reranked) - Accuracy(N_Base)
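
The three equations translate directly into code. The sketch below assumes a linear cost model (cost proportional to the number of chunks or candidates), which is consistent with LLM cost scaling with token count; the unit costs `cost_per_chunk` and `cost_per_candidate` are placeholder values chosen for illustration, not measured prices.

```python
from typing import Callable


def llm_cost(num_chunks: int, cost_per_chunk: float = 1.0) -> float:
    """Assumed linear model: LLM cost grows with the number of chunks."""
    return num_chunks * cost_per_chunk


def reranking_cost(k: int, cost_per_candidate: float = 0.01) -> float:
    """Assumed linear model: reranker cost grows with the candidates scored."""
    return k * cost_per_candidate


def rag_savings(n_base: int, n_reranked: int, k: int) -> float:
    """Equation 2, with Equation 1 enforced as a precondition."""
    if n_reranked > n_base:
        raise ValueError("Equation 1 requires N_Reranked <= N_Base")
    return llm_cost(n_base) - (reranking_cost(k) + llm_cost(n_reranked))


def accuracy_improvement(n_base: int, n_reranked: int, k: int,
                         accuracy: Callable[[int], float],
                         reranking_boost: Callable[[int], float]) -> float:
    """Equation 3. The accuracy curves are passed in by the caller because
    they must come from measurements on the actual pipeline."""
    return reranking_boost(k) + accuracy(n_reranked) - accuracy(n_base)


# With the placeholder costs above, dropping one chunk pays for
# reranking up to 100 candidates before savings reach zero.
example = rag_savings(n_base=5, n_reranked=4, k=40)
break_even = rag_savings(n_base=5, n_reranked=4, k=100)
```

With real per-token prices substituted for the placeholders, the same functions give the break-even K for any choice of N_Base and N_Reranked.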

Maximize accuracy while reducing the cost of running RAG just enough to offset the reranking model

The goal of this scenario is to maximize the accuracy improvement while bringing RAG savings to zero. For a given N_Base, this means maximizing Accuracy_Improvement in Equation 3 by making K and N_Reranked as large as possible, subject to Equation 1 and to setting RAG_Savings to 0 in Equation 2.

Plugging in the values from NVIDIA NIM gives the results summarized in Figure 3. Base Accuracy is the accuracy of the pipeline with N_Base chunks, and Improved Accuracy is the accuracy of the pipeline when using N_Base - 1 chunks and a reranking model.

Bar chart comparing Base Accuracy and Improved Accuracy showing that adding a reranking model improves accuracy across the board for a wide range of chunks for Llama 3.1 70B model.
Figure 3. The accuracy of retrieval systems with and without the reranking model for a RAG pipeline that uses a Llama 3.1 70B model

Maintaining accuracy while reducing the cost of running RAG

The goal of this scenario is to maximize cost savings without reducing accuracy. Consider Equation 2. To maximize RAG savings for a given N_Base, we need to minimize K and N_Reranked. To do this, set Accuracy_Improvement in Equation 3 to 0 and balance K and N_Reranked so that accuracy matches that of the pipeline working with N_Base chunks. Balancing these variables gives the results shown in Figure 4.

Bar chart showing that adding a reranking model reduces the cost of RAG by reducing the number of chunks.
Figure 4. Cost of running a RAG pipeline for a Llama 3.1 70B model that uses N_Base chunks can be reduced by reducing the number of chunks used and making up for the accuracy loss by using a reranking model

Improving accuracy and reducing the cost of running RAG

The previous two scenarios can be considered two extremes on a sliding scale. One extreme maximizes cost reduction, and the other maximizes the accuracy increase. Users can adjust the number of chunks passed to the LLM (N_Reranked) and the number of candidates reranked (K) to strike a balance between the two extremes.
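
One way to explore this scale is to enumerate (N_Reranked, K) configurations and pick the one that suits the goal. The sketch below is a toy model: the unit costs and the `accuracy` and `reranking_boost` curves are invented placeholders, not benchmark results, and would need to be replaced with measurements from a real pipeline.

```python
from dataclasses import dataclass
from itertools import product

# --- Placeholder inputs (assumptions, not measured values) ---
N_BASE = 5
K_GRID = [10, 20, 40, 80, 100]
LLM_COST_PER_CHUNK = 100          # arbitrary cost units
RERANK_COST_PER_CANDIDATE = 1     # arbitrary cost units


def accuracy(n: int) -> float:
    """Invented curve: accuracy rises with chunk count, with diminishing returns."""
    return n / (n + 1)


def reranking_boost(k: int) -> float:
    """Invented curve: accuracy gained by reranking k candidates."""
    return 0.001 * k


@dataclass(frozen=True)
class Config:
    n_reranked: int
    k: int
    savings: int          # RAG_Savings from Equation 2
    improvement: float    # Accuracy_Improvement from Equation 3


def evaluate(n_reranked: int, k: int) -> Config:
    savings = LLM_COST_PER_CHUNK * N_BASE - (
        RERANK_COST_PER_CANDIDATE * k + LLM_COST_PER_CHUNK * n_reranked
    )
    improvement = reranking_boost(k) + accuracy(n_reranked) - accuracy(N_BASE)
    return Config(n_reranked, k, savings, improvement)


# Equation 1 is enforced by only sweeping N_Reranked up to N_BASE.
configs = [evaluate(n, k) for n, k in product(range(1, N_BASE + 1), K_GRID)]

# Extreme 1: best accuracy among configurations that cost no more than the base case.
max_accuracy = max((c for c in configs if c.savings >= 0), key=lambda c: c.improvement)

# Extreme 2: largest savings among configurations that do not lose accuracy.
max_savings = max((c for c in configs if c.improvement >= 0), key=lambda c: c.savings)
```

Any configuration with non-negative savings and non-negative improvement lies between these two extremes; which one to choose depends on whether cost or accuracy matters more for the application.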

Upgrade your RAG system with NVIDIA NeMo Retriever

Reranking models are not just an optional enhancement, but a transformative addition to RAG pipelines, unlocking new levels of efficiency and precision. The NVIDIA NeMo Retriever reranking NIM microservices redefine the paradigm by delivering significant benefits across cost reduction and accuracy improvement. Benchmarks reveal a remarkable 21.54% cost savings. 

The flexibility of reranking model configurations enables developers to strike the ideal balance between cost efficiency and performance gains, catering to diverse use cases and scalability demands across any organization. The benefits are primarily driven by reducing the generation cost of RAG. That cost reduction is driven by reducing the number of input tokens that the LLM has to process to generate an answer.

These results challenge the outdated perception of reranking models as marginal improvements with added complexity, showcasing their essential role in optimizing modern machine learning workflows. 

To get started with this NeMo Retriever Llama 3.1 reranking NIM microservice and upgrade your RAG system today, try it on build.nvidia.com. You can also access the NVIDIA AI Blueprint for RAG as a starting point for building your own pipeline, using embedding and reranking models built with NVIDIA NIM.

Join us for NVIDIA GTC 2025 to explore the latest techniques for building retrieval pipelines and agentic workflows that can uncover fast, accurate insights within your data.
