A key challenge for retrieval-augmented generation (RAG) systems is handling user queries that lack explicit clarity or carry implicit intent. Users often phrase questions imprecisely. For instance, consider the user query, “Tell me about the latest update in NVIDIA NeMo model training.” The user may be implicitly interested in advancements in NeMo large language model (LLM) customization features, rather than its speech models. However, this preference is not expressed explicitly, which can lead to suboptimal results.
Overcoming these limitations and unlocking the true potential of RAG requires moving beyond basic techniques. This post introduces the AI reasoning capabilities of NVIDIA Nemotron LLMs that significantly enhance RAG pipelines. We walk through a real-life example of how we applied advanced strategies, such as query analysis and rewriting, to refine the search capabilities of a query engine.
What is query rewriting in RAG?
Query rewriting in RAG is a crucial step that transforms a user’s initial prompt into a more optimized query to improve information retrieval. This process is vital for boosting RAG performance because it bridges the semantic gap between how a user asks a question and how information is structured in the knowledge base. By refining the query, the system can overcome issues like vagueness or excessive complexity, leading to the retrieval of more precise and relevant documents. This higher-quality context directly enables the language model to generate more accurate, comprehensive, and factually grounded answers.
Several techniques have emerged for effective query rewriting, particularly leveraging LLMs:
- Q2E (Query2Expand): Generates semantically equivalent queries or expansions that cover different ways the user’s information might be expressed, increasing the chance of retrieving relevant documents.
- Q2D (Query2Doc): Constructs a pseudo-document from the original query, reflecting the style and content of retrieval passages. This improves alignment with how information is stored in the corpus.
- CoT (chain-of-thought) query rewriting: Uses a prompt instructing the LLM to provide a step-by-step rationale, breaking down the original query and elaborating on related context before giving the expanded query. Unlike rewriting the query directly, this prompting style generates verbose, logical explanations that tend to include a broad set of relevant keywords naturally embedded in the reasoning.
By employing such techniques, RAG systems can restructure poorly formed questions, introduce vital keywords, and anchor user queries more closely to the semantics of the corpus—substantially elevating both search and answer quality.
To incorporate query rewriting into RAG, the prompt must be tailored to the specific use case. Check out a few sample prompts for each method:
Q2E prompt
Your task is to brainstorm a set of useful search terms and related key phrases that could help locate information
about the following question. Focus on capturing alternate expressions, synonyms, and specific entities or events
mentioned in the query.
Original Question: {query}
Related Search Keywords:
Q2D prompt
Imagine you are composing a short informative article that directly addresses a given question. Write a detailed passage
that would help someone fully understand the subject or find an answer to the query.
Query: {query}
Passage:
CoT query rewriting prompt
Please carefully consider the following question. First, break down what the question is asking and think through any
relevant facts, possible interpretations, or required background knowledge. Then, list out important words, concepts, or
phrases that emerge from your reasoning process, which could help retrieve detailed answers.
Question: {query}
Your step-by-step reasoning and expansion terms:
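Each template above is an ordinary format string, so wiring one into a pipeline amounts to filling in {query} and calling the model. Below is a minimal sketch using an OpenAI-compatible client; the endpoint and model identifier are assumptions based on the NVIDIA API Catalog, so substitute the details of your own deployment.
```python
# Minimal sketch: apply a rewrite template with an OpenAI-compatible client.
# The base_url and model ID are assumptions; adjust for your deployment.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed API Catalog endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

Q2E_PROMPT = """\
Your task is to brainstorm a set of useful search terms and related key phrases
that could help locate information about the following question. Focus on capturing
alternate expressions, synonyms, and specific entities or events mentioned in the query.

Original Question: {query}

Related Search Keywords:"""

def rewrite_query(query: str, template: str = Q2E_PROMPT) -> str:
    """Fill a rewrite template with the user query and return the expansion."""
    response = client.chat.completions.create(
        model="nvidia/llama-3.3-nemotron-super-49b-v1",  # assumed model ID
        messages=[{"role": "user", "content": template.format(query=query)}],
        temperature=0.2,
    )
    return response.choices[0].message.content

print(rewrite_query("Tell me about the latest update in NVIDIA NeMo model training."))
```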
How do NVIDIA Nemotron models advance RAG?
The NVIDIA Nemotron family of reasoning and multimodal models is an open suite of LLMs optimized for efficiency, performance, and advanced applications like RAG and agentic systems, designed to deliver strong reasoning capabilities and flexible deployment for enterprise AI agents. Available in Nano, Super, and Ultra sizes, these models combine the Meta Llama architecture with NVIDIA’s extensive post-training techniques to achieve top accuracy on industry benchmarks.
Among the Nemotron family, we found that the Llama 3.3 Nemotron Super 49B v1 model best suits this use case, balancing inference latency against the reasoning ability the task requires. Results on the Natural Questions (NQ) dataset show that query rewriting significantly improves retrieval accuracy (Table 1). Accuracy@K is the fraction of questions for which a correct answer appears in the top-K retrieved passages.
| NQ (Natural Questions) dataset | Accuracy@10 | Accuracy@20 |
| --- | --- | --- |
| Original query | 43.1% | 58.3% |
| CoT query rewriting with Llama 3.3 Nemotron Super 49B v1 | 63.8% | 74.7% |

*Table 1. Retrieval accuracy on the NQ dataset with and without CoT query rewriting*
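For reference, Accuracy@K can be computed directly from retrieval runs. Here is a minimal sketch, assuming each question carries its ranked passages plus a set of acceptable answer strings (the official NQ matching rules may differ in detail):
```python
def accuracy_at_k(results, k: int) -> float:
    """Fraction of questions with a correct answer in the top-k passages.

    results: list of (passages, gold_answers) pairs, where passages is
    ranked best-first and gold_answers is a set of acceptable strings.
    """
    hits = sum(
        1
        for passages, gold_answers in results
        if any(ans.lower() in p.lower() for p in passages[:k] for ans in gold_answers)
    )
    return hits / len(results)
```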
Architecture for RAG pipeline with Llama Nemotron
Figure 1 shows the architecture of an enhanced RAG process with Llama 3.3 Nemotron Super 49B v1.

In the architecture, the Llama Nemotron model is used as a query extractor with the following functions:
- Analyze the user query to extract the core query. This step strips unnecessary and distracting phrases that would otherwise degrade the retrieval result.
- Analyze the user query to extract any filtering or ranking criteria. Extracted filtering criteria can drive hybrid retrieval search or serve as input to a reranking model for qualitative filtering, while extracted ranking criteria let users rank results by attributes other than relevance.
- Expand the core query by adding related contextual information. This process can include techniques such as generating paraphrases, breaking complex queries into sub-questions, or appending background context. Expanding queries in this manner is beneficial as it improves recall and retrieval accuracy, especially when user queries are ambiguous or incomplete.
- Pass the expanded query to NVIDIA NeMo Retriever for accelerated ingestion, embedding, and reranking.
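Taken together, the extractor stage can be sketched as follows. The helper names (EXTRACTOR_PROMPT, embed, vector_store) and the JSON keys are illustrative assumptions, and rewrite_query() is the LLM helper from the earlier sketch:
```python
import json

def process_user_query(user_query: str) -> list:
    # Steps 1-2: extract the core query and any filter/ranking criteria
    # as JSON (EXTRACTOR_PROMPT is a hypothetical extraction template).
    extraction = json.loads(rewrite_query(user_query, template=EXTRACTOR_PROMPT))
    core_query = extraction["main_query"]
    filters = extraction.get("filters", [])  # illustrative key

    # Step 3: expand the core query with related context (Q2E-style).
    expanded_query = rewrite_query(core_query, template=Q2E_PROMPT)

    # Step 4: embed the expanded query and run the similarity search;
    # the extracted criteria feed the hybrid search or a downstream reranker.
    query_vector = embed(expanded_query)  # stands in for a NeMo Retriever embedding call
    return vector_store.search(query_vector, filters=filters, top_k=20)
```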
Slack is integrated with the backend to enable integration with additional applications and to eliminate the need to develop and maintain a traditional frontend. Several key components ensure seamless communication between Slack users and the backend:
- Real-time event handling: SocketModeHandler delivers Slack events to the backend in real time.
- Modular bot setup: Loads components, connects to the core logic, and sets up event handlers and logging.
- Organized interactive user experience: Posts all replies as threaded messages to minimize clutter and keep conversations organized.
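A minimal sketch of this wiring with the slack_bolt SDK is shown below; answer_with_rag() is a hypothetical hook into the RAG backend, and the tokens come from your Slack app configuration:
```python
import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def handle_mention(event, say):
    answer = answer_with_rag(event["text"])  # hypothetical call into the RAG pipeline
    # Post the reply in a thread to keep the channel organized.
    say(text=answer, thread_ts=event.get("thread_ts", event["ts"]))

if __name__ == "__main__":
    # Socket Mode delivers events in real time without a public HTTP endpoint.
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```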
For the purposes of this post, the architecture shown in Figure 1 is applied to improve search results for NVIDIA GTC 2025 sessions. Query rewriting ensures the semantic similarity search retrieves a more focused set of sessions, as explained with examples in the next section.
How to refine a search query engine with reasoning capabilities
A key challenge that highlights the necessity of query rewriting in a RAG workflow is the semantic gap between the users’ language and the vocabulary of the content. For instance, consider the user query, “Sessions for training an LLM for low-resourced language.” The challenge in this query is the phrase “low-resourced language.”
With this query, the user is looking for sessions related to training multilingual LLMs or Sovereign AI. While numerous GTC 2025 sessions discuss this topic, none of them use the key phrase “low-resourced language.” Rather, more common phrases include “multilingual,” “non-English,” “Sovereign AI,” or specific languages such as “Korean” or “French.” For this reason, using the original query to retrieve and rank the relevant sessions is not likely to produce a satisfying result.
To tackle this problem, we adopted the Q2E technique to rewrite the queries. In this use case, Q2D and CoT query rewriting are not appropriate: the user queries are domain-specific, and a general-purpose LLM lacks the knowledge to create pseudo-documents or context for them, making hallucination likely. A sample Q2E prompt for this use case is shown below.
## Instruction
### Goal
You are given a user query about querying for GTC sessions. Your task is to determine what topic or particular sessions
the user is looking for.
### Steps
1. You should first extract the major request from the user query.
- Understand the main search target in the user query, make sure you know what the user is looking for
- Pay attention to all the details or keywords that are relevant to the main search target and include them.
Please note that it is possible that the user will place the relevant keywords anywhere in the query but not necessarily
right next to the main search target. Please relate ALL relevant search keywords and complete the main search query.
- Include ALL non-filter/non-ranking **descriptive phrases** in `main_query` even if they don't match available
criteria, but **Remove subjective descriptors** like "promising" in `main_query`
- EXCLUDE ALL the filtering and ranking criteria
- **Remove event references** (e.g., "GTC", "SIGGRAPH") from `main_query` even if they appear mid-phrase
2. Provide your understanding/explanation on the main query extracted.
- Write **EXACTLY 1-3 sentences** describing ONLY what the sessions are about, based strictly on the literal words
in `main_query`.
- Use this template:
`"Sessions focused on [exact field from main_query]. These sessions typically discuss [general description of what
such sessions typically cover, elaborating on all KEY PHRASES from the main_query. Where appropriate, briefly mention
common goals, benefits, or general approaches relevant to the topic, as long as they are directly related to the key
phrases and align with common understanding in the field.]."`
- **Do NOT mention any specific techniques, challenges, industries, methods, or examples unless they are explicitly stated
in the main_query.**
- **Do NOT add or infer information that is not present or clearly implied in the main_query.**
- **Elaborate on each key phrase in the main_query, providing context or typical session content that aligns with standard
interpretations in the AI/tech field.**
- **Ensure your explanation is clear, human-like, and aligns with normal human perception and expectations for such
sessions.**
- **Do NOT include any preamble, reasoning, or formatting other than the explanation sentence(s).**
- **Example**:
- User query 1: "Sessions about enabling AI-recommended knowledge articles for customer service agents"
- Explanation 1: "Sessions focused on enabling AI-recommended knowledge articles for customer service agents.
These sessions typically discuss how AI can recommend relevant articles in real time to help agents resolve customer
issues more efficiently."
- User query 2: "Any sessions that introduce large language models (LLMs) and their applications?"
- Explanation 2: "Sessions focused on introducing large language models (LLMs) and their applications. These sessions
typically discuss what LLMs are, how they are developed, and their uses in tasks like text generation, translation,
and summarization."
- User query 3: "Sessions on AI ethics and societal impact in technology"
- Explanation 3: "Sessions focused on AI ethics and societal impact in technology. These sessions typically discuss
ethical considerations in AI development and the broader effects of AI technologies on society."
### Output
Output as the following JSON format
{{
"main_query": "", // string of major requests from the user query. Be as concise as possible while capturing all
the descriptive phrases.
"main_query_explanation": "", // Understanding/explanation on what kind of sessions the user is looking for
based on the main query
}}
## User query
{query}
## Your Final output
```json
{{
YOUR OUTPUT
}}
```
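Because the template escapes its literal braces as {{ }}, it can be filled with str.format and the fenced JSON answer parsed directly. Here is a minimal sketch, reusing the client from the earlier rewrite example (SESSION_PROMPT is an illustrative name for the template above):
```python
import json
import re

def extract_main_query(user_query: str) -> dict:
    prompt = SESSION_PROMPT.format(query=user_query)  # the template above
    raw = client.chat.completions.create(
        model="nvidia/llama-3.3-nemotron-super-49b-v1",  # assumed model ID
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    ).choices[0].message.content
    # Pull the JSON body out of the fenced block the prompt requests.
    match = re.search(r"```json\s*(\{.*\})\s*```", raw, re.DOTALL)
    return json.loads(match.group(1) if match else raw)

result = extract_main_query("Sessions for training an LLM for low-resourced language")
expanded_query = f'{result["main_query"]}. {result["main_query_explanation"]}'
```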
For the sample query, “Sessions for training an LLM for low-resourced language,” query expansion significantly improves the ranking of the most relevant sessions returned by a semantic similarity-based retriever. Table 2 provides more details.
Original query: Sessions for training an LLM for low-resourced language.

Query expansion: Sessions focused on training an LLM for low-resourced language. These sessions typically discuss approaches to develop LLMs when there’s limited training data available for the language.

| Session title | Ranking (original query) | Ranking (query expansion) |
| --- | --- | --- |
| Knowledge Bridging: Building Compute-Efficient, Multilingual Frontier Models for Sovereign AI | 20 | 7 |
| Multi-Domain Large Language Model Adaptation Using Synthetic Data Generation | 73 | 28 |
| Building Generative AI for a Billion Indian Voices | 56 | 51 |

*Table 2. Session rankings from the semantic similarity retriever for the original and expanded queries*
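To see how these rankings shift, you can score every session description against each query variant and compare where a given session lands. Here is a minimal sketch, using an off-the-shelf embedding model as a stand-in for the NeMo Retriever embedding service:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative stand-in

def rank_of(session_title: str, query: str, sessions: list[str]) -> int:
    """1-based rank of session_title when sessions are sorted by similarity."""
    vectors = model.encode([query] + sessions, normalize_embeddings=True)
    scores = vectors[1:] @ vectors[0]  # cosine similarity to the query
    order = np.argsort(-scores)        # best-first ordering
    return int(np.where(order == sessions.index(session_title))[0][0]) + 1

# Compare the two variants for one session:
# rank_of(title, original_query, sessions) vs. rank_of(title, expanded_query, sessions)
```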
Moreover, query expansion helps the reranker focus on a broader but still highly relevant scope during ranking. For example, compare the truncated reasoning tokens of the Llama Nemotron model for the two query variants:
- Original query: “The key phrases are ‘training,’ ‘LLM,’ and ‘low-resourced language’ “
- Query expansion: “The key phrases are ‘low-resourced language,’ ‘limited training data,’ ‘multilingual,’ ‘domain adaptation,’ and so on”
Note that with query expansion, the reranker is better equipped to identify sessions that discuss related concepts, even if they don’t use the exact original query terms. This broader perspective enables the reranker to create a more comprehensive and user-centric ranking, surfacing sessions that provide a deeper understanding of the user’s overall information need.
What are the benefits of query rewriting?
By improving search results through query rewriting, the enhanced pipeline offers a compelling advantage over traditional RAG approaches. The primary advantage comes from intelligently reformulating user queries to add crucial context and detail. This step creates a high-quality, highly relevant candidate pool, which is the biggest factor in the system’s improved performance.
What are the challenges of this approach?
Query rewriting requires AI inference, which is resource-intensive and slower than traditional methods, limiting scalability. Furthermore, LLMs can only process a limited number of documents at one time, necessitating sliding window strategies for large candidate sets. This increases complexity and can hinder global ranking quality.
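A common mitigation is a sliding-window pass over the candidate list. Here is a minimal sketch, assuming a rerank(query, docs) helper that returns one window of documents reordered best-first:
```python
def sliding_window_rerank(query, docs, rerank, window=20, stride=10):
    """Rerank a long candidate list in overlapping windows.

    rerank(query, docs) is an assumed helper limited to `window`
    documents per call (for example, an LLM-based reranker).
    """
    ranked = list(docs)
    # Window start positions, walking from the tail toward the head so
    # strong candidates can bubble upward; always include the head window.
    starts = list(range(max(len(ranked) - window, 0), -1, -stride))
    if starts and starts[-1] != 0:
        starts.append(0)
    for start in starts:
        ranked[start:start + window] = rerank(query, ranked[start:start + window])
    return ranked
```
Overlapping windows let high-scoring documents migrate across window boundaries, at the cost of extra reranker calls, which is why global ranking quality can still suffer on very large candidate sets.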
When to optimize a RAG pipeline
This enhanced RAG pipeline is especially valuable in domains where accuracy and precision matter more than speed, as detailed in Table 3.
| Use case | Benefit of enhancing RAG with reasoning |
| --- | --- |
| Legal document analysis | Query rewriting and expansion help to surface and rank the most relevant precedents for complex cases, improving the quality and reliability of legal analysis. |
| Clinical trial research | In medicine, clinicians can find and prioritize the most applicable research and guidelines for diagnostics or treatment planning, to support better patient outcomes. |
| Risk assessment and decision-making | Up-to-date, contextually relevant information is critical for risk assessment, compliance, and investment decisions. |

*Table 3. Use cases that benefit most from reasoning-enhanced RAG*
Get started enhancing your RAG pipelines
In this post, we introduced an innovative approach to improve RAG pipelines using the reasoning capabilities of the NVIDIA Llama Nemotron family of models. By addressing the limitations of traditional methods, this enhanced architecture enables more effective and user-centric information access, particularly in scenarios demanding high precision and nuanced understanding.
To learn more about the full capabilities of the Llama Nemotron collection of models, see Build Enterprise AI Agents with Advanced Open NVIDIA Llama Nemotron Reasoning Models. You can experiment with NVIDIA NIM models on the NVIDIA API Catalog. Further enhance and accelerate your RAG pipelines with NVIDIA NeMo Retriever and the NVIDIA RAG blueprint.