How to Enhance RAG Pipelines with Reasoning Using NVIDIA Llama Nemotron Models

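A key challenge for retrieval-augmented generation (RAG) systems is handling user queries that lack explicit clarity or carry implicit intent. Users often phrase questions imprecisely. For example, consider the user query "Tell me about the latest update in NVIDIA NeMo model training." The user may be implicitly interested in advances in NeMo large language model (LLM) customization features rather than in its speech models. Because this preference is never stated explicitly, the results can be suboptimal.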

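Overcoming these limitations and getting the full value from RAG requires going beyond basic techniques. This post introduces the AI reasoning capabilities of NVIDIA Nemotron LLMs, which can significantly enhance RAG pipelines. Using a real-world use case, we show how advanced strategies such as query analysis and rewriting improve a query engine's search results.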

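What is query rewriting in RAG?

Query rewriting in RAG is the step that transforms a user's initial prompt into a more optimized query to improve information retrieval. It is critical to RAG performance because it bridges the semantic gap between how users ask questions and how information is structured in the knowledge base. By refining the query, the system can overcome problems such as ambiguity or excessive complexity and retrieve more accurate, more relevant documents. This higher-quality context in turn lets the language model generate more accurate, complete, and factually grounded answers.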

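Several effective query rewriting techniques have emerged, particularly ones that use LLMs:

Q2E (Query2Expand): Generates semantically equivalent queries or expansions that cover the different ways the user's information need might be expressed, increasing the chance of retrieving relevant documents.
Q2D (Query2Doc): Constructs a pseudo-document from the original query that mirrors the style and content of the passages to be retrieved, so the query aligns better with how information is stored in the corpus.
CoT (chain-of-thought) query rewriting: Uses a specific prompt to guide the LLM through step-by-step reasoning, breaking down the original query and elaborating on relevant context before giving the expanded query. Unlike direct rewriting, this approach prompts for a lengthy logical explanation, which tends to contain relevant keywords embedded naturally in the reasoning.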

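With these techniques, a RAG system can restructure poorly formed questions, introduce important keywords, and tie user queries more closely to the semantics of the corpus, which substantially improves both search and answer quality.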

Integrating query rewriting into RAG requires prompts tailored to the RAG use case. The following are sample prompts for each approach:

Q2E prompt

Your task is to brainstorm a set of useful search terms and related key phrases that could help locate information 
about the following question. Focus on capturing alternate expressions, synonyms, and specific entities or events 
mentioned in the query.
     Original Question: {query}
     Related Search Keywords:

Q2D prompt

Imagine you are composing a short informative article that directly addresses a given question. Write a detailed passage 
that would help someone fully understand the subject or find an answer to the query.
     Query: {query}
     Passage:

CoT query rewriting prompt

Please carefully consider the following question. First, break down what the question is asking and think through any 
relevant facts, possible interpretations, or required background knowledge. Then, list out important words, concepts, or 
phrases that emerge from your reasoning process, which could help retrieve detailed answers.
     Question: {query}
     Your step-by-step reasoning and expansion terms:
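
Each prompt above has a single {query} placeholder, so filling it can be as simple as a string format call. The following is a minimal Python sketch, not the authors' implementation. The names PROMPTS and build_rewrite_prompt are hypothetical, and the template text is copied from the prompts above with the line wrapping removed.

```python
# Hypothetical helper: PROMPTS and build_rewrite_prompt are illustrative names.
# The template text is taken from the three sample prompts in this post.
PROMPTS = {
    "q2e": (
        "Your task is to brainstorm a set of useful search terms and related key phrases "
        "that could help locate information about the following question. Focus on "
        "capturing alternate expressions, synonyms, and specific entities or events "
        "mentioned in the query.\n"
        "     Original Question: {query}\n"
        "     Related Search Keywords:"
    ),
    "q2d": (
        "Imagine you are composing a short informative article that directly addresses "
        "a given question. Write a detailed passage that would help someone fully "
        "understand the subject or find an answer to the query.\n"
        "     Query: {query}\n"
        "     Passage:"
    ),
    "cot": (
        "Please carefully consider the following question. First, break down what the "
        "question is asking and think through any relevant facts, possible "
        "interpretations, or required background knowledge. Then, list out important "
        "words, concepts, or phrases that emerge from your reasoning process, which "
        "could help retrieve detailed answers.\n"
        "     Question: {query}\n"
        "     Your step-by-step reasoning and expansion terms:"
    ),
}


def build_rewrite_prompt(technique: str, query: str) -> str:
    """Fill the template for the chosen technique ("q2e", "q2d", or "cot")."""
    try:
        template = PROMPTS[technique.lower()]
    except KeyError:
        raise ValueError(f"unknown technique: {technique!r}") from None
    # Only the template is parsed for placeholders; braces in the query are kept as-is.
    return template.format(query=query)


if __name__ == "__main__":
    print(build_rewrite_prompt(
        "cot", "Tell me about the latest update in NVIDIA NeMo model training"
    ))
```

The resulting string is what would be sent to the LLM; the model's completion then becomes the expanded query text for retrieval.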

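How do NVIDIA Nemotron models advance RAG?

The NVIDIA Nemotron family of reasoning and multimodal models, built on the Meta Llama family, provides a suite of LLMs optimized for efficiency, performance, and advanced applications such as RAG and agentic systems. Nemotron is an open family of advanced AI models designed to give enterprise AI agents strong reasoning capabilities, high efficiency, and flexible deployment. Available in three sizes (Nano, Super, and Ultra), the models combine the Meta Llama architecture with NVIDIA's extensive post-training techniques to achieve high accuracy on industry benchmarks.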

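Within the Nemotron family, we found the Llama 3.3 Nemotron Super 49B v1 model best suited to the RAG use case, particularly when weighing inference latency against reasoning capability. Results on the Natural Questions (NQ) dataset show that query rewriting substantially improves retrieval accuracy. Accuracy@K is the proportion of questions for which the correct answer is found in the top K retrieved passages.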

Natural Questions (NQ) dataset	Accuracy@10	Accuracy@20
Original query	43.1%	58.3%
CoT query rewriting with Llama 3.3 Nemotron Super 49B v1	63.8%	74.7%

Table 1. Retrieval performance on the NQ dataset for the original queries and for queries rewritten by Llama Nemotron, using BM25 as the reranker
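
To make the Accuracy@K definition concrete, here is a minimal Python sketch of how the metric could be computed. It assumes that "finding the correct answer" means a case-insensitive substring match between a gold answer and a retrieved passage; the post doesn't state the matching rule used for Table 1. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def accuracy_at_k(
    retrieved: Sequence[Sequence[str]],
    answers: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of questions whose top-k passages contain at least one gold answer.

    retrieved[i] is the ranked passage list for question i (best first);
    answers[i] is the list of acceptable answer strings for question i.
    """
    if len(retrieved) != len(answers):
        raise ValueError("retrieved and answers must have the same length")
    if not retrieved:
        return 0.0
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top_k = passages[:k]
        # Assumption: a hit is a case-insensitive substring match.
        if any(g.lower() in p.lower() for p in top_k for g in golds):
            hits += 1
    return hits / len(retrieved)


if __name__ == "__main__":
    # Toy data, not drawn from NQ.
    retrieved = [
        ["Paris is the capital of France.", "Lyon is a city in France."],
        ["The Nile flows north.", "The Amazon is in South America."],
    ]
    answers = [["paris"], ["amazon"]]
    print(accuracy_at_k(retrieved, answers, k=1))  # only the first question hits
    print(accuracy_at_k(retrieved, answers, k=2))  # both questions hit
```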

RAG pipeline architecture with Llama Nemotron

Figure 1 shows the architecture of the RAG pipeline enhanced with Llama 3.3 Nemotron Super 49B v1.

Figure 1. RAG pipeline enhanced with a Nemotron reasoning model and NVIDIA NeMo Retriever (components shown: Llama Nemotron, NeMo Retriever, and Slack)

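In this architecture, the Llama Nemotron model serves as the query extractor and does the following:

Analyzes the user query to extract the core query. This step refines the user query by removing unnecessary, distracting phrases that could hurt retrieval results.
Analyzes the user query to extract any usable filtering or ranking criteria. Extracted filters can be used in hybrid retrieval search or passed to the reranking model to perform qualitative filtering. Extracted ranking criteria let users define ranking criteria other than relevance.
Expands the core query with relevant context. This can include generating paraphrases, decomposing a complex query into subqueries, or adding background context. Expanding the query this way improves recall and retrieval accuracy, especially when the user query is vague or incomplete.
Passes the expanded query to NVIDIA NeMo Retriever for accelerated ingestion, embedding, and reranking.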
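
The extractor's responsibilities listed above can be pictured as producing a small structured record that the rest of the pipeline consumes. The following Python sketch is only an illustration under assumptions: the ExtractedQuery class, its field names, the apply_filters helper, and the session metadata fields are all hypothetical, and the extraction result is hand-written rather than produced by a model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExtractedQuery:
    """Hypothetical container for what the query extractor returns."""

    main_query: str  # core query with distracting phrases removed
    explanation: str = ""  # expansion text that adds context
    filters: dict[str, str] = field(default_factory=dict)  # e.g. {"level": "beginner"}
    rank_by: str | None = None  # optional ranking criterion other than relevance

    def retrieval_text(self) -> str:
        """Text sent to the retriever: the core query plus its expansion."""
        return " ".join(part for part in (self.main_query, self.explanation) if part)


def apply_filters(sessions: list[dict], filters: dict[str, str]) -> list[dict]:
    """Keep only sessions whose metadata matches every extracted filter."""
    return [s for s in sessions if all(s.get(k) == v for k, v in filters.items())]


if __name__ == "__main__":
    # Hand-written stand-in for the model's output on a query such as
    # "Any beginner talks at GTC on training multilingual LLMs?"
    extracted = ExtractedQuery(
        main_query="training multilingual LLMs",
        explanation="Sessions focused on training multilingual LLMs.",
        filters={"level": "beginner"},
    )
    sessions = [  # toy records with made-up metadata fields
        {"title": "Session A", "level": "beginner"},
        {"title": "Session B", "level": "advanced"},
    ]
    print(extracted.retrieval_text())
    print([s["title"] for s in apply_filters(sessions, extracted.filters)])
```

Keeping the filters separate from the retrieval text reflects the split described above: the expanded text drives semantic search, while the filters constrain or re-order the candidates.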

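Integrating Slack with the backend lets the system connect to other applications without developing and maintaining a traditional frontend. A few key components ensure seamless communication between Slack users and the backend:

Real-time event handling: SocketModeHandler processes events in real time, so user messages reach the backend as they are sent.
Modular bot setup: Loads components, connects them to the core logic, and sets up event handlers and logging.
Organized, interactive user experience: Posting every reply as a threaded message reduces clutter and keeps conversations organized.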
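
As a rough illustration of the components above, the following Python sketch assumes the bot is built with Bolt for Python (the slack_bolt package), which provides a SocketModeHandler class. The post doesn't show its bot code, so the handler, the answer_query placeholder, and the environment variable names are assumptions.

```python
import os


def reply_thread_ts(event: dict) -> str:
    """Return the timestamp to reply under so every answer lands in a thread."""
    # A message already inside a thread carries `thread_ts`;
    # a top-level message only has its own `ts`.
    return event.get("thread_ts") or event["ts"]


def answer_query(text: str) -> str:
    """Hypothetical placeholder for the call into the RAG backend."""
    return f"Searching sessions for: {text}"


def main() -> None:
    # Imported here so the helpers above work without slack_bolt installed.
    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    app = App(token=os.environ["SLACK_BOT_TOKEN"])

    @app.event("app_mention")
    def handle_mention(event, say):
        # Post the reply as a threaded message to keep the channel tidy.
        say(text=answer_query(event.get("text", "")), thread_ts=reply_thread_ts(event))

    # Socket Mode receives events over a WebSocket, so no public HTTP endpoint is needed.
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()


if __name__ == "__main__":
    if "SLACK_BOT_TOKEN" in os.environ and "SLACK_APP_TOKEN" in os.environ:
        main()
    else:
        print("Set SLACK_BOT_TOKEN and SLACK_APP_TOKEN to start the bot.")
```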

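For this post, the architecture in Figure 1 was applied to improve search results for NVIDIA GTC 2025 sessions. Query rewriting ensures that the semantic similarity search retrieves a more focused set of sessions. The next section explains this with an example.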

How to improve a search query engine with reasoning

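A key challenge that makes query rewriting essential in a RAG pipeline is the semantic gap between the user's language and the vocabulary of the content. For example, consider the user query "Sessions for training an LLM for low-resourced language." The difficulty in this query is the term "low-resourced language."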

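With this query, the user is looking for sessions about multilingual LLMs or sovereign AI training. Although many GTC 2025 sessions cover this topic, none of them uses the keyword "low-resourced language." Instead, the more common phrases include "multilingual," "non-English," "sovereign AI," or specific languages such as "Korean" and "French." As a result, retrieving and ranking relevant sessions with the original query is unlikely to give satisfactory results.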

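To address this, we used the Q2E technique to rewrite the query. Q2D and CoT query rewriting are not suitable for this use case: the user queries are domain-specific, and a general-purpose LLM lacks the knowledge needed to create a pseudo-document or to supply context for the query, so the risk of hallucination is high. The following is a sample Q2E prompt for this use case.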

## Instruction
### Goal
You are given a user query about querying for GTC sessions. Your task is to determine what topic or particular sessions 
the user is looking for.
### Steps
1. You should first extract the major request from the user query.
    - Understand the main search target in the user query, make sure you know what the user is looking for
    - Pay attention to all the details or keywords that are relevant to the main search target and include them. 
Please note that it is possible that the user will place the relevant keywords anywhere in the query but not necessarily 
right next to the main search target. Please relate ALL relevant search keywords and complete the main search query.
    - Include ALL non-filter/non-ranking **descriptive phrases**  in `main_query` even if they don't match available 
criteria, but **Remove subjective descriptors** like "promising" in `main_query`
    - EXCLUDE ALL the filtering and ranking criteria
    - **Remove event references** (e.g., "GTC", "SIGGRAPH") from `main_query` even if they appear mid-phrase
2. Provide your understanding/explanation on the main query extracted.
  - Write **EXACTLY 1-3 sentences** describing ONLY what the sessions are about, based strictly on the literal words 
in `main_query`.
- Use this template:
  `"Sessions focused on [exact field from main_query]. These sessions typically discuss [general description of what 
such sessions typically cover, elaborating on all KEY PHRASES from the main_query. Where appropriate, briefly mention
common goals, benefits, or general approaches relevant to the topic, as long as they are directly related to the key 
phrases and align with common understanding in the field.]."`
- **Do NOT mention any specific techniques, challenges, industries, methods, or examples unless they are explicitly stated
in the main_query.**
- **Do NOT add or infer information that is not present or clearly implied in the main_query.**
- **Elaborate on each key phrase in the main_query, providing context or typical session content that aligns with standard
 interpretations in the AI/tech field.**
- **Ensure your explanation is clear, human-like, and aligns with normal human perception and expectations for such 
sessions.**
- **Do NOT include any preamble, reasoning, or formatting other than the explanation sentence(s).**
  - **Example**:
    - User query 1: "Sessions about enabling AI-recommended knowledge articles for customer service agents"
    - Explanation 1: "Sessions focused on enabling AI-recommended knowledge articles for customer service agents. 
These sessions typically discuss how AI can recommend relevant articles in real time to help agents resolve customer 
issues more efficiently."
    - User query 2: "Any sessions that introduce large language models (LLMs) and their applications?"
    - Explanation 2: "Sessions focused on introducing large language models (LLMs) and their applications. These sessions 
typically discuss what LLMs are, how they are developed, and their uses in tasks like text generation, translation, 
and summarization."
    - User query 3: "Sessions on AI ethics and societal impact in technology"
    - Explanation 3: "Sessions focused on AI ethics and societal impact in technology. These sessions typically discuss 
ethical considerations in AI development and the broader effects of AI technologies on society."

### Output
Output as the following JSON format
{{
    "main_query": "", // string of major requests from the user query. Be as concise as possible while capturing all 
the descriptive phrases.
    "main_query_explanation": "", // Understanding/explanation on what kind of sessions the user is looking for 
based on the main query
}}

## User query
{query}

## Your Final output
```json
{{
    YOUR OUTPUT
}}
```
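
The doubled braces in the prompt suggest that it is filled in with Python-style string formatting, where {{ and }} render as literal braces and {query} is the placeholder. Under that assumption, the sketch below shows how the tail of the prompt could be rendered and how the model's JSON answer could be parsed. TEMPLATE_TAIL is an abbreviated stand-in for the full prompt, parse_extraction is a hypothetical helper, and the sample response is hand-written rather than real model output.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this listing readable

# Abbreviated stand-in for the end of the prompt above (comments omitted).
TEMPLATE_TAIL = (
    "### Output\n"
    "Output as the following JSON format\n"
    "{{\n"
    '    "main_query": "",\n'
    '    "main_query_explanation": ""\n'
    "}}\n\n"
    "## User query\n"
    "{query}\n"
)

REQUIRED_KEYS = {"main_query", "main_query_explanation"}


def parse_extraction(response: str) -> dict:
    """Pull the JSON object out of a model response and check its keys."""
    # Prefer the fenced json block; fall back to treating the whole reply as JSON.
    match = re.search(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, response, flags=re.DOTALL)
    payload = match.group(1) if match else response
    data = json.loads(payload)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys in model output: {sorted(missing)}")
    return data


if __name__ == "__main__":
    print(TEMPLATE_TAIL.format(query="Sessions for training an LLM for low-resourced language"))
    sample = (
        "Reasoning text may precede the answer.\n"
        + FENCE + "json\n"
        '{"main_query": "training an LLM for low-resourced language", '
        '"main_query_explanation": "Sessions focused on training LLMs for '
        'low-resourced languages."}\n'
        + FENCE
    )
    print(parse_extraction(sample)["main_query"])
```

Validating the keys before use is a cheap guard, since a reasoning model may occasionally return malformed or partial JSON.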

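For the example query "Sessions for training an LLM for low-resourced language," query expansion substantially improves the ranks that the semantic-similarity retriever assigns to the most relevant sessions. Table 2 provides more detail.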

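Original query: Sessions for training an LLM for low-resourced language.
Query expansion: Sessions focused on training LLMs for low-resourced languages. These sessions typically discuss approaches to developing LLMs for languages with limited training data.
Session title	Rank (original query)	Rank (query expansion)
Knowledge Bridging: Building Compute-Efficient Multilingual Sovereign AI Frontier Models	20	7
Multi-Domain Large Language Model Adaptation Using Synthetic Data Generation	73	28
Building Generative AI Solutions for India's Billion People	56	51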

Table 2. Rank comparison for a representative query, using the original query and the query expansion as input

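In addition, query expansion helps the reranker focus on a broader yet still highly relevant scope during ranking. For example, compare these truncated reasoning tokens from the Llama Nemotron model for the two queries: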

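Original query: "The key phrases are 'training', 'LLM', and 'low-resourced language'"
Query expansion: "The key phrases are 'low-resourced language', 'limited training data', 'multilingual', 'domain adaptation', and so on"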

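Note that with query expansion, the reranker is better at identifying sessions that discuss related concepts even when they do not use the original query terms. This broader view lets the reranker produce a more comprehensive, user-centric ranking and surface sessions that address the user's overall information need.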

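What are the benefits of query rewriting?

By improving search results through query rewriting, the enhanced pipeline offers compelling advantages over a traditional RAG approach. The main benefit comes from intelligently restructuring the user query, which adds crucial context and detail. This step produces a high-quality, highly relevant candidate pool, and it is the largest contributor to the system's performance gains.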

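What are the challenges with this approach?

Query rewriting requires AI inference, which is resource-intensive and slower than traditional methods, limiting scalability. In addition, an LLM can process only a limited number of documents at a time, so large candidate sets require a sliding-window strategy. This adds complexity and can affect the quality of the global ranking.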
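
One common form of the sliding-window strategy reranks overlapping windows from the bottom of the candidate list toward the top, so that strong candidates can move upward one window at a time. The post doesn't specify which variant it uses, so the Python sketch below is an assumption. The rank_window callable stands in for the LLM reranker call, and the integers in the example are toy relevance scores.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def sliding_window_rerank(
    docs: Sequence[T],
    rank_window: Callable[[List[T]], List[T]],
    window: int = 4,
    stride: int = 2,
) -> List[T]:
    """Rerank a long candidate list with a model that sees `window` items at a time.

    Windows are processed back to front and overlap by `window - stride` items,
    which lets the best items of a lower window compete in the next one up.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    ranked = list(docs)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        end -= stride
    return ranked


if __name__ == "__main__":
    # Toy stand-in for the reranker: each "document" is its own relevance score.
    def toy_ranker(items):
        return sorted(items, reverse=True)

    print(sliding_window_rerank([1, 5, 2, 9, 3, 8], toy_ranker, window=4, stride=2))
    # prints [9, 8, 5, 1, 3, 2]
```

In this toy run the two best items reach the top, but the tail is not fully sorted (1 sits ahead of 3 and 2), which illustrates how windowing can affect global ranking quality.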

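When to optimize a RAG pipeline?

As Table 3 shows, this enhanced RAG pipeline is especially valuable in domains where accuracy and precision matter more than speed.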

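Use case	Benefit of adding reasoning to RAG
Legal document analysis	Query rewriting and expansion help find and rank the most relevant precedents for complex cases, improving the quality and reliability of legal analysis.
Clinical trial research	In medicine, clinicians can find and prioritize the most applicable studies and guidelines for diagnosis or treatment planning, improving patient outcomes.
Risk assessment and decision-making	Timely, contextually relevant information is critical for risk assessment, compliance, and investment decisions.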

Table 3. Optimizing the RAG pipeline is beneficial when accuracy matters more than speed

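Get started enhancing your RAG pipeline

In this post, we introduced an approach that improves RAG pipelines with the reasoning capabilities of the NVIDIA Llama Nemotron family of models. By addressing limitations of traditional methods, the enhanced architecture enables more effective, user-centric information access, especially in scenarios that demand high precision and nuanced understanding.

To learn more about the full capabilities of the Llama Nemotron family of LLMs, see Build Enterprise AI Agents with Advanced Open NVIDIA Llama Nemotron Reasoning Models. You can try NVIDIA NIM models in the NVIDIA API catalog. Use NVIDIA NeMo Retriever and the NVIDIA RAG Blueprint to further enhance and accelerate your RAG pipelines.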
