Retrieval-augmented generation (RAG) is a technique that combines information retrieval with a set of carefully designed system prompts to provide more accurate, up-to-date, and contextually relevant responses from large language models (LLMs). By incorporating data from various sources such as relational databases, unstructured document repositories, internet data streams, and media news feeds, RAG can significantly improve the value of generative AI systems.
Developers must consider a variety of factors when building a RAG pipeline: from LLM response benchmarking to selecting the right chunk size.
In this post, I demonstrate how to build a RAG pipeline using NVIDIA AI Endpoints for LangChain. First, you create a vector store by downloading web pages and generating their embeddings using the NVIDIA NeMo Retriever embedding microservice and searching for similarity using FAISS. I then showcase two different chat chains for querying the vector store. For this example, I use the NVIDIA Triton Inference Server documentation, though the code can be easily modified to use any other source.
For more information and to follow along, see the Build a RAG chain by generating embeddings for NVIDIA Triton Inference Server documentation notebook.
Tutorial prerequisites
To make the best use of this tutorial, you need basic knowledge of LLM training and inference pipelines along with the following resources:
- LangChain
- NVIDIA AI Foundation Endpoints
- A vector store
What is RAG and how does it empower LLMs?
Here’s why it’s important to augment LLMs with a RAG pipeline.
Under the hood, LLMs are neural networks, typically measured by how many parameters they contain. An LLM’s parameters essentially represent the general patterns of how humans use words to form sentences.
That deep understanding, sometimes called parameterized knowledge, makes LLMs useful in responding to general prompts at light speed. However, it does not serve users who want a deeper dive into a current or more specific topic.
RAG fills a gap in how LLMs work (Figure 1). Its purpose is to link generative AI services to external resources, especially ones rich in the latest technical details. The original paper called RAG “a general-purpose fine-tuning recipe” because it can be used by nearly any LLM to connect with practically any external resource.
Reduce LLM hallucinations and improve model responses
RAG offers numerous benefits in the realm of language model technology, particularly when it comes to empowering LLMs with up-to-date information. By integrating a search function that retrieves relevant data from external sources, RAG ensures that LLMs are equipped with the most current knowledge available, thereby enhancing the accuracy and relevance of their responses (Figure 2).
RAG also provides a solution to the challenge of data privacy, as it enables LLMs to generate responses without requiring direct access to sensitive data. This is achieved by retrieving only the necessary information from external sources, thereby minimizing the risk of data breaches.
RAG has the potential to mitigate the issue of LLM hallucinations, which refers to the generation of inaccurate or misleading information due to the limitations of the model’s training data. By providing LLMs with real-time access to external sources, RAG can help reduce the likelihood of hallucinations and improve the overall reliability of the model’s responses.
Considerations for RAG implementation
RAG presents some challenges that must be addressed to fully realize its potential. One such challenge is ensuring that the prompt used to retrieve information from external sources is well-crafted and accurately reflects the user’s intent. A poorly constructed prompt may result in irrelevant or incomplete information being retrieved, which can negatively impact the quality of the LLM’s response.
Another challenge is determining how to evaluate the success of RAG in a meaningful and objective manner. Metrics such as response accuracy and relevance are important, but they may not fully capture the nuances of RAG’s impact on LLM performance.
Finally, optimizing RAG to maximize its benefits and minimize its challenges is an ongoing process that requires careful consideration of factors such as search algorithm efficiency, information retrieval relevance, and LLM integration. For more information about other challenges, see the Seven Failure Points When Engineering a Retrieval Augmented Generation System paper and 12 RAG Pain Points and Proposed Solutions article.
By addressing these challenges, RAG has the potential to significantly enhance the capabilities of LLMs and unlock new possibilities for natural language processing applications.
What is LangChain used for?
LangChain is an open-source framework that simplifies the development of applications using LLMs. It provides tools and abstractions to improve the customization, accuracy, and relevancy of LLMs, enabling the creation of various applications like chatbots, question-answering, content generation, and summarizers.
LangChain consists of LangChain Libraries, LangChain Templates, LangServe, and LangSmith, offering interfaces, integrations, reference architectures, and a developer platform for building and deploying LLM-powered applications. The framework includes standard interfaces for Model I/O, Retrieval, and Agents, enabling you to integrate data sources, LLMs, and tools to build complex applications.
LangChain is part of a rich ecosystem of tools, supported by an active community, and simplifies AI development by abstracting the complexity of data source integrations.
Setup
To get started, create a free account with the NVIDIA API catalog and follow these steps:
- Select any model.
- Choose Python, Get API Key.
- Save the generated key as NVIDIA_API_KEY.
From there, you should have access to the endpoints.
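The LangChain NVIDIA AI Endpoints package reads the key from the NVIDIA_API_KEY environment variable. A minimal way to set it in a Python session, prompting for the key only if it is not already exported in your shell, looks like this:

import getpass
import os

# Prompt for the key only if it is not already present in the environment
if not os.environ.get("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")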
The full notebook for this post is available as part of the NVIDIA Generative AI Examples GitHub repo.
First, install LangChain, NVIDIA AI Endpoints, and FAISS.
pip install langchain
pip install langchain_nvidia_ai_endpoints
pip install faiss-gpu
Baseline
To establish a baseline for comparison, first evaluate the response using a vanilla LLM:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
llm = ChatNVIDIA(model="ai-llama2-70b", max_tokens=1000)
Now try asking a basic question about NVIDIA Triton Inference Server.
result = llm.invoke("What interfaces does Triton support?")
print(result.content)
Triton, a product offering from Cohesity, is a data protection and management solution designed to simplify and streamline the backup and recovery of data across various environments. Triton supports several interfaces, including…
The baseline LLM does not recognize Triton Inference Server within the context of the question and responds with incorrect information. Try adding NVIDIA to improve the prompt.
result = llm.invoke("What interfaces does NVIDIA Triton support?")
print(result.content)
NVIDIA Triton is a platform for AI model inference serving, and it supports several interfaces to provide flexibility and ease of integration with various applications and frameworks.
It now recognizes Triton Inference Server, but doesn’t give much detail.
result = llm.invoke("But why?")
print(result.content)
These interfaces ensure that NVIDIA Triton can be easily integrated into various AI and machine learning projects, providing a scalable and high-performance inference serving solution.
I follow these guidelines to ensure that I provide valuable, safe, and respectful assistance to you. As a helpful, reliable, and trustworthy assistant, I am committed to promoting positive and fair interactions. By avoiding harmful, unethical, prejudiced, or negative content, I aim to create a helpful and secure environment for all users.
Looking at the second sentence, you see that the LLM hallucinates. Now, add external sources of information to the LLM to augment it for your prompts and evaluate whether it improves the accuracy of the responses.
Read HTML and split text
Load a list of web pages containing technical documentation about NVIDIA Triton Inference Server in preparation for the embedding vector store.
# List of web pages containing NVIDIA Triton technical documentation
urls = [
    "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html",
    "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html",
    "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html",
    "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_analyzer.html",
    "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html",
]

documents = []
for url in urls:
    document = html_document_loader(url)
    documents.append(document)
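The html_document_loader helper is defined in the accompanying notebook. A minimal sketch of such a loader, assuming requests and BeautifulSoup (beautifulsoup4) as dependencies, could look like the following:

import re

import requests
from bs4 import BeautifulSoup


def html_document_loader(url: str) -> str:
    """Fetch a web page and return its visible text content."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script and style elements that carry no documentation text
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text()
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()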
Split into chunks
Next, split the documents into separate chunks. Make sure to pay attention to the chunk_size parameter in TextSplitter. Setting the right chunk size is critical for RAG performance, as much of a RAG pipeline's success is based on the retrieval step finding the right context for generation.
The retrieval step typically examines smaller chunks of the original text rather than all documents. The entire prompt (retrieved chunks plus the user query) must fit within the LLM's context window. Don't make the chunks too big, and balance them against the estimated size of the user query.
For example, while OpenAI LLMs have a context window of 8–32K tokens, Llama2 is limited to 4K tokens. If the chunks are too small, there is a risk that vital information might not be among the top retrieved chunks due to high granularity. On the other hand, if the chunks are too big, they may not fit in the LLM context window, slowing down the system.
To address this, build an ensemble retrieval over different chunk sizes and benchmark the results to find the optimal value. By looping the ensemble on a set of test queries, you can calculate the mean reciprocal rank (MRR) for each chunk size, for a more informed decision on the optimal chunk size for the RAG system.
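Here is a rough sketch of that benchmarking loop, assuming the urls and documents loaded earlier and a small hand-built test set mapping queries to the pages expected to answer them. The mrr_for_chunk_size helper and test_queries are illustrative names, not part of the notebook:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Illustrative test set: each query maps to the page expected to contain the answer
test_queries = {
    "What interfaces does Triton support?": "architecture.html",
    "Where does Triton look for models?": "model_repository.html",
}

def mrr_for_chunk_size(chunk_size: int) -> float:
    """Build a temporary vector store at this chunk size and score retrieval with MRR."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks, metadatas = [], []
    for url, document in zip(urls, documents):
        pieces = splitter.split_text(document)
        chunks.extend(pieces)
        metadatas.extend([{"source": url}] * len(pieces))
    store = FAISS.from_texts(chunks, embedding=NVIDIAEmbeddings(), metadatas=metadatas)

    reciprocal_ranks = []
    for query, expected_page in test_queries.items():
        hits = store.similarity_search(query, k=10)
        rank = next(
            (i + 1 for i, hit in enumerate(hits) if expected_page in hit.metadata["source"]),
            None,
        )
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

for size in (128, 256, 512, 1024):
    print(f"chunk_size={size}: MRR={mrr_for_chunk_size(size):.2f}")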
Experiment with different chunk sizes, but typical values should be 100-600, depending on the LLM.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
)
texts = text_splitter.create_documents(documents)
Generate embeddings
Next, generate embeddings using NVIDIA AI Foundation endpoints and save embeddings to an offline vector store in the /embed directory for future re-use.
For this task, use FAISS, a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM.
import os

from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embeddings = NVIDIAEmbeddings()
dest_embed_dir = "embed"

for url, document in zip(urls, documents):
    chunks = text_splitter.split_text(document)
    # attach the source URL as metadata to every chunk
    metadatas = [{"source": url}] * len(chunks)
    # create embeddings and add them to the vector store
    if os.path.exists(dest_embed_dir):
        update = FAISS.load_local(folder_path=dest_embed_dir, embeddings=embeddings)
        update.add_texts(chunks, metadatas=metadatas)
        update.save_local(folder_path=dest_embed_dir)
    else:
        docsearch = FAISS.from_texts(chunks, embedding=embeddings, metadatas=metadatas)
        docsearch.save_local(folder_path=dest_embed_dir)
Load embeddings from the vector store
Create the embeddings model using the NVIDIA NeMo Retriever text embedding model. This model represents words, phrases, or other entities as vectors of numbers and understands the relation between words and phrases.
embedding_model = NVIDIAEmbeddings()
embedding_path = "embed/"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model)
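As a quick, illustrative sanity check (not part of the original notebook), you can query the store directly and inspect which chunks are retrieved:

# Retrieve the chunks most similar to a sample query
docs = docsearch.similarity_search("What interfaces does Triton support?", k=4)
for doc in docs:
    print(doc.metadata.get("source"), doc.page_content[:200])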
First, create a ConversationalRetrievalChain using ChatNVIDIA. In this chain, I demonstrate the use of a single LLM.
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.conversational_retrieval.prompts import QA_PROMPT
from langchain.memory import ConversationBufferMemory

llm = ChatNVIDIA(model="ai-llama2-70b", temperature=0.1, max_tokens=1000, top_p=1.0)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=docsearch.as_retriever(),
    chain_type="stuff",
    memory=memory,
    combine_docs_chain_kwargs={"prompt": QA_PROMPT},
)
Ask any question about Triton Inference Server
Now try the same prompts and gauge the difference.
query = "What interfaces does Triton support?"
result = qa({"question": query})
print(result.get("answer"))
NVIDIA Triton Inference Server supports several interfaces, including HTTP/REST and gRPC inference protocols. These interfaces are based on the community-developed KServe protocol. Additionally, Triton Inference Server provides a C API and Java API, which allows it to link directly into your application for edge and other in-process use cases. It also supports various deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, and RAPIDS FIL, among others.
With a minimal prompt, the LLM now recognizes Triton Inference Server correctly and responds with technical detail.
query = "But why?"
result = qa({"question": query})
print(result.get("answer"))
Triton Inference Server supports various interfaces and frameworks to provide flexibility and ease of use for developers and data scientists. By supporting multiple deep learning and machine learning frameworks such as TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more, Triton Inference Server allows users to deploy any AI model that best suits their project requirements. This interoperability helps in streamlining the AI inferencing process across different platforms, including cloud, data center, edge, and embedded devices, and ensures optimized performance for various query types. Furthermore, Triton Inference Server supports HTTP/REST and gRPC inference protocols, which are widely used and recognized in the industry, making it a versatile solution for AI inference serving.
The LLM no longer hallucinates because it now has knowledge of the domain.
A more complex chain
Now create a more complex chain with two LLMs, one for summarization and another for chat. This improves the overall result in more complicated scenarios. Use Llama2 70B for the first LLM and Mixtral for the chat element in the chain. Add a question_generator to generate relevant query prompts.
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.chains.question_answering import load_qa_chain

llm = ChatNVIDIA(model="ai-llama2-70b")
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT)

chat = ChatNVIDIA(model="ai-mixtral-8x7b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)
doc_chain = load_qa_chain(chat, chain_type="stuff", prompt=QA_PROMPT)

qa = ConversationalRetrievalChain(
    retriever=docsearch.as_retriever(),
    combine_docs_chain=doc_chain,
    memory=memory,
    question_generator=question_generator,
)
Here’s the next query and response:
query = "What interfaces does Triton support?"
result = qa({"question": query})
print(result.get("answer"))
The supported interfaces in Triton for inference are:
* HTTP/REST
* GRPC
* Binary Tensor Data Extension
* Classification Extension
* Generate Extension
* Logging Extension
* Model Configuration Extension
* Model Repository Extension
* Schedule Policy Extension
* Sequence Extension
* Shared-Memory Extension
* Statistics Extension
* Trace Extension
These interfaces allow for a wide range of use cases, including real-time inference, batched inference, ensemble inference, and audio/video streaming. Additionally, Triton provides a Backend API that allows developers to add custom backends and pre/post-processing operations, as well as support for model pipelines using Ensembling or Business Logic Scripting (BLS).
The response is now summarized in bullets. Next, ask the follow-up question:
query = "But why?"
result = qa({"question": query})
print(result.get("answer"))
Triton supports many interfaces to provide flexibility and versatility for its users. By supporting multiple deep learning and machine learning frameworks, Triton allows developers to use their framework of choice, making it easier for them to integrate Triton into their existing workflows. This flexibility enables Triton to be used in a wide range of applications, from real-time inference to batch processing, and from cloud-based services to edge devices.
And the follow-up response is also more concise.
How to evaluate
Choosing the right metric for evaluating results is critical for RAG systems. It is not enough to gauge accuracy alone. Relevancy, an absence of hallucinations, and runtime metrics such as average response time must also be considered.
LangChain includes evaluators for such metrics:
- Criteria Evaluator checks correctness by assigning a score of 1-5 along with a reasoning for the score. It can also evaluate the relevance of a response considering the context.
- Embedding Distance Evaluator evaluates the embedding similarity of the response.
- Exact Match Evaluator compares the response to a reference label.
By using these metrics, you can ensure that your RAG systems are accurate, relevant, and free from hallucinations, while also providing fast and efficient responses.
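As a hedged illustration of how these evaluators can be wired up with LangChain's load_evaluator helper (the prediction and reference strings below are made up for the example; llm and NVIDIAEmbeddings come from earlier in the pipeline):

from langchain.evaluation import load_evaluator

prediction = "Triton supports HTTP/REST and gRPC inference protocols."
reference = "Triton Inference Server supports HTTP/REST and gRPC protocols based on KServe."

# Embedding distance: lower scores mean the response is semantically closer to the reference
distance_evaluator = load_evaluator("embedding_distance", embeddings=NVIDIAEmbeddings())
print(distance_evaluator.evaluate_strings(prediction=prediction, reference=reference))

# Exact match: checks whether the response matches the reference label verbatim
exact_evaluator = load_evaluator("exact_match")
print(exact_evaluator.evaluate_strings(prediction=prediction, reference=reference))

# Labeled criteria: an LLM judge grades correctness against the reference
criteria_evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=llm)
print(criteria_evaluator.evaluate_strings(
    prediction=prediction,
    reference=reference,
    input="What interfaces does Triton support?",
))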
Conclusion
RAG has emerged as a powerful approach, combining the strengths of LLMs and dense vector representations. By using dense vector representations, RAG models can scale efficiently, making them well-suited for large-scale enterprise applications.
As LLMs continue to evolve, it is clear that RAG will play an increasingly important role in driving innovation and delivering high-quality, intelligent systems that can understand and generate human-like language.
When building your own RAG pipeline, it's important to correctly split the vector store documents into chunks by optimizing the chunk size for your specific content, and to select an LLM with a suitable context length. In some cases, complex chains of multiple LLMs may be required. To optimize RAG performance and measure success, use a collection of robust evaluators and metrics.
To get started, the full notebook for this post is available as part of the NVIDIA Generative AI Examples repository. For more information about additional models and chains, see NVIDIA AI LangChain endpoints.