Develop Secure, Reliable Medical Apps with RAG and NVIDIA NeMo Guardrails

May 15, 2024

By Siddha Ganju

AI-Generated Summary

A virtual clinical assistant can be developed using a retrieval-augmented generation (RAG) pipeline with NVIDIA NeMo Guardrails to ensure accurate and secure responses to clinician queries.
The RAG pipeline uses an OpenAI Ada embedding model and a GPU-optimized vector database, Milvus, to efficiently retrieve patient information and generate accurate answers.
NVIDIA NeMo Guardrails provides a programmable and extensible framework for adding guardrails to the RAG system, enabling fact-checking, hallucination detection, and protection against potential attacks.

Imagine an application that can sift through mountains of patient data, intelligently searching and answering questions about diagnoses, health histories, and more. This AI-powered virtual “clinical assistant” could streamline preparation for an appointment with a patient, summarize health records, and readily answer queries about an individual patient. Such a system could also be fine-tuned to support downstream tasks, such as clinical trials.

With the proliferation of large language models (LLMs), AI-powered solutions are emerging in the healthcare space to help medical professionals quickly extract, summarize, and decipher crucial, potentially life-saving information. However, LLM-based clinical assistant systems face challenges, such as potential errors due to LLM hallucinations and the risk of protected health information (PHI) leaks.

This post explores using a guardrails-infused retrieval-augmented generation (RAG) pipeline to develop an efficient, reliable, and secure virtual assistant for clinicians. The RAG pipeline ensures that answers presented by the bot are sourced from the underlying patient data, while guardrails enable fact-checking and help prevent hallucinations.

A RAG pipeline for a virtual clinical assistant

The virtual clinical assistant proposed in this post consists of two main components:

NVIDIA NeMo Guardrails, an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. It is part of NVIDIA NeMo, an end-to-end platform for developing custom generative AI.
A RAG system, which enhances LLM prompts with relevant data for more practical, accurate responses.

The NeMo Guardrails implementation will intelligently mediate conversations with clinicians, and the RAG system will generate answers based on clinician queries and patient documents.
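
As a minimal sketch of how a RAG system enhances a prompt with relevant data, the snippet below assembles a prompt from a clinician question and a few retrieved passages. The function name `build_rag_prompt`, the prompt wording, and the sample passages are illustrative assumptions and are not taken from the system described in this post.

```python
from typing import List


def build_rag_prompt(question: str, passages: List[str], max_passages: int = 3) -> str:
    """Assemble an LLM prompt that grounds the answer in retrieved passages.

    Hypothetical helper for illustration; the prompt wording is an assumption.
    """
    # Keep only the top-ranked passages so the prompt stays within budget.
    selected = passages[:max_passages]
    # Number each passage so the answer can be traced back to its source.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))
    return (
        "Answer the clinician's question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Synthetic example passages, not real patient data.
    docs = [
        "Patient A: diagnosed with type 2 diabetes in 2019.",
        "Patient A: prescribed metformin 500 mg twice daily.",
    ]
    print(build_rag_prompt("What medication is Patient A taking?", docs))
```

Instructing the model to admit when the context is insufficient is one simple way to discourage answers that are not backed by the retrieved documents.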

Tools for building a RAG pipeline

Numerous commercial and open-source tools offer varying functionalities for managing the end-to-end LLM application development lifecycle, including RAG workflows. Building and deploying robust generative AI applications can require efficiently handling large-scale data curation and ingestion into a vector database. To learn more, see How to Take a RAG Application from Pilot to Production in Four Steps.

The system should also be capable of scaling to handle billions of vector queries in real time. Applications may additionally require LLM training or fine-tuning, and performance-optimized inference that reduces latency is critical for the end-user experience.

Building guardrails into the application

NeMo Guardrails orchestrates dialog management, ensuring accuracy, appropriateness, and security in smart applications with LLMs. For example, a model can be instructed not to talk about politics, to respond to specific user requests with predefined parameters, to follow a predefined dialog path, or to use a particular language style.

NeMo Guardrails serves as the RAG system’s backbone, adding reliability and security. It protects the application from attacks exploiting common language model vulnerabilities while maintaining answer quality through fact-checking and hallucination detection. NeMo Guardrails is also programmable and extensible, enabling developers to add unique application and domain-specific guardrails. In essence, NeMo Guardrails empowers you to harness the immense potential of generative AI, minimizing risks and maximizing positive outcomes.

The RAG system efficiently retrieves patient information and generates accurate answers using an OpenAI Ada embedding model, the GPU-optimized vector database Milvus, and a prompt-tuned LLM. The embedding model converts the patient documents into vectors, which are stored in the Milvus vector database for efficient indexing and retrieval. You can optionally select an LLM pretrained with medical knowledge and further fine-tune the model to maximize its performance on medical tasks.
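
The embed, store, and retrieve flow can be sketched in a few lines. So that the example runs without external services, it swaps in a toy hashed bag-of-words embedding for the OpenAI Ada model and a brute-force in-memory store for Milvus. The names `embed` and `InMemoryVectorStore` and the sample texts are hypothetical, and this is not how either product works internally.

```python
import hashlib
import math
import re
from typing import List, Tuple

DIM = 64  # toy dimensionality; real embedding models use far more


def embed(text: str) -> List[float]:
    """Toy hashed bag-of-words embedding, standing in for a real embedding model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib gives a hash that is stable across Python processes.
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Brute-force stand-in for a vector database such as Milvus."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, List[float]]] = []

    def insert(self, text: str) -> None:
        # Embed once at ingestion time and keep the vector next to the text.
        self._rows.append((text, embed(text)))

    def search(self, query: str, top_k: int = 2) -> List[Tuple[float, str]]:
        q = embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), t) for t, v in self._rows]
        return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    # Synthetic snippets, not real patient data.
    store.insert("Patient A was prescribed metformin 500 mg twice daily.")
    store.insert("Patient A reports seasonal allergies.")
    for score, text in store.search("metformin dose for Patient A"):
        print(f"{score:.2f}  {text}")
```

A production vector database replaces the linear scan in `search` with an approximate nearest-neighbor index, which is what makes retrieval over very large collections practical.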

Preserving data and privacy

To protect privacy and avoid working with real patient data for this task, the NVIDIA team created artificial patient records that are otherwise realistic. Clinical data was generated using an iterative sampling procedure with GatorTron GPT.

Top-p sampling (nucleus sampling) and temperature sampling were applied to balance the diversity and quality of clinical text generation. The clinical text generation is limited to 2,048 tokens for this task. Figure 2 shows a snippet of the synthetically generated clinician data that is ingested into the Milvus vector database used to populate the RAG pipeline.
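
To make the two sampling controls concrete, here is a minimal, library-free sketch of choosing one token with temperature scaling followed by top-p (nucleus) filtering. The function name, default values, and toy logits are assumptions for illustration. The settings actually used to generate the synthetic records are not stated in this post.

```python
import math
import random
from typing import Dict, Optional


def sample_next_token(
    logits: Dict[str, float],
    temperature: float = 0.8,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token using temperature scaling followed by nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # Nucleus (top-p): keep the smallest set of most likely tokens whose
    # cumulative probability reaches top_p, then sample within that set.
    nucleus, cumulative = [], 0.0
    for p, tok in probs:
        nucleus.append((p, tok))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens = [tok for _, tok in nucleus]
    return rng.choices(tokens, weights=[p for p, _ in nucleus], k=1)[0]


if __name__ == "__main__":
    toy_logits = {"fever": 2.0, "cough": 1.5, "rash": 0.2, "zebra": -3.0}
    print(sample_next_token(toy_logits, rng=random.Random(0)))
```

Lower temperature and lower top-p favor quality by concentrating on likely tokens, while higher values favor diversity. A length cap such as the 2,048-token limit would be enforced by the generation loop that calls this function repeatedly.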

Inference process

When clinician requests are received, the NeMo Guardrails server mediates the conversation between the user and the system through five different rails at various steps. This keeps the generative model on track. The five rails are input rails, dialog rails, retrieval rails, execution rails, and output rails, each explained in more detail below.

Input rails sanitize the user input before passing it to the dialog rail, which can involve masking sensitive information or completely rejecting text that is toxic or not on topic.
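
The masking and rejection behavior of an input rail can be illustrated with a small, library-independent sketch. The regular expressions, the blocked keywords, and the function name `input_rail` are assumptions made for this example and do not reflect how NeMo Guardrails implements its input rails.

```python
import re
from typing import Iterable, Optional

# Illustrative patterns only; real PHI detection needs far broader coverage.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}
BLOCKED_TOPICS = ("politics", "election")  # hypothetical off-topic keywords


def input_rail(user_text: str, blocked: Iterable[str] = BLOCKED_TOPICS) -> Optional[str]:
    """Return sanitized text, or None if the input should be rejected outright."""
    lowered = user_text.lower()
    if any(topic in lowered for topic in blocked):
        return None  # reject off-topic input before it reaches the dialog rail
    sanitized = user_text
    for label, pattern in PHI_PATTERNS.items():
        # Replace each match with a placeholder naming what was masked.
        sanitized = pattern.sub(f"[{label}]", sanitized)
    return sanitized


if __name__ == "__main__":
    print(input_rail("Call 555-123-4567 about the lab results."))
    print(input_rail("Who will win the election?"))
```

Returning `None` for rejected input lets the caller decide how to respond, for example with a fixed refusal message.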

Dialog rails use the user input and the dialog history to orchestrate the system flow. This starts with what the user wants (intent generation) and proceeds to fulfilling that intent, which may involve a retrieval task or calling another tool using execution rails.
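
A simplified sketch of this flow follows: determine the intent from the current message and the dialog history, then route to a handler that fulfills it. Keyword matching stands in for LLM-based intent generation, and the intent names, keywords, and handlers are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical intents and trigger keywords; a real dialog rail would typically
# ask an LLM to generate the intent instead of matching keywords.
INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "calculate": ("calculate", "compute", "how many"),
    "lookup_patient": ("diagnosis", "history", "medication", "record"),
}


def detect_intent(user_text: str, history: List[str]) -> str:
    """Pick an intent from the message, falling back to the previous turn's intent."""
    lowered = user_text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    # A follow-up such as "And last year?" inherits the previous intent.
    return history[-1] if history else "unknown"


def dialog_rail(
    user_text: str,
    history: List[str],
    handlers: Dict[str, Callable[[str], str]],
) -> str:
    """Detect the intent, record it in the history, and dispatch to its handler."""
    intent = detect_intent(user_text, history)
    history.append(intent)
    handler = handlers.get(intent, handlers["unknown"])
    return handler(user_text)


if __name__ == "__main__":
    handlers = {
        "lookup_patient": lambda t: f"[retrieval] {t}",
        "calculate": lambda t: f"[tool] {t}",
        "unknown": lambda t: "Could you rephrase the request?",
    }
    history: List[str] = []
    print(dialog_rail("What is the patient's diagnosis?", history, handlers))
    print(dialog_rail("And last year?", history, handlers))
```

Keeping a history of intents is what allows a short follow-up question to be routed the same way as the turn before it.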

A guardrails-enabled retrieval rail fortifies the RAG application. When the dialog rail sends a retrieval request, this rail ensures the application does not return unrelated or sensitive patient documents, or anything else that has been specified as off limits.
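
One way to picture a retrieval rail is as a filter applied to search results before they reach the LLM. The sketch below assumes each retrieved document carries a patient identifier, a similarity score, and a `restricted` flag. These fields, the threshold, and the function name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RetrievedDoc:
    patient_id: str
    text: str
    score: float  # similarity score from the vector search
    restricted: bool = False  # hypothetical flag for documents that must not be surfaced


def retrieval_rail(
    docs: List[RetrievedDoc], patient_id: str, min_score: float = 0.3
) -> List[RetrievedDoc]:
    """Drop documents that are about another patient, restricted, or weakly related."""
    kept = [
        d
        for d in docs
        if d.patient_id == patient_id and not d.restricted and d.score >= min_score
    ]
    # Most relevant first, so downstream prompt building can truncate safely.
    return sorted(kept, key=lambda d: d.score, reverse=True)


if __name__ == "__main__":
    # Synthetic records, not real patient data.
    results = [
        RetrievedDoc("A", "Prescribed metformin.", 0.82),
        RetrievedDoc("B", "Prescribed insulin.", 0.79),
        RetrievedDoc("A", "Psychotherapy notes.", 0.75, restricted=True),
        RetrievedDoc("A", "Parking validation.", 0.05),
    ]
    for doc in retrieval_rail(results, patient_id="A"):
        print(doc.text)
```

Filtering on the patient identifier after the search is the simplest approach to show. Vector databases generally also support applying such metadata filters during the search itself.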

Execution rails ensure the appropriate use of tools, as required. For example, for a math query, an execution rail would be responsible for extracting the relevant math expression that can be sent to a tool like Wolfram Alpha, and then appropriately replacing the expression with the tool’s result.
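
The math example can be sketched as follows: find an arithmetic expression in the text, send it to a tool, and substitute the result. A small AST-based evaluator stands in for an external tool such as Wolfram Alpha. The regular expression and function names are assumptions, and the pattern is deliberately naive (it would also match digit groups joined by hyphens, for example).

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate +, -, *, / arithmetic without eval(); stands in for an external tool."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr.strip(), mode="eval").body)


# A number (optionally wrapped in parentheses), then one or more
# "operator number" pairs, for example "12 * (3 + 4)".
_NUM = r"\(*\d+(?:\.\d+)?\)*"
_EXPR = re.compile(_NUM + r"(?:\s*[-+*/]\s*" + _NUM + r")+")


def execution_rail(text: str) -> str:
    """Replace each arithmetic expression in the text with its computed value."""

    def substitute(match: "re.Match[str]") -> str:
        try:
            return f"{safe_eval(match.group(0)):g}"
        except (ValueError, SyntaxError, ZeroDivisionError):
            return match.group(0)  # leave the text untouched if the tool fails

    return _EXPR.sub(substitute, text)


if __name__ == "__main__":
    print(execution_rail("The total is 12 * (3 + 4) tablets."))
```

Falling back to the original text when evaluation fails keeps a tool error from corrupting the response.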

Finally, the output rails consolidate the generated answer and eliminate possible hallucinations.
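
As a rough illustration of an output check, the sketch below keeps only the answer sentences whose content words mostly appear in the retrieved context. This word-overlap heuristic is a naive stand-in chosen for readability. It is not the fact-checking or hallucination-detection method used by NeMo Guardrails, and the threshold, stopword list, and fallback message are assumptions.

```python
import re
from typing import List, Set

_STOPWORDS = {"the", "a", "an", "is", "was", "of", "to", "and", "in", "on", "for", "with", "has"}
FALLBACK = "I could not verify an answer from the patient records."


def _content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with common stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in _STOPWORDS}


def output_rail(answer: str, context: List[str], min_support: float = 0.6) -> str:
    """Keep only answer sentences whose content words mostly appear in the context."""
    context_words = _content_words(" ".join(context))
    kept = []
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _content_words(sentence)
        if not words:
            continue
        support = len(words & context_words) / len(words)
        if support >= min_support:
            kept.append(sentence)
    return " ".join(kept) if kept else FALLBACK


if __name__ == "__main__":
    # Synthetic context, not real patient data.
    ctx = ["Patient A was prescribed metformin in 2019."]
    draft = "Patient A was prescribed metformin. Patient A also had heart surgery."
    print(output_rail(draft, ctx))
```

Word overlap cannot detect a sentence that reuses the right words to state something false, which is why practical systems typically rely on a model-based check instead.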

Get started

For an in-depth guide on how to filter harmful and inaccurate responses from your LLM application with cutting-edge tools like GatorTron GPT, Milvus vector database, and NVIDIA NeMo Guardrails, watch AI Safety Defenders: Reinforcing Medical Boundaries with Guardrails.

About the Authors

About Siddha Ganju
Siddha Ganju, whom Forbes featured in their 30 under 30 list, leads AI innovation in LLM and Guardrails along with the deployment of medical instruments for NVIDIA partners. Siddha previously worked in the self-driving teams for simulation, perception, scalable training, and inference along with global automotive partnerships and go-to-market strategies. A graduate of Carnegie Mellon University, her prior work ranges from Visual Question Answering to Generative Adversarial Networks to gathering insights from CERN's petabyte-scale data and has been published at top-tier conferences including CVPR and NeurIPS. Siddha co-founded and mentors the Learn-To-Race team at Carnegie Mellon University, which landed at the podium for high-speed racing. Siddha authored O'Reilly's 600-page book on Practical Deep Learning for Cloud, Mobile, and Edge. With its strong reception, the book is being translated into five languages, all in less than a year of publication.