
Develop Secure, Reliable Medical Apps with RAG and NVIDIA NeMo Guardrails

Imagine an application that can sift through mountains of patient data, intelligently searching and answering questions about diagnoses, health histories, and more. This AI-powered virtual “clinical assistant” could streamline a clinician's preparation for a patient appointment, summarize health records, and readily answer queries about an individual patient. Such a system can also be fine-tuned for downstream tasks, such as supporting clinical trials.

With the proliferation of large language models (LLMs), AI-powered solutions are emerging in the healthcare space to help medical professionals quickly extract, summarize, and decipher crucial, potentially life-saving information. However, LLM-based clinical assistant systems face challenges, such as potential errors due to LLM hallucinations and the risk of protected health information (PHI) leaks.

This post explores using a guardrails-infused retrieval-augmented generation (RAG) pipeline to develop an efficient, reliable, and secure virtual assistant for clinicians. The RAG pipeline ensures that the answers the bot presents are grounded in the source data, while guardrails enable fact-checking and prevent hallucinations.

A RAG pipeline for a virtual clinical assistant 

The virtual clinical assistant proposed in this post consists of two main components: 

  • NVIDIA NeMo Guardrails, an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. It is part of NVIDIA NeMo, an end-to-end platform for developing custom generative AI.
  • A RAG system, which enhances LLM prompts with relevant data for more practical, accurate responses.

The NeMo Guardrails implementation will intelligently mediate conversations with clinicians, and the RAG system will generate answers based on clinician queries and patient documents.

Diagram showing a RAG pipeline with NVIDIA components, along with Milvus and GatorTronGPT, which was jointly trained by the University of Florida and NVIDIA.
Figure 1. Virtual clinical assistant RAG pipeline

Tools for building a RAG pipeline

Numerous commercial and open-source tools offer varying functionalities for managing the end-to-end LLM application development lifecycle, including RAG workflows. Building and deploying robust generative AI applications can require efficiently handling large-scale data curation and ingestion into a vector database. To learn more, see How to Take a RAG Application from Pilot to Production in Four Steps.

The system should also be capable of scaling to handle billions of vector queries in real time. Applications can also require LLM training or fine-tuning. Performance-optimized inference to reduce latency is also critical for the end-user experience. 

Building guardrails into the application

NeMo Guardrails orchestrates dialog management, ensuring accuracy, appropriateness, and security in smart applications with LLMs. For example, a model can be instructed not to talk about politics, to respond to specific user requests with predefined parameters, to follow a predefined dialog path, or to use a particular language style.
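
As a concrete illustration, a topical rail like the "don't talk about politics" instruction above can be expressed in Colang, the modeling language NeMo Guardrails uses to define dialog flows. The utterances below are illustrative examples, not an official configuration:

```
define user ask off_topic
  "What do you think about the election?"
  "Who should I vote for?"

define bot refuse off_topic
  "I'm a clinical assistant, so I can only help with patient-related questions."

define flow off_topic
  user ask off_topic
  bot refuse off_topic
```

At runtime, user messages semantically similar to the example utterances are matched to the `ask off_topic` intent, and the flow steers the model to the predefined refusal rather than letting it improvise.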

NeMo Guardrails serves as the RAG system’s backbone, adding reliability and security. It protects the application from attacks that exploit common language model vulnerabilities while maintaining answer quality through fact-checking and hallucination detection. NeMo Guardrails is also programmable and extensible, enabling developers to add application- and domain-specific guardrails. In essence, guardrails empower you to harness the immense potential of generative AI while minimizing risks and maximizing positive outcomes.

The RAG system efficiently retrieves patient information and generates accurate answers using an OpenAI Ada embedding model, the GPU-optimized vector database Milvus, and a prompt-tuned LLM. The embedding model processes the retrieved documents and stores them in the Milvus vector database for efficient indexing and retrieval. You can optionally select an LLM pretrained with medical knowledge and further fine-tune the model to maximize its performance on medical tasks.
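
The retrieval step can be sketched in a few lines of Python. This toy version uses a bag-of-words "embedding" and an in-memory cosine-similarity search as stand-ins for the Ada embedding model and Milvus; the function names are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for the OpenAI Ada embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Stand-in for a Milvus top-k vector search over ingested patient records.
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    # Augment the LLM prompt with the retrieved context (the "A" in RAG).
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In the real pipeline, `embed` would call the embedding model, and `retrieve` would issue a similarity search against the Milvus collection holding the ingested patient documents.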

Preserving data and privacy

To protect privacy and avoid working with real patient data for this task, the NVIDIA team created artificial patient records that are otherwise realistic. Clinical data was generated using an iterative sampling procedure with GatorTronGPT.

Top-p (nucleus) sampling and temperature sampling were applied to balance the diversity and quality of clinical text generation, which is limited to 2,048 tokens for this task. Figure 2 shows a snippet of the synthetically generated clinical data that is ingested into the Milvus vector database used to populate the RAG pipeline.
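
The two sampling strategies combine as follows: temperature rescales the model's logits before the softmax (lower values sharpen the distribution, higher values flatten it), and top-p then keeps only the smallest set of tokens whose cumulative probability reaches the threshold. A minimal sketch, with toy logits in place of real model outputs:

```python
import math
import random

def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    # Softmax over temperature-scaled logits; T < 1 sharpens, T > 1 flattens.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exp = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exp.values())
    return {tok: v / total for tok, v in exp.items()}

def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    # Keep the smallest set of tokens whose cumulative probability reaches
    # top_p, then renormalize (nucleus sampling).
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def sample(logits: dict[str, float], temperature: float, top_p: float, rng=random) -> str:
    # One decoding step: scale, truncate the tail, then draw a token.
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(list(probs), weights=list(probs.values()))[0]
```

A generation loop would call `sample` once per token until an end-of-sequence token appears or the 2,048-token budget is exhausted.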

Screenshot of synthetically generated data, including a pathology report and a radiology report.
Figure 2. Excerpts of synthetically generated data used to populate the RAG pipeline

Inference process

When clinician requests are received, the NeMo Guardrails server mediates the conversation between the user and the system through five different rails at various steps, keeping the generative model on track. The five rails are input rails, dialog rails, retrieval rails, execution rails, and output rails, each explained in more detail below.

Input rails sanitize the user input before passing it to the dialog rail, which can involve masking sensitive information or completely rejecting text that is toxic or not on topic. 

Dialog rails use the user input and the dialog history to orchestrate the system flow. This starts with what the user wants (intent generation) and proceeds to fulfilling that intent, which may involve a retrieval task or calling another tool using execution rails. 

A guardrails-enabled retrieval rail fortifies the RAG application so that it does not return unrelated or sensitive patient documents, or anything else flagged for exclusion in the retrieval requests sent by the dialog rail.
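
One simple form of retrieval rail filters the raw vector-search hits before they reach the prompt. The chunk format and threshold below are assumptions for illustration:

```python
def retrieval_rail(chunks, authorized_patient_id, min_score=0.5):
    """Drop retrieved chunks that belong to a different patient or score
    below a relevance threshold (values are illustrative).

    `chunks` are assumed to be (patient_id, score, text) triples returned
    by the vector search."""
    return [
        text
        for patient_id, score, text in chunks
        if patient_id == authorized_patient_id and score >= min_score
    ]
```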

Execution rails ensure the appropriate use of tools, as required. For example, for a math query, an execution rail would be responsible for extracting the relevant math expression that can be sent to a tool like Wolfram Alpha, and then appropriately replacing the expression with the tool’s result. 
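
The math example can be sketched as below. Instead of calling an external service like Wolfram Alpha, this stand-in evaluates the extracted expression with a restricted AST walker, so nothing unsafe is executed; the extraction regex is deliberately naive:

```python
import ast
import operator
import re

# Only basic arithmetic operators are permitted.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    # Recursively evaluate a numeric expression tree, rejecting anything else.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def execution_rail(message: str) -> str:
    # Extract the arithmetic expression (naive: multi-digit spans of numbers
    # and operators) and replace it with the tool's result.
    match = re.search(r"\d[\d\s\+\-\*/\.\(\)]*\d", message)
    if not match:
        return message
    result = _eval(ast.parse(match.group().strip(), mode="eval").body)
    return message.replace(match.group(), str(result))
```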

Finally, the output rails consolidate the generated answer and eliminate possible hallucinations.
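
A toy version of such a hallucination check flags an answer when too few of its sentences are supported by the retrieved source text. The word-overlap heuristic and thresholds are illustrative; NeMo Guardrails' actual fact-checking rails use an LLM-based check:

```python
def output_rail(answer: str, sources: list[str], min_support: float = 0.5) -> str:
    # Flag a possible hallucination when too few answer sentences overlap
    # with the retrieved source text (thresholds are illustrative).
    source_words = set(" ".join(sources).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return answer

    def supported(sentence: str) -> bool:
        words = set(sentence.lower().split())
        return len(words & source_words) / len(words) >= 0.6

    support = sum(supported(s) for s in sentences) / len(sentences)
    if support < min_support:
        return "I could not verify that answer against the patient record."
    return answer
```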

High-level flow through programmable guardrails in NeMo Guardrails.
Figure 3. High-level flow of programmable guardrails in NeMo Guardrails 

Get started

For an in-depth guide on how to filter harmful and inaccurate responses from your LLM application with cutting-edge tools like GatorTronGPT, the Milvus vector database, and NVIDIA NeMo Guardrails, watch AI Safety Defenders: Reinforcing Medical Boundaries with Guardrails.
