
Updating Classifier Evasion for Vision Language Models


Advances in AI architectures have unlocked multimodal functionality, enabling transformer models to process multiple forms of data in the same context. For instance, vision language models (VLMs) can generate output from combined image and text input, allowing developers to build systems that interpret graphs, process camera feeds, or operate traditionally human interfaces like desktop applications. In some situations, this additional vision modality may process external, untrusted images, and there is a substantial body of prior research on the attack surface of image-processing machine learning systems. In this post, we’ll apply some of these historical ideas to modern architectures to help developers understand the threats and mitigations that come with the vision domain.

Vision language models

VLMs extend the transformer architecture popularized by large language models (LLMs) to accept both text and image input. VLMs can be fine-tuned to caption images, detect and segment objects, and answer questions about images by combining the image and text into a single sequence of tokens processed by the LLM. A widely used open source example is PaliGemma 2. As shown in Figure 1, PaliGemma 2 uses SigLIP to encode the image and project it into a token space compatible with Gemma 2, then concatenates the image tokens with the text tokens before passing them to Gemma.

A diagram showing how PaliGemma 2 accepts image input, which is encoded by the SigLIP image encoder and passed through a linear projection before those image tokens are concatenated with the text tokens and passed to Gemma 2 to generate text output.
Figure 1. The PaliGemma 2 architecture

How much influence can we exert over the LLM if we control the image input? Can we adapt classic adversarial image generation techniques to VLMs? If so, this may impact how we secure systems integrating these VLMs into control flow or physical systems.

Evading image classifiers

In 2014, researchers discovered that human-imperceptible pixel perturbations could be used to control the output of image classification models. Figure 2, from the seminal paper Intriguing properties of neural networks, shows how the images on the left (all distinctly and correctly classified) could be perturbed by the pixel values in the middle column (magnified for illustration) to generate the images on the right, all of which are classified as ostriches. This idea became known as classifier evasion.

A 3x3 grid of images where each row represents an image, a pixel mask, and the modified image that looks identical but has a different classification from a machine learning model.
Figure 2. Adversarial pixel perturbations used to change image classification

As the field of adversarial machine learning evolved, researchers developed increasingly sophisticated attack algorithms and open source tools. Most of these attacks relied on direct access to model gradients (open-box attacks) or approximated gradients through sampling methods (closed-box attacks) to craft perturbations that were both effective and “minimally perceptible”. One simple technique was Projected Gradient Descent (PGD), which formalized adversarial example generation as a constrained optimization problem. PGD iteratively nudges the input in the direction of the gradient, while ensuring that the perturbation remains small to limit perceptibility.
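For intuition, a targeted PGD attack against an ordinary image classifier can be sketched in a few lines of PyTorch. The classifier, target label, step size, and epsilon bound below are illustrative, not taken from the original papers:

import torch
import torch.nn.functional as F

def pgd_targeted(model, image, target_class, epsilon=8/255, step_size=2/255, steps=20):
    """Targeted PGD: push pixels toward target_class while staying in an L-infinity ball."""
    original = image.clone().detach()
    adv = image.clone().detach()

    for _ in range(steps):
        adv.requires_grad_(True)
        logits = model(adv)  # model expects a batched image tensor in [0, 1]
        loss = F.cross_entropy(logits, target_class)  # targeted: minimize loss on the target class
        grad = torch.autograd.grad(loss, adv)[0]

        with torch.no_grad():
            adv = adv - step_size * grad.sign()  # step toward the target class
            adv = original + (adv - original).clamp(-epsilon, epsilon)  # project into the epsilon-ball
            adv = adv.clamp(0, 1)  # keep valid pixel values
    return adv.detach()

# For example, nudging a correctly classified image toward a hypothetical "ostrich" class:
# adv_image = pgd_targeted(classifier, image_batch, torch.tensor([ostrich_class_id]))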

As the research community increasingly sought real-world relevance, the focus shifted toward the threat model itself. In practice, attackers rarely have pixel-level access to an entire image. Instead, they may be able to physically modify only part of an object, while being less constrained by perceptibility. This led to the development of adversarial patches as shown in Figure 3, where the attacker optimizes a localized region of an image that can be printed and physically applied in the real world.

A picture of a banana and a graph showing the classification as “banana”, then a “sticker” placed next to it on the table, and the graph showing “toaster.”
Figure 3. Adding an algorithmically-generated patch flips the classification from “banana” to “toaster”

Let’s adapt these ideas for VLMs.

Building adversarial images for VLMs

We’re going to focus on a specific scenario in which a VLM processes an image of a red traffic light (Figure 4). The VLM prompt is static (“should I stop or go?”), but the attacker has some level of control over the input image. We also focus only on open-box attacks, where the attacker has access to the complete model and input prompt during development and uses them to generate the adversarial input.

A traffic light with the red circle illuminated to signal “stop.”
Figure 4. An unmodified traffic light

In the following examples, we test against this general inference setup where the model is initialized, a processor is defined to handle input formatting, and a fixed system prompt is defined:

import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id, use_fast=True)

prompt = "<image>answer en should I stop or go?"  # formatted as PaliGemma expects

def get_output(image):  # attacker-controlled image
    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
        generation = generation[0][input_len:]  # keep only the newly generated tokens
        decoded = processor.decode(generation, skip_special_tokens=True)
    return decoded
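To establish a baseline, we load the unmodified traffic light and pass it through this helper (the file name here is hypothetical):

from PIL import Image

image = Image.open("red_traffic_light.jpg").convert("RGB")  # the unmodified stop light
print(get_output(image))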

As expected with an unmodified image, the VLM generates “stop” as shown in Figure 5.

Screenshot from a Jupyter Notebook showing the benign stoplight and the model output: “stop.”
Figure 5. Control test showing that the model produced the text “stop”

The traffic light was embedded by SigLIP and projected into token-space. Those tokens were then concatenated with the tokens for “<image>answer en should I stop or go?” before being passed to Gemma, which returned one token: “stop”. In an LLM, we might try some kind of prompt injection to override the system instruction, but in this scenario, we can only control the image while the text is fixed.

Pixel perturbations

When attacking traditional image classification models, the model’s probability output is used to measure loss. Pixel values are modified to reduce the likelihood that the image is correctly classified (an untargeted attack) and, optionally, to maximize the likelihood that the output is a specific class (a targeted attack). We can do the same with PaliGemma 2 by using the output token logits: with greedy decoding, the model always selects the most likely next token. The core ideas in using PGD to generate adversarial samples against PaliGemma are:

  1. We use the tokenizer to identify the desired and undesired outputs. In this case, we want to incentivize generating “go” and disincentivize generating “stop,” so we get their token IDs.
stop_id = processor.tokenizer("stop", add_special_tokens=False).input_ids[0]
go_id = processor.tokenizer("go", add_special_tokens=False).input_ids[0]
  2. We have access to the model’s output logits from a forward pass on the current image, so we can compare the likelihood of the next token being “stop” versus “go.”
logits = outputs.logits
next_token_logits = logits[:, -1, :]  # logits for the first generated token
logit_stop = next_token_logits[:, stop_id]
logit_go = next_token_logits[:, go_id]
  3. We can define a loss function as the difference between the logits for our desired and undesired outputs. This loss function measures how good or bad our image is from the attacker’s perspective.
loss = -(logit_go - logit_stop).mean()

Using those primitives, we run an optimization loop to generate a perturbation mask over the image. As this loop progresses, we can monitor our adversarial image’s logits for “stop” vs. “go.” We see that it doesn’t take much perturbation for “go” to quickly overtake “stop.” This indicates that our modified traffic light will cause PaliGemma 2 to output “go,” as shown in Figure 6.
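A simplified version of that loop, reusing the model, processor, prompt, and token IDs defined above, might look like the following sketch (the step size and perturbation bound are illustrative, and a real attack would also clamp the result to the valid normalized pixel range):

model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
pixel_values = model_inputs["pixel_values"].to(torch.float32)  # optimize the image in float32
original = pixel_values.clone()
epsilon, step_size, steps = 0.05, 0.01, 20  # illustrative values

for step in range(1, steps + 1):
    pixel_values.requires_grad_(True)
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=pixel_values.to(model.dtype),
    )
    next_token_logits = outputs.logits[:, -1, :]
    logit_stop = next_token_logits[:, stop_id]
    logit_go = next_token_logits[:, go_id]
    loss = -(logit_go - logit_stop).mean()  # lower loss means "go" is favored over "stop"
    grad = torch.autograd.grad(loss, pixel_values)[0]

    with torch.no_grad():
        pixel_values = pixel_values - step_size * grad.sign()  # nudge pixels along the gradient
        pixel_values = original + (pixel_values - original).clamp(-epsilon, epsilon)  # keep the change small

    if step % 4 == 0:
        print(f"Step {step}/{steps} | loss={loss.item():.4f} | "
              f"logit_stop={logit_stop.item():.3f} | logit_go={logit_go.item():.3f}")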

Step 4/20 | loss=1.3125 | logit_stop=13.125 | logit_go=11.812
Step 8/20 | loss=-4.1875 | logit_stop=9.062 | logit_go=13.250
Step 12/20 | loss=-6.5938 | logit_stop=6.969 | logit_go=13.562
Step 16/20 | loss=-7.8125 | logit_stop=5.938 | logit_go=13.750
Step 20/20 | loss=-8.1250 | logit_stop=5.562 | logit_go=13.688
Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: “go”. There are some slightly visible artifacts, but the image is still clearly the same stoplight.
Figure 6. A barely perceptible pixel modification flipped the output from “stop” to “go”

The difference with VLMs

Conventional image classifiers were limited to a fixed set of image classes, but with VLMs, we’ve moved into the generative era, where the output can be manipulated into a much broader distribution. In the simplest conventional paradigm for this traffic light scenario, there might be two classes: “stop” and “go,” and every possible input would be classified into those two buckets.

Now, the output is anything that the Gemma LLM can output. Functionally, we’re treating the model as a classifier with as many classes as there are distinct tokens. So, using the same attack generation process as before but optimizing for “eject” instead of “go”, we can generate an output that may not have been considered by the application designers (Figure 7).
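The only change is the target token and the loss; a fragment reusing the variables from the optimization loop above:

# Reward "eject" instead of "go" while still penalizing "stop"
eject_id = processor.tokenizer("eject", add_special_tokens=False).input_ids[0]
logit_eject = next_token_logits[:, eject_id]
loss = -(logit_eject - logit_stop).mean()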

Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: “eject”. There are some slightly visible artifacts, but the image is still clearly the same stoplight.
Figure 7. A barely perceptible pixel modification flipped the output from “stop” to “eject”

When designing a system that might process untrusted images, developers should consider how resilient the rest of the system is to unexpected output. The security and robustness properties of the end-to-end system extend far beyond the core model’s characteristics and include input and output sanitization, guardrails such as NeMo Guardrails, and safety control systems.
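As one small illustration (not a complete defense), a system that only ever expects “stop” or “go” can validate the decoded output against an allowlist and fail safe on anything else:

ALLOWED_ACTIONS = {"stop", "go"}

def safe_action(image) -> str:
    """Route VLM output through an allowlist and fail safe on unexpected output."""
    decoded = get_output(image).strip().lower()
    if decoded not in ALLOWED_ACTIONS:
        # Unexpected output such as "eject" falls back to the safe default
        return "stop"
    return decoded

Note that an allowlist only catches unexpected outputs like “eject”; it does nothing against an attack that flips the output to the valid but incorrect “go,” which is why it should be only one layer in a broader defense.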

Extending the attack

There are many cases where an attacker might have access to a portion of the visual environment without being able to modify pixel values across the entire image. This is easy to understand in the case of cameras, but also true for computer use agents, where the attacker may only have write access to a portion of a screenshot (for example, a banner ad displayed in a browser). In these cases, you can generate adversarial patches by optimizing just the controlled pixels, as shown in Figure 8. For this example, the adversarial input was generated on a white square rather than as a perturbation mask to better simulate a physical sticker.
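One way to restrict the optimization to the attacker-controlled region is to update only the pixels selected by a mask. A sketch reusing the setup above, where the patch location, size, and starting color are illustrative:

clean_pixels = model_inputs["pixel_values"].to(torch.float32)  # unmodified image tensor

# Only the pixels inside the mask are attacker controlled; everything else stays fixed
mask = torch.zeros_like(clean_pixels)
mask[..., 150:200, 40:90] = 1.0  # illustrative patch location in the 224x224 input

patch = torch.ones_like(clean_pixels)  # start from a white square

for _ in range(steps):
    patch.requires_grad_(True)
    patched = clean_pixels * (1 - mask) + patch * mask  # composite the patch onto the image
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=patched.to(model.dtype),
    )
    next_token_logits = outputs.logits[:, -1, :]
    loss = -(next_token_logits[:, go_id] - next_token_logits[:, stop_id]).mean()
    grad = torch.autograd.grad(loss, patch)[0]

    with torch.no_grad():
        patch = patch - step_size * grad.sign() * mask  # update only the patch pixels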

Screenshot from a Jupyter Notebook showing the modified stoplight and the model output: “go”. The stoplight clearly has a small square of random pixels in the bottom left.
Figure 8. A sticker flips the output from “stop” to “go”

These patches are brittle: the success of the attack depends heavily on placement, lighting conditions, camera noise, shadows, and other difficult-to-control variables. In practice, a patch generated this way is unlikely to succeed as a physical sticker attack because its placement and alignment must be nearly pixel-perfect. To build more robust attacks, add Expectation Over Transformation (EOT) to the generation loop by randomly moving or rotating the image, adjusting brightness, and otherwise adding realistic noise during optimization.
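In code, EOT amounts to averaging the loss over several random, realistic transformations of the patched image at each gradient step. A sketch, with illustrative transformation choices and ranges:

import random

import torchvision.transforms.functional as TF

def random_transform(img):
    """Apply a random, plausible real-world transformation to the image tensor."""
    img = TF.rotate(img, random.uniform(-10.0, 10.0))  # slight rotation
    shift = random.randint(-8, 8)
    img = torch.roll(img, shifts=(shift, shift), dims=(-2, -1))  # small translation
    img = img * random.uniform(0.8, 1.2)  # crude brightness jitter
    return img + 0.01 * torch.randn_like(img)  # sensor noise

# Inside each optimization step, average the loss over several sampled transformations
eot_losses = []
for _ in range(8):  # number of EOT samples per step (illustrative)
    transformed = random_transform(patched)
    outputs = model(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        pixel_values=transformed.to(model.dtype),
    )
    next_token_logits = outputs.logits[:, -1, :]
    eot_losses.append(-(next_token_logits[:, go_id] - next_token_logits[:, stop_id]).mean())
loss = torch.stack(eot_losses).mean()  # this averaged loss drives the patch update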

Attackers should also consider their optimization constraints. “Human imperceptible” might be irrelevant in a computer-use scenario where the attacker expects the input to be processed by a fully autonomous system, for instance. The fewer constraints the attacker imposes, the more likely they are to succeed.

Learn more

VLMs extend the existing power and capability of LLMs to unlock many useful multimodal applications, including robotics and computer use agents. Images are part of the VLM prompt and can be used to manipulate model output just like untrusted text. Understanding the history of attacking and defending image classifiers and embedding models can help identify risks and inform mitigations to build robust systems. Images aren’t the only new modality being added to language models that has a history of adversarial machine learning research. Security teams should review older techniques targeting video, audio, and other modalities to assess and improve the resilience of their multimodal AI applications.

Because adversarial examples can be programmatically generated, they should be used to augment training, evaluation, and benchmarking to increase the robustness of resulting systems. Learn more about generating adversarial examples in Exploring Adversarial Machine Learning.

When building agentic systems with VLMs, continue evaluating them based on their autonomy level and threat modeling. Explore the family of NVIDIA VLMs.
