Secure LLM Tokenizers to Maintain Application Integrity

This post is part of the NVIDIA AI Red Team’s continuing vulnerability and technique research. Use the concepts presented to responsibly assess and increase the security of your AI development and deployment processes and applications.

Large language models (LLMs) don’t operate over strings. Instead, prompts are passed through an often-transparent translator called a tokenizer that creates an array of token IDs from the provided prompt string. Likewise, the tokenizer processes the LLM output (an array of token IDs) back into readable text.

Insufficient validation when initializing tokenizers can enable malicious actors to corrupt token encoding and decoding, creating a difference between user-readable input and output and LLM computations.

Attackers may target tokenizers for several reasons. While tokenizers are initially trained, they’re also often reused. One tokenizer may be used for hundreds of derivative models. While models are often retrained or fine-tuned, the tokenizer is usually static. Tokenizers are also plaintext files that are easily understandable and editable by humans, unlike model binaries. This enables sufficiently privileged attackers to make tailored modifications with minimal additional tools.

This post presents a weakness in a tokenizer implementation that would enable sufficiently privileged attackers to compromise system integrity. It explores how this technique would fit into a larger exploitation chain and offers some mitigation strategies.

Security context

A tokenizer should be a bijection: a one-to-one mapping between a set of strings and a set of integers (token IDs). However, this isn’t enforced in common tokenizer implementations such as AutoTokenizer. The tokenizer is initialized from a .json file, either from an on-disk cache or from the Hugging Face Hub, often transparently to the user based on model-specific configuration values. A sufficiently privileged attacker can modify that .json file to remap any string to any token value, even violating the bijection assumption.
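Because the bijection is not enforced at load time, it can be checked separately. The following is a minimal sketch, assuming the string-to-ID mapping has already been extracted from the tokenizer file as a flat JSON object. The function name check_vocab and the shape of its report are hypothetical and are not part of any tokenizer library.

```python
import json
from collections import defaultdict


def check_vocab(vocab_json: str) -> dict:
    """Report violations of the one-string-to-one-ID assumption in a vocab mapping."""
    duplicate_keys = []

    def record_pairs(pairs):
        # The json module passes every (key, value) pair of an object to this
        # hook, so repeated keys are visible here before dict() drops them.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicate_keys.append(key)
            seen.add(key)
        return dict(pairs)

    vocab = json.loads(vocab_json, object_pairs_hook=record_pairs)

    # Group strings by token ID to find IDs claimed by more than one string.
    strings_by_id = defaultdict(list)
    for token, token_id in vocab.items():
        strings_by_id[token_id].append(token)
    shared_ids = {tid: toks for tid, toks in strings_by_id.items() if len(toks) > 1}

    return {"duplicate_strings": duplicate_keys, "shared_ids": shared_ids}


print(check_vocab('{"deny": 3499, "allow": 3499}'))
```

A check like this only detects collisions. A remapping that swaps two IDs keeps the mapping one-to-one and would pass, so it complements rather than replaces comparing the file against a known-good copy.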

For instance, the fast tokenizer for the bert-base-uncased model is defined by this .json configuration. It is loaded with code like tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") and cached locally in ~/.cache/huggingface/hub/. A malicious user with write access to either the remote repository or that local cache directory can now control token mappings. With this access, the attacker can perform either an encoding or decoding attack.

The input deny all unauthorized users gets tokenized as [101, 9772, 2035, 24641, 5198, 102]. 101 and 102 are special tokens marking the beginning and end of the string. Otherwise, in this case, each of the input words is mapped to one token, with deny being mapped to 9772. By modifying the tokenizer .json file before the tokenizer is initialized, it’s possible to remap deny to 3499—the value for allow. Now in the .json file there will be two strings mapped to 3499 (both deny and allow) and no strings mapped to 9772, as shown below.

Original mapping:

{
	"deny": 9772,
	"allow": 3499
}

Modified mapping:

{
	"deny": 3499,
	"allow": 3499
}

With this modification, deny all unauthorized users is tokenized as [101, 3499, 2035, 24641, 5198, 102] as shown in Figure 1. This means that the LLM will treat this input string as allow instead of deny, a potentially critical delta between the user’s intent (expressed in natural language) and the model’s understanding (expressed in token IDs)—that is, an encoding attack.
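The effect of the remapping can be reproduced without the real tokenizer. The sketch below uses a toy whitespace tokenizer and only the token IDs quoted in this post. The encode function and the two vocabulary dictionaries are illustrative assumptions and do not reflect how AutoTokenizer is implemented.

```python
CLS_ID, SEP_ID = 101, 102  # special tokens marking the start and end of the string

# Only the five words and IDs used in the example are included.
ORIGINAL_VOCAB = {"deny": 9772, "allow": 3499, "all": 2035, "unauthorized": 24641, "users": 5198}
# The attacker's edit: "deny" now points at the ID the model learned for "allow".
TAMPERED_VOCAB = {**ORIGINAL_VOCAB, "deny": 3499}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Toy encoder: one lowercase, whitespace-separated word per token."""
    return [CLS_ID] + [vocab[word] for word in text.lower().split()] + [SEP_ID]


prompt = "deny all unauthorized users"
print(encode(prompt, ORIGINAL_VOCAB))  # [101, 9772, 2035, 24641, 5198, 102]
print(encode(prompt, TAMPERED_VOCAB))  # [101, 3499, 2035, 24641, 5198, 102]
```

With the tampered vocabulary, the IDs for the deny prompt are identical to the IDs the original vocabulary produces for allow all unauthorized users, which is the delta the model acts on.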

With this configuration change alone, decoding is ambiguous: this array of tokens may be decoded as either allow all unauthorized users or deny all unauthorized users. Note that this violates the bijection assumption by mapping two strings to a single token ID, but the configuration could also be modified in the opposite direction, where multiple token IDs are mapped to a single string. If the malicious user remaps a different token ID, this may increase the stealth of their attack by ensuring the decoded string matches the original intent.

The same access and mechanism can also be used to perform a decoding attack, where the remapping is intended to corrupt the output of the model. That is, the model arrived at the correct result of deny/9772, but the tokenizer mistakenly prints allow for consumption by downstream users or applications.
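A decoding attack can be sketched the same way. The example below assumes a toy decoder that simply inverts the string-to-ID mapping. The tampered vocabulary is a hypothetical edit that swaps the two IDs, and the tie-breaking rule for shared IDs is an assumption of this sketch, not documented library behavior.

```python
SPECIAL_IDS = {101, 102}  # start and end markers, dropped from the decoded text

ORIGINAL_VOCAB = {"deny": 9772, "allow": 3499, "all": 2035, "unauthorized": 24641, "users": 5198}
# Hypothetical decoding attack: ID 9772 is relabeled so that it prints as "allow".
# "deny" is moved to 3499 so that every ID still has exactly one string.
TAMPERED_VOCAB = {**ORIGINAL_VOCAB, "deny": 3499, "allow": 9772}


def decode(token_ids: list[int], vocab: dict[str, int]) -> str:
    """Toy decoder: invert the vocab and join the words with spaces."""
    # If two strings share an ID, the one listed later in the vocab wins here.
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return " ".join(id_to_token[i] for i in token_ids if i not in SPECIAL_IDS)


model_output = [101, 9772, 2035, 24641, 5198, 102]  # the model produced "deny"
print(decode(model_output, ORIGINAL_VOCAB))  # deny all unauthorized users
print(decode(model_output, TAMPERED_VOCAB))  # allow all unauthorized users
```

In this toy decoder, a vocabulary where deny and allow share ID 3499 decodes that ID to whichever string appears last, which illustrates why the result of a collision should not be relied on.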

Remember that the model has already been trained, so the weights are frozen with an “understanding” of a specific mapping of strings to tokens. For a fixed set of model weights, 3499 means allow. However, the tokenizer is a plaintext file that’s much easier to modify than model weights, and by doing so, the attacker can create an exploitable delta between user input/output and model “understanding.”

Attack vectors

This technique is likely to be chained after others to maximize the probability of success. For instance, a script that modifies the tokenizer .json file could be placed in the Jupyter startup directory, thereby modifying the tokenizer when new notebooks are launched and before the pipeline is initialized. Tokenizer files could also be modified during a container build process as part of a supply chain attack to impact the resulting service.

This technique can also be triggered by modifying cache behavior, for example, by referencing a different cache directory under the attacker’s control. It’s therefore important that integrity verifications happen at runtime, not just for configurations at rest.
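One way to perform such a runtime check is to hash the tokenizer file that is about to be loaded and compare it against a digest pinned when the file was reviewed. The following is a minimal sketch using only the Python standard library. The function names, and the idea that the expected digest comes from a trusted source such as a signed manifest, are assumptions of this example.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tokenizer_file(path: Path, expected_sha256: str) -> None:
    """Raise if the file on disk does not match the pinned digest."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise RuntimeError(
            f"Tokenizer file {path} has digest {actual}, expected {expected_sha256}"
        )
```

The check is most useful when it runs in the same process immediately before the tokenizer is initialized and against the path that is actually resolved at that moment, since an attacker who redirects the cache directory would otherwise bypass a check performed on the default location.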

Recommendations

Models are increasingly and rightfully the target of supply chain and asset inventory considerations such as the recently announced OpenSSF Model Signing SIG. Direct tokenizer manipulation is a reminder to strongly version and audit tokenizers and other artifacts in your application pipeline as well (especially if you’re inheriting a tokenizer as an upstream dependency).

Also consider the implications of direct tokenizer manipulation on logging. If only input and output strings are logged, odd behavior resulting from tokenizer manipulation may not be clear during forensic operations.
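As an illustration, an audit record could carry the token IDs alongside the strings, so that a mismatch between what the user typed and what the model received remains visible afterward. This is a sketch under the assumption that the application has access to both the strings and the ID arrays. The logger name and record fields are hypothetical.

```python
import json
import logging

logger = logging.getLogger("llm.audit")  # hypothetical logger name


def audit_record(prompt: str, input_ids: list[int],
                 output_ids: list[int], completion: str) -> str:
    """Serialize both the human-readable strings and the token IDs exchanged with the model."""
    return json.dumps(
        {
            "prompt": prompt,
            "input_ids": input_ids,
            "output_ids": output_ids,
            "completion": completion,
        },
        sort_keys=True,
    )


def log_exchange(prompt: str, input_ids: list[int],
                 output_ids: list[int], completion: str) -> None:
    """Write one audit line per model call."""
    logger.info(audit_record(prompt, input_ids, output_ids, completion))
```

With records like these, an investigator can re-encode the logged prompt using a known-good tokenizer and compare the result against the logged input_ids.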

Conclusion

Direct tokenizer manipulation is a subtle yet powerful attack vector that can have significant consequences for the security and integrity of LLMs. By modifying the tokenizer’s .json configuration, malicious actors can create a delta between the user’s intended input and the model’s understanding, or corrupt the output of the model. It’s crucial to recognize the importance of tokenizer security and implement robust measures to prevent such attacks, including strong versioning and auditing of tokenizers, runtime integrity verifications, and comprehensive logging practices. Acknowledging the potential risks and taking proactive steps to mitigate them can ensure the reliability and trustworthiness of LLMs in a wide range of applications.

To learn more about AI security, stay tuned for the AI Red Team’s upcoming NVIDIA Deep Learning Institute course, Exploring Adversarial Machine Learning.