CLAILGMar 4, 2025

Adversarial Tokenization

arXiv:2503.02174v29 citationsh-index: 41ACL
Originality Highly original
AI Analysis

This reveals a previously unknown vulnerability in subword models, posing a safety problem for LLM users and developers.

The paper tackles the problem of LLM safety by showing that adversarial tokenization, using alternative tokenizations of malicious strings, can effectively evade safety and alignment restrictions without altering the text, achieving competitive results against existing adversarial methods across three state-of-the-art LLMs and datasets.

Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes