Discovering the Hidden Vocabulary of DALLE-2
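This work highlights a potential security vulnerability in large-scale generative models: seemingly nonsensical prompts can trigger specific visual content, which creates a risk of misuse and complicates interpretability. The contribution is incremental, however, in that it exposes quirks of one specific model.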
The researchers discovered that DALLE-2 appears to have a hidden vocabulary of seemingly random text tokens that often generate specific visual concepts such as birds or bugs, revealing potential security and interpretability issues.
We discover that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts. For example, it seems that \texttt{Apoploe vesrreaitais} means birds and \texttt{Contarra ccetnxniams luryca tanniounons} (sometimes) means bugs or pests. We find that these prompts are often consistent in isolation, and sometimes also in combination. We present our black-box method to discover words that seem random but have some correspondence to visual concepts. This creates important security and interpretability challenges.
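One way to make the notion of a prompt being "consistent" concrete is to label each image generated from a prompt with the concept it depicts and report the share of images that agree with the most common label. The sketch below illustrates that idea only. It is not the discovery method referred to above, the function name `consistency` is hypothetical, and the labels are invented placeholders rather than measurements.

```python
from collections import Counter


def consistency(labels):
    """Return (dominant_label, fraction of images carrying that label)."""
    if not labels:
        raise ValueError("need at least one label")
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Invented labels for illustration only (NOT results from any experiment):
# each list stands for human-assigned concepts of images from one prompt.
toy_labels = {
    "Apoploe vesrreaitais": ["bird", "bird", "bird", "bird"],
    "Contarra ccetnxniams luryca tanniounons": ["bug", "bug", "other", "bug"],
}

scores = {prompt: consistency(lbls) for prompt, lbls in toy_labels.items()}
for prompt, (label, frac) in scores.items():
    print(f"{prompt!r}: dominant concept {label!r}, agreement {frac:.2f}")
```

A score near 1 would indicate that a prompt maps to a single concept in isolation. The same function could be applied to images generated from combined prompts to check whether the mapping survives composition.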