Black-box Optimization of LLM Outputs by Asking for Directions
This work addresses security vulnerabilities in deployed LLMs. It shows that more capable models express better-calibrated confidence and are therefore more vulnerable to this attack, a significant concern for AI safety and deployment.
The paper tackles black-box attacks on large language models (LLMs) by exploiting their ability to express confidence in natural language, which enables adversarial optimization without access to continuous model outputs such as logits or confidence scores. It generates malicious inputs in three scenarios: adversarial examples for vision-LLMs, jailbreaks, and prompt injections. This expands the attack surface for deployed LLMs that expose only textual outputs.
We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. Existing black-box attacks either require access to continuous model outputs such as logits or confidence scores (which are rarely available in practice) or rely on proxy signals from other models. Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks, and prompt injections. Our attacks successfully generate malicious inputs against systems that expose only textual outputs, thereby dramatically expanding the attack surface for deployed LLMs. We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox in which improvements in model capability directly enhance vulnerability. Our code is available at this [link](https://github.com/zj-jayzhang/black_box_llm_optimization).
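To illustrate what optimizing against a system that exposes only textual outputs can look like, here is a minimal sketch of a comparison-driven random search. It is not the implementation from the linked repository. It assumes a hypothetical oracle that answers only with the words "first" or "second", and it replaces the LLM with a toy function so the example runs offline. All names (`TARGET`, `toy_text_oracle`, `optimize_by_comparison`) and all parameter values are illustrative assumptions.

```python
import random

# Hypothetical hidden optimum. The optimizer never reads this value directly;
# it only sees the oracle's one-word textual answers.
TARGET = (3.0, -2.0)


def squared_distance(point, target=TARGET):
    """Squared Euclidean distance, used by the toy oracle and for evaluation."""
    return sum((p - t) ** 2 for p, t in zip(point, target))


def toy_text_oracle(first, second):
    """Toy stand-in (an assumption) for a black-box model that returns only text.

    It answers "first" when the first candidate is strictly closer to the
    hidden target, and "second" otherwise. A real system would be queried
    with a natural-language prompt instead.
    """
    if squared_distance(first) < squared_distance(second):
        return "first"
    return "second"


def optimize_by_comparison(oracle, start, steps=500, step_size=0.5, seed=0):
    """Random search that keeps a proposal only when the oracle prefers it."""
    rng = random.Random(seed)
    best = list(start)
    for _ in range(steps):
        # Perturb the current best candidate by a small random offset.
        proposal = [x + rng.uniform(-step_size, step_size) for x in best]
        # The only feedback is a textual comparison, never a numeric score.
        if oracle(proposal, best) == "first":
            best = proposal
    return best


start = [0.0, 0.0]
result = optimize_by_comparison(toy_text_oracle, start)
```

Because a proposal is accepted only when the oracle prefers it, the candidate never moves farther from the hidden target in this toy setting. With a real model the answers would be noisy, which is why the calibration of the expressed confidence matters for whether such a search makes progress.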