CR, LG | Jan 3, 2025

Rerouting LLM Routers

arXiv:2501.01818v1 | 9 citations | h-index: 60 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a safety problem for AI systems that use LLM routers: it exposes vulnerabilities that could increase costs or degrade performance. The advance is incremental because it builds on existing routing concepts.

The paper studies the adversarial robustness of LLM routers, which balance quality and cost by sending each query to an appropriate LLM. It shows that adversaries can use 'confounder gadgets' to force routers to send queries to strong LLMs. The attacks succeed in 100% of cases in white-box settings and in over 90% of cases in black-box settings against various routers.

LLM routers aim to balance quality and cost of generation by classifying queries and routing them to a cheaper or more expensive LLM depending on their complexity. Routers represent one type of what we call LLM control planes: systems that orchestrate use of one or more LLMs. In this paper, we investigate routers' adversarial robustness. We first define LLM control plane integrity, i.e., robustness of LLM orchestration to adversarial inputs, as a distinct problem in AI safety. Next, we demonstrate that an adversary can generate query-independent token sequences we call "confounder gadgets" that, when added to any query, cause LLM routers to send the query to a strong LLM. Our quantitative evaluation shows that this attack is successful both in white-box and black-box settings against a variety of open-source and commercial routers, and that confounding queries do not affect the quality of LLM responses. Finally, we demonstrate that gadgets can be effective while maintaining low perplexity, thus perplexity-based filtering is not an effective defense. We finish by investigating alternative defenses.
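
To make the idea in the abstract concrete, here is a minimal toy sketch of a threshold router and a query-independent prefix search. It is not the paper's method or code. The scoring function, vocabulary, threshold, and the names score, route, and find_gadget are all hypothetical stand-ins chosen for illustration. A real router would use a learned classifier, and a real attack would also have to preserve response quality and low perplexity, which this toy does not model.

```python
# Hypothetical toy vocabulary; words in HARD_WORDS make a query "look complex".
HARD_WORDS = {"prove", "derive", "theorem", "optimize", "complexity"}
VOCAB = sorted(HARD_WORDS | {"hello", "thanks", "the", "weather", "please"})


def score(query: str) -> float:
    """Toy stand-in for a router's 'needs the strong model' score in [0, 1]."""
    words = query.lower().split()
    if not words:
        return 0.0
    return sum(w in HARD_WORDS for w in words) / len(words)


def route(query: str, threshold: float = 0.3) -> str:
    """Send the query to the strong model if its score reaches the threshold."""
    return "strong" if score(query) >= threshold else "weak"


def find_gadget(queries, length: int = 4) -> str:
    """Greedy coordinate search for one prefix that raises the mean score
    over all given queries (so the prefix is not tailored to a single query)."""

    def mean_score(tokens):
        prefix = " ".join(tokens)
        return sum(score(f"{prefix} {q}") for q in queries) / len(queries)

    gadget = ["the"] * length  # neutral starting tokens
    for i in range(length):
        best_tok, best = gadget[i], mean_score(gadget)
        for tok in VOCAB:
            cand = gadget[:i] + [tok] + gadget[i + 1:]
            s = mean_score(cand)
            if s > best:  # keep only strict improvements
                best_tok, best = tok, s
        gadget[i] = best_tok
    return " ".join(gadget)


if __name__ == "__main__":
    queries = ["hello there", "what is the weather today"]
    gadget = find_gadget(queries)
    for q in queries:
        print(route(q), "->", route(f"{gadget} {q}"))
```

In this toy setting, the same prefix moves every simple query from the weak model to the strong one, which is the cost-inflating behaviour the abstract describes. The search here needs full access to the scorer, so it loosely corresponds to the white-box case only.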

