CR · AI · Apr 24

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

arXiv:2604.22888 · 89.1 · Has Code
Predicted impact top 6% in CR · last 90 days
Originality Incremental advance
AI Analysis

For LLM agent security, this work addresses a new and severe threat (skill poisoning) with a detection method that outperforms existing approaches, though it is domain-specific.

The paper tackles pre-execution detection of skill poisoning in LLM agents, where attackers hide malicious instructions inside otherwise legitimate skills. RouteGuard, a frozen-backbone detector that fuses response-conditioned attention with hidden-state alignment, reaches 0.8834 F1 on the critical Skill-Inject channel slice and recovers 90.51% of description attacks missed by lexical screening.

Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.
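
To make the two ideas in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes attention is available as one distribution over context tokens per response token, that the skill occupies a known token span, and that each detector branch comes with a scalar reliability score. The names `attention_share` and `gated_late_fusion`, and the softmax-style gate, are hypothetical choices for illustration only.

```python
import math

def attention_share(attn_rows, skill_span):
    """Fraction of response-time attention mass that lands on the skill span.

    attn_rows:  one attention distribution over context tokens per
                response token (assumed input format).
    skill_span: (start, end) half-open token indices of the skill text.
    """
    start, end = skill_span
    total = sum(sum(row) for row in attn_rows)
    on_skill = sum(sum(row[start:end]) for row in attn_rows)
    return on_skill / total if total > 0 else 0.0

def gated_late_fusion(score_attn, score_hidden, rel_attn, rel_hidden):
    """Combine two detector scores, weighting each by its reliability.

    Weights are a softmax over the two reliability values; this gate is
    an assumption of the sketch, not the gate described in the paper.
    """
    w_attn = math.exp(rel_attn)
    w_hidden = math.exp(rel_hidden)
    return (w_attn * score_attn + w_hidden * score_hidden) / (w_attn + w_hidden)

# Toy example: two response tokens attending over three context tokens,
# where the skill occupies the last position.
rows = [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]]
share = attention_share(rows, (2, 3))
fused = gated_late_fusion(share, 0.5, rel_attn=1.0, rel_hidden=0.0)
print(round(share, 3), round(fused, 3))
```

A high share means the response attends mostly to the skill span rather than to trusted context, which is the attention-hijacking pattern the abstract describes. The gate lets the more reliable branch dominate the fused score; with equal reliabilities it reduces to a plain average.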

