Agent-X: Full Pipeline Acceleration of On-device AI Agents
Agent-X reduces the high latency of AI agents on edge devices, a critical bottleneck for practical deployment.
Agent-X accelerates on-device LLM-based agents by 1.61x end-to-end with no accuracy loss, using prompt rewriting for prefix caching and LLM-free speculative decoding.
LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X has two key components: prompt rewriting that lets prefix caching exploit agent-specific input-token patterns, and LLM-free speculative decoding that generates tokens quickly with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup on real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, this is the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
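The abstract names the two mechanisms but does not describe their internals, so the following is only a minimal sketch of the general ideas under stated assumptions, not Agent-X's implementation. It assumes that prompt rewriting amounts to placing unchanging segments ahead of changing ones so that consecutive agent calls share a longer cacheable prefix, and that "LLM-free" drafting can be approximated by n-gram lookup in the existing context. All names (`rewrite_prompt`, `shared_prefix_len`, `draft_from_context`, `verify`) are hypothetical, and tokens are plain strings for readability.

```python
from typing import Callable, List, Sequence, Tuple


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading tokens two prompts share (what a prefix cache can reuse)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def rewrite_prompt(segments: List[Tuple[str, bool]]) -> List[str]:
    """Assumed rewriting rule: static segments first, dynamic segments last.

    Each segment is (text, is_static). Relative order within each group is kept.
    """
    static = [text for text, is_static in segments if is_static]
    dynamic = [text for text, is_static in segments if not is_static]
    return static + dynamic


def draft_from_context(context: List[str], ngram: int = 2, max_draft: int = 4) -> List[str]:
    """Draft tokens without a draft model (assumed n-gram lookup).

    Finds the most recent earlier occurrence of the trailing n-gram and
    proposes the tokens that followed it.
    """
    if len(context) < ngram:
        return []
    tail = context[-ngram:]
    # Scan right to left, skipping the trailing n-gram itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == tail:
            return context[start + ngram:start + ngram + max_draft]
    return []


def verify(context: List[str], draft: List[str],
           next_token: Callable[[List[str]], str]) -> List[str]:
    """Greedy verification against the target model.

    Keeps draft tokens while they match the target's own choice, then appends
    one token from the target. A real system would check all draft positions
    in one batched forward pass; sequential calls are used here for clarity.
    """
    accepted: List[str] = []
    for tok in draft:
        expected = next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(next_token(context + accepted))  # bonus token after a full match
    return accepted
```

Because `verify` only ever emits tokens the target model would have chosen itself, the output matches plain greedy decoding, which is the sense in which this style of speculation preserves accuracy. The reordering in `rewrite_prompt` carries no such guarantee on its own: it only lengthens the shared prefix, and whether a given reordering leaves model behaviour unchanged has to be checked separately.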