RelayLLM: Efficient Reasoning via Collaborative Decoding
This work targets the cost of AI reasoning in applications that need inexpensive yet accurate models. It proposes a new method rather than an incremental improvement to an existing one.
The paper tackles the high computational cost of using large language models (LLMs) for complex reasoning. It proposes RelayLLM, a token-level collaborative decoding framework in which a small language model (SLM) invokes an LLM only for critical tokens. RelayLLM achieves an average accuracy of 49.52% across six benchmarks, with a 98.2% cost reduction compared to performance-matched random routers.
Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity: they offload entire queries to the LLM, which wastes significant computation when the SLM can handle the majority of reasoning steps on its own. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of a warm-up stage followed by Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
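The relay mechanism described above can be pictured as a decoding loop in which the SLM generates by default and hands over to the LLM only when it emits the command. The following Python sketch illustrates that idea under stated assumptions and is not the paper's implementation: the command token `<call>`, the fixed hand-off length `relay_len`, and the scripted `toy_slm` and `toy_llm` functions are hypothetical stand-ins for real models and for whatever command format RelayLLM actually uses.

```python
from typing import Callable, List, Tuple

Token = str
NextTokenFn = Callable[[List[Token]], Token]

CALL = "<call>"  # hypothetical command token; the source only says "a special command"
EOS = "<eos>"    # hypothetical end-of-sequence token


def relay_decode(
    slm: NextTokenFn,
    llm: NextTokenFn,
    prompt: List[Token],
    relay_len: int = 1,
    max_tokens: int = 64,
) -> Tuple[List[Token], float]:
    """Generate with the SLM, relaying to the LLM whenever the SLM emits CALL.

    Returns the generated tokens and the fraction of them written by the LLM.
    """
    if relay_len < 1:
        raise ValueError("relay_len must be at least 1")

    output: List[Token] = []
    llm_tokens = 0
    done = False

    while not done and len(output) < max_tokens:
        token = slm(prompt + output)
        if token == EOS:
            break
        if token != CALL:
            # Common case: the SLM writes the token itself.
            output.append(token)
            continue
        # The SLM asked for help: the LLM writes up to relay_len tokens.
        # The CALL command itself is not kept in the visible output.
        for _ in range(relay_len):
            if len(output) >= max_tokens:
                break
            relayed = llm(prompt + output)
            if relayed == EOS:
                done = True  # assumption: an LLM end-of-sequence ends generation
                break
            output.append(relayed)
            llm_tokens += 1

    ratio = llm_tokens / len(output) if output else 0.0
    return output, ratio


# Scripted toy "models" so the sketch runs without any ML library.
def toy_slm(context: List[Token]) -> Token:
    script = {"=": "answer", "answer": "is", "is": CALL, "14": EOS}
    return script.get(context[-1], EOS)


def toy_llm(context: List[Token]) -> Token:
    return "14"  # stands in for the one "critical" token the SLM cannot produce


tokens, ratio = relay_decode(toy_slm, toy_llm, ["2", "+", "3", "*", "4", "="])
print(tokens, ratio)
```

In this toy run the LLM writes one of the three generated tokens. The returned ratio plays the same role as the 1.07% figure reported above, which measures the share of generated tokens produced by the LLM. How that share translates into a cost figure depends on the relative per-token cost of the two models, which this sketch does not model.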