CL, DC | Apr 10, 2025

Token Level Routing Inference System for Edge Devices

Jianshu She, Wenhao Zheng, Zhengzhong Liu, Hongyi Wang, Eric Xing, Huaxiu Yao, Qirong Ho

arXiv:2504.07878v1 | 4 citations | h-index: 7 | ACL

Originality: Incremental advance

AI Analysis

This paper addresses the problem of deploying efficient, high-quality AI on resource-constrained edge devices. It represents an incremental improvement in collaborative decoding methods.

The paper tackles the trade-off between computational efficiency and response quality for large language model inference on edge devices by proposing a collaborative decoding system where a small on-device model consults a cloud-based large model for critical tokens. It achieves a 60% performance gain on CommonsenseQA with a 0.5B model, uploading under 7% of tokens to the cloud.

The computational complexity of large language model (LLM) inference significantly constrains the efficiency with which LLMs can be deployed on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types: selective intervention by the large model enables high-quality inference, while the smaller model maintains speed and efficiency. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with the generation of under 7% of tokens uploaded to the large model in the cloud.
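The token-level routing idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how a token is judged "critical", so the confidence threshold used here is an assumption made for illustration, not the paper's routing rule. The function name `collaborative_decode`, the `(token, confidence)` model interface, and the two toy models are likewise hypothetical stand-ins for an on-device small model and a cloud-hosted large model.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps the tokens so far to (next_token, confidence).
Model = Callable[[List[str]], Tuple[str, float]]


def collaborative_decode(
    prompt: List[str],
    small: Model,
    large: Model,
    threshold: float = 0.5,   # assumed routing rule: low confidence means "critical"
    max_new_tokens: int = 32,
    eos: str = "<eos>",
) -> Tuple[List[str], int]:
    """Decode with the small model, routing low-confidence tokens to the large one.

    Returns the newly generated tokens and the number of routed decoding steps.
    """
    tokens = list(prompt)
    routed = 0
    for _ in range(max_new_tokens):
        token, confidence = small(tokens)
        if confidence < threshold:
            # In a real system this would be a network call to the cloud model.
            token, _ = large(tokens)
            routed += 1
        if token == eos:
            break
        tokens.append(token)
    return tokens[len(prompt):], routed


# Toy stand-ins: the small model is unsure (and wrong) at position 2 only.
def toy_small(prefix: List[str]) -> Tuple[str, float]:
    script = ["the", "sky", "green", "<eos>"]
    confidences = [0.9, 0.8, 0.2, 0.95]
    i = len(prefix)
    return script[i], confidences[i]


def toy_large(prefix: List[str]) -> Tuple[str, float]:
    script = ["the", "sky", "blue", "<eos>"]
    return script[len(prefix)], 0.99


new_tokens, routed_steps = collaborative_decode([], toy_small, toy_large)
print(new_tokens, routed_steps)
```

In this toy run only one of the four decoding steps is sent to the large model, which mirrors the abstract's point that a small fraction of routed tokens can correct the small model's output. A learned router or a different uncertainty signal could replace the threshold check without changing the loop structure.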
