LG · AI · Aug 18, 2025

Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding

arXiv:2508.12590v1 · 1 citation · h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses energy and communication efficiency for deploying LLMs in bandwidth-constrained edge environments, representing an incremental improvement over prior hybrid models.

The paper tackles energy-efficient on-device LLM inference in resource-constrained environments by proposing a token-level filtering mechanism that uploads only informative tokens, achieving a BERTScore of up to 87.5% and 40.7% energy savings compared to a standard hybrid language model (HLM).

To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLMs) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLMs have primarily focused on improving accuracy and latency, while often overlooking communication and energy efficiency. We propose a token-level filtering mechanism for energy-efficient, importance- and uncertainty-aware HLM inference that leverages both epistemic uncertainty and attention-based importance. Our method opportunistically uploads only informative tokens, reducing LLM usage and communication costs. Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves a BERTScore of up to 87.5% and a token throughput of 0.37 tokens/sec while reducing energy consumption by 40.7% compared to standard HLM. Furthermore, compared to our previous U-HLM baseline, our method improves BERTScore from 85.8% to 87.0%, energy savings from 31.6% to 43.6%, and throughput from 0.36 to 0.40 tokens/sec. This approach enables energy-efficient and accurate deployment of LLMs in bandwidth-constrained edge environments.
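The filtering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the decision rule, so the sketch assumes that a draft token is uploaded only when it is both uncertain and important. The names `DraftToken`, `should_upload`, and `filter_tokens`, the score ranges, and the 0.5 thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DraftToken:
    text: str
    uncertainty: float  # hypothetical epistemic-uncertainty score in [0, 1]
    importance: float   # hypothetical attention-based importance score in [0, 1]


def should_upload(token: DraftToken,
                  u_threshold: float = 0.5,
                  i_threshold: float = 0.5) -> bool:
    """Assumed rule: upload only tokens that are both uncertain and important."""
    return token.uncertainty > u_threshold and token.importance > i_threshold


def filter_tokens(tokens: List[DraftToken],
                  u_threshold: float = 0.5,
                  i_threshold: float = 0.5) -> Tuple[List[DraftToken], List[DraftToken]]:
    """Split draft tokens into those sent to the cloud LLM and those kept local."""
    uploaded, kept_local = [], []
    for token in tokens:
        if should_upload(token, u_threshold, i_threshold):
            uploaded.append(token)
        else:
            kept_local.append(token)
    return uploaded, kept_local


# Made-up scores, chosen only to exercise each branch of the rule.
tokens = [
    DraftToken("the", uncertainty=0.9, importance=0.1),      # uncertain, unimportant
    DraftToken("Paris", uncertainty=0.8, importance=0.7),    # uncertain and important
    DraftToken("is", uncertainty=0.1, importance=0.2),       # confident, unimportant
    DraftToken("capital", uncertainty=0.2, importance=0.9),  # confident, important
]
uploaded, kept_local = filter_tokens(tokens)
upload_fraction = len(uploaded) / len(tokens)
```

Under these assumptions only "Paris" is uploaded, so a quarter of the draft tokens incur uplink transmission and cloud LLM verification. Raising either threshold lowers that fraction further, presumably at some cost in output quality.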

