CVNov 21, 2024
Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge ModelZewei Xin, Qinya Li, Chaoyue Niu et al.
Large text-to-image models demonstrate impressive generation capabilities; however, their substantial size necessitates expensive cloud servers for deployment. Conversely, light-weight models can be deployed on edge devices at lower cost but often with inferior generation quality for complex user prompts. To strike a balance between performance and cost, we propose a routing framework, called RouteT2I, which dynamically selects either the large cloud model or the light-weight edge model for each user prompt. Since generated image quality is challenging to measure and compare directly, RouteT2I establishes multi-dimensional quality metrics, particularly, by evaluating the similarity between the generated images and both positive and negative texts that describe each specific quality metric. RouteT2I then predicts the expected quality of the generated images by identifying key tokens in the prompt and comparing their impact on the quality. RouteT2I further introduces the Pareto relative superiority to compare the multi-metric quality of the generated images. Based on this comparison and predefined cost constraints, RouteT2I allocates prompts to either the edge or the cloud. Evaluation reveals that RouteT2I significantly reduces the number of requesting large cloud model while maintaining high-quality image generation.
OSJun 24, 2025
MNN-AECS: Energy Optimization for LLM Decoding on Mobile Devices via Adaptive Core SelectionZhengxiang Huang, Chaoyue Niu, Zhaode Wang et al.
As the demand for on-device Large Language Model (LLM) inference grows, energy efficiency has become a major concern, especially for battery-limited mobile devices. Our analysis shows that the memory-bound LLM decode phase dominates energy use, and yet most existing works focus on accelerating the prefill phase, neglecting energy concerns. We introduce Adaptive Energy-Centric Core Selection (AECS) and integrate it into MNN to create the energy-efficient version, MNN-AECS, the first engine-level system solution without requiring root access or OS modifications for energy-efficient LLM decoding. MNN-AECS is designed to reduce LLM decoding energy while keeping decode speed within an acceptable slowdown threshold by dynamically selecting low-power CPU cores. MNN-AECS is evaluated across 5 Android and 2 iOS devices on 5 popular LLMs of various sizes. Compared to original MNN, MNN-AECS cuts down energy use by 23% without slowdown averaged over all 7 devices and 4 datasets. Against other engines, including llama.cpp, executorch, mllm, and MediaPipe, MNN-AECS delivers 39% to 78% energy saving and 12% to 363% speedup on average.