CV · AI · May 7

XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling

arXiv:2605.06927 · 50.1
Predicted impact: top 69% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners deploying object detection on resource-constrained edge devices, XiYOLO provides a practical method to optimize energy-accuracy tradeoffs with minimal hardware measurements.

XiYOLO introduces an energy-aware framework that combines a specialized search space, a two-stage energy estimator, and iterative search to design energy-efficient object detectors for heterogeneous edge devices. On PascalVOC, the medium model achieves 86.15 mAP50 while using 20.6% less energy on GPU and 35.9% less on NPU than YOLOv12m; on COCO, energy reductions relative to YOLOv12 reach up to 53.7% on GPU and 51.6% on NPU at the small scale.

Object detection on heterogeneous edge devices must satisfy strict energy, latency, and memory constraints while still providing reliable perception for downstream autonomy. Existing energy-aware NAS methods often target limited deployment settings, while real energy remains difficult to optimize because it is highly device-dependent and costly to measure. We address these challenges with an energy-adaptive framework that combines an energy-aware XiResOFA search space, a two-stage energy estimator, and iterative search to identify a single energy-efficient base architecture. We then apply compound scaling to transform this base design into the XiYOLO family across deployment budgets, enabling interpretable accuracy-energy tradeoffs under sparse hardware measurements. Experiments on PascalVOC, COCO, and real-device deployment show that XiYOLO achieves a stronger energy-accuracy tradeoff than YOLO baselines. On PascalVOC, the medium XiYOLO model reaches 86.15 mAP50 while reducing energy relative to YOLOv12m by 20.6% on GPU and 35.9% on NPU. On COCO, XiYOLO reduces energy relative to YOLOv12 by up to 53.7% on GPU and 51.6% on NPU at the small scale. The proposed two-stage estimator also improves sample efficiency over a joint predictor under few-shot adaptation with only 2-20 target-device samples.
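To make the few-shot adaptation idea concrete, here is a minimal sketch of one way a two-stage estimator could be organized. It is an illustration under stated assumptions, not the paper's estimator: the abstract does not specify either stage, so the linear proxy in `proxy_energy`, the affine per-device calibration in `fit_device_calibration`, and all numbers are hypothetical. Stage 1 produces a device-agnostic score, and stage 2 fits only two parameters from a handful of target-device measurements, which is why so few samples can suffice in a setup like this.

```python
# Hypothetical sketch: function names, features, weights, and measurements are
# illustrative assumptions, not the estimator described in the paper.


def proxy_energy(gflops, params_m):
    """Stage 1: device-agnostic proxy score (assumed linear in compute and size)."""
    return 0.8 * gflops + 0.2 * params_m


def fit_device_calibration(proxies, measured):
    """Stage 2: fit measured ~= a * proxy + b by ordinary least squares."""
    n = len(proxies)
    if n < 2:
        raise ValueError("need at least two target-device samples")
    mean_x = sum(proxies) / n
    mean_y = sum(measured) / n
    var_x = sum((x - mean_x) ** 2 for x in proxies)
    if var_x == 0:
        raise ValueError("proxy values must not all be equal")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(proxies, measured))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def predict_energy(arch, calibration):
    """Apply the per-device calibration to the stage-1 proxy of an architecture."""
    a, b = calibration
    return a * proxy_energy(*arch) + b


# Three made-up architectures as (GFLOPs, parameters in millions), with made-up
# energy measurements (mJ per inference) on a single target device.
samples = [(10.0, 5.0), (20.0, 10.0), (40.0, 20.0)]
measured_mj = [23.0, 41.0, 77.0]

calibration = fit_device_calibration(
    [proxy_energy(*s) for s in samples], measured_mj
)
estimate = predict_energy((30.0, 15.0), calibration)
print(f"slope={calibration[0]:.3f} offset={calibration[1]:.3f} estimate={estimate:.3f} mJ")
```

A joint predictor, by contrast, would have to learn the architecture-to-energy mapping and the device behaviour together from the same few samples. Splitting the problem lets the first stage be trained once and reused, with only the small calibration step repeated per device.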

