LG Nov 28, 2025

Energy-Efficient Vision Transformer Inference for Edge-AI Deployment

arXiv:2511.23166v1

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for energy-efficient AI deployment on resource-constrained devices, but its contribution is incremental because it builds on existing metrics and models.

The paper evaluates the energy efficiency of Vision Transformers (ViTs) on edge devices with a two-stage pipeline that benchmarks models on ImageNet-1K and CIFAR-10. It shows that hybrid models such as LeViT_Conv_192 reduce energy by up to 53% on an NVIDIA Jetson TX2 relative to a ViT baseline.

The growing deployment of Vision Transformers (ViTs) on energy-constrained devices requires evaluation methods that go beyond accuracy alone. We present a two-stage pipeline for assessing ViT energy efficiency that combines device-agnostic model selection with device-related measurements. We benchmark 13 ViT models on ImageNet-1K and CIFAR-10, running inference on an NVIDIA Jetson TX2 (edge device) and an NVIDIA RTX 3050 (mobile GPU). The device-agnostic stage uses the NetScore metric for screening; the device-related stage ranks models with the Sustainable Accuracy Metric (SAM). Results show that hybrid models such as LeViT_Conv_192 reduce energy by up to 53% on TX2 relative to a ViT baseline (e.g., SAM5=1.44 on TX2/CIFAR-10), while distilled models such as TinyViT-11M_Distilled excel on the mobile GPU (e.g., SAM5=1.72 on RTX 3050/CIFAR-10 and SAM5=0.76 on RTX 3050/ImageNet-1K).
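The device-agnostic screening stage can be illustrated with a minimal sketch. It assumes the commonly cited NetScore form, 20 * log10(a^alpha / (p^beta * m^gamma)) with alpha=2, beta=0.5, gamma=0.5, where a is top-1 accuracy, p the parameter count, and m the multiply-accumulate count. The abstract states neither the formula nor the units, so both are assumptions here, as are the names Candidate, screen, and energy_reduction, which are hypothetical helpers rather than the authors' code. The SAM ranking of the second stage is not reproduced because its definition is not given in this summary; only the relative energy reduction behind a figure like "up to 53%" is shown.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for the device-agnostic properties of one model."""
    name: str
    top1_acc: float   # top-1 accuracy in percent
    params_m: float   # parameter count in millions (unit is an assumption)
    macs_m: float     # multiply-accumulates in millions (unit is an assumption)


def netscore(c, alpha=2.0, beta=0.5, gamma=0.5):
    """Assumed NetScore form: 20 * log10(a^alpha / (p^beta * m^gamma))."""
    ratio = c.top1_acc ** alpha / (c.params_m ** beta * c.macs_m ** gamma)
    return 20.0 * math.log10(ratio)


def screen(candidates, keep):
    """Stage 1: keep the `keep` candidates with the highest NetScore."""
    return sorted(candidates, key=netscore, reverse=True)[:keep]


def energy_reduction(model_joules, baseline_joules):
    """Fractional energy saving of a model relative to a baseline."""
    return 1.0 - model_joules / baseline_joules


if __name__ == "__main__":
    # Placeholder values for illustration only, not results from the paper.
    pool = [
        Candidate("model_a", 80.0, 10.0, 1000.0),
        Candidate("model_b", 80.0, 40.0, 4000.0),
        Candidate("model_c", 90.0, 100.0, 10000.0),
    ]
    for c in screen(pool, keep=2):
        print(f"{c.name}: NetScore = {netscore(c):.2f}")
    # A model using 47 J where the baseline uses 100 J saves 53%.
    print(f"reduction = {energy_reduction(47.0, 100.0):.0%}")
```

Models that survive screening would then be measured on the target device (here a Jetson TX2 or an RTX 3050), which is where the device-related ranking takes over.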

