CV · AI · Apr 4, 2025

NuWa: Deriving Lightweight Task-Specific Vision Transformers for Edge Devices

arXiv:2504.03118v1 · 1 citation · h-index: 36
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, accurate vision models on resource-constrained edge devices. It is an incremental improvement in model optimization.

The paper tackles the problem that Vision Transformers (ViTs) are over-qualified for edge devices. It proposes NuWa, which derives small, task-specific ViTs from base models. Compared with state-of-the-art solutions, NuWa improves accuracy by up to 11.83% and accelerates inference by 1.29× to 2.79×.

Vision Transformers (ViTs) excel in computer vision tasks but lack flexibility for edge devices' diverse needs. A key issue is that ViTs pre-trained to cover a broad range of tasks are over-qualified for edge devices, which usually demand only part of a ViT's knowledge for specific tasks. As a result, their task-specific accuracy on these edge devices is suboptimal. We discovered that small ViTs that focus on device-specific tasks can improve model accuracy and, at the same time, accelerate model inference. This paper presents NuWa, an approach that derives small ViTs from the base ViT for edge devices with specific task requirements. NuWa transfers task-specific knowledge extracted from the base ViT into small ViTs that fully leverage the constrained resources on edge devices to maximize model accuracy with inference latency assurance. Experiments with three base ViTs on three public datasets demonstrate that, compared with state-of-the-art solutions, NuWa improves model accuracy by up to 11.83% and accelerates model inference by 1.29× to 2.79×. Code for reproduction is available at https://anonymous.4open.science/r/Task_Specific-3A5E.

