CV | AI | Jun 7, 2024

Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors

arXiv:2406.04820v1
Originality: Incremental advance
AI Analysis

This work addresses efficiency problems for mobile and embedded vision applications, but it is incremental as it builds on existing MobileViT methods.

The paper tackles the computational inefficiency of vision transformers (ViTs) by using Gaussian processes to optimize the global architecture factors of MobileViT. The resulting smaller models are reported to achieve higher accuracy at reduced computational cost, outperforming CNNs and mobile ViTs across diverse datasets.

Numerous techniques have been meticulously designed to achieve optimal architectures for convolutional neural networks (CNNs), yet a comparable focus on vision transformers (ViTs) has been somewhat lacking. Despite the remarkable success of ViTs in various vision tasks, their heavyweight nature leads to high computational costs. In this paper, we leverage the Gaussian process to systematically explore the nonlinear and uncertain relationship between performance and the global architecture factors of MobileViT: resolution, width, and depth (including the depth of inverted residual blocks and the depth of ViT blocks), as well as joint factors such as resolution-depth and resolution-width. We present design principles for twisting the magic 4D cube of the global architecture factors, which minimize model size and computational cost while achieving higher model accuracy. We introduce a formula for downsizing architectures by iteratively deriving smaller MobileViT V2 models, all while adhering to a specified constraint on multiply-accumulate operations (MACs). Experimental results show that our formula significantly outperforms CNNs and mobile ViTs across diverse datasets.
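To make the idea of modelling accuracy as a function of architecture factors more concrete, here is a minimal sketch of Gaussian process regression in NumPy. It is not the paper's method or formula. The three normalized factors, the `toy_accuracy` function, the `mac_proxy` cost (which borrows the common CNN scaling heuristic of cost growing with depth times width squared times resolution squared), the budget, and all hyperparameters are assumptions made purely for illustration, and the data are synthetic.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-4, length_scale=1.0):
    """Posterior mean and standard deviation of a GP with constant mean."""
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train, length_scale)
    chol = np.linalg.cholesky(k_tt)
    # Solve (K + noise * I) alpha = y - mean via the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - y_mean))
    mean = k_qt @ alpha + y_mean
    v = np.linalg.solve(chol, k_qt.T)
    var = rbf_kernel(x_query, x_query, length_scale).diagonal() - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))


def toy_accuracy(x):
    """Hypothetical, synthetic accuracy surface (NOT measured results)."""
    return 0.6 + 0.3 * (1.0 - np.exp(-3.0 * x.prod(axis=1)))


def mac_proxy(x):
    """Hypothetical relative cost: resolution^2 * width^2 * depth."""
    return x[:, 0] ** 2 * x[:, 1] ** 2 * x[:, 2]


rng = np.random.default_rng(0)

# Columns: normalized [resolution, width, depth] multipliers (assumed encoding).
x_train = rng.uniform(0.5, 1.0, size=(12, 3))
y_train = toy_accuracy(x_train)

# Score random candidate configurations with the GP surrogate.
candidates = rng.uniform(0.5, 1.0, size=(200, 3))
mean, std = gp_posterior(x_train, y_train, candidates, length_scale=0.5)

# Keep only candidates under the (assumed) relative MAC budget,
# then pick the one with the highest predicted accuracy.
budget = 0.25
feasible = mac_proxy(candidates) <= budget
best = candidates[feasible][np.argmax(mean[feasible])]

print("selected [resolution, width, depth] multipliers:", best)
```

The posterior standard deviation `std` is what distinguishes this surrogate from a plain curve fit: it indicates where the model is uncertain about accuracy, which is the kind of information an exploration of the factor space can use when deciding which configuration to train next.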

