CV · Mar 7, 2024

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

arXiv:2403.04908v3 · 13 citations · h-index: 6 · ECCV
Originality: Highly original
AI Analysis

This work addresses the challenge of efficient edge deployment for vision-language models. The contribution is incremental in that it builds on existing large models such as CLIP.

The paper tackles the problem of deploying large vision-language models on edge devices by introducing EdgeVL, a framework that adapts models like CLIP for efficient use with RGB and non-RGB images without manual annotations, resulting in up to 15.4% accuracy improvements and up to 93-fold model size reduction.

Recent advancements in Vision-Language (VL) models have sparked interest in their deployment on edge devices, yet challenges in handling diverse visual modalities, manual annotation, and computational constraints remain. We introduce EdgeVL, a novel framework that bridges this gap by seamlessly integrating dual-modality knowledge distillation and quantization-aware contrastive learning. This approach enables the adaptation of large VL models, like CLIP, for efficient use with both RGB and non-RGB images on resource-limited devices without the need for manual annotations. EdgeVL not only transfers visual language alignment capabilities to compact models but also maintains feature quality post-quantization, significantly enhancing open-vocabulary classification performance across various visual modalities. Our work represents the first systematic effort to adapt large VL models for edge deployment, showcasing up to 15.4% accuracy improvements on multiple datasets and up to 93-fold reduction in model size.
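
As a rough illustration of the two ingredients named in the abstract, the sketch below shows a label-free distillation loss in which a student's RGB and non-RGB embeddings are both pulled toward a teacher's RGB embedding, together with a simulated ("fake") int8 quantization step. This is not EdgeVL's actual implementation: the function names, the cosine-distance loss, the symmetric 8-bit quantizer, and the toy vectors are all assumptions made for illustration, and the paper's contrastive objective is not reproduced here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the two vectors point the same way."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def dual_modality_distill_loss(teacher_rgb, student_rgb, student_other):
    """Hypothetical label-free loss: both student embeddings (RGB and the
    paired non-RGB image) are matched to the teacher's RGB embedding."""
    return (cosine_distance(teacher_rgb, student_rgb)
            + cosine_distance(teacher_rgb, student_other))

def fake_quantize(v, num_bits=8):
    """Simulate symmetric uniform quantization: snap each value to an
    integer grid, then map it back to floating point."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(x) for x in v)
    if max_abs == 0:
        return list(v)
    scale = max_abs / qmax
    return [round(x / scale) * scale for x in v]

# Toy embeddings (made up): one teacher vector, two student vectors.
teacher_rgb = [0.9, 0.1, -0.4]
student_rgb = [0.8, 0.2, -0.5]
student_depth = [0.1, 0.9, 0.3]

loss_fp = dual_modality_distill_loss(teacher_rgb, student_rgb, student_depth)
loss_q = dual_modality_distill_loss(
    teacher_rgb, fake_quantize(student_rgb), fake_quantize(student_depth)
)
print(f"full-precision loss: {loss_fp:.4f}")
print(f"fake-quantized loss: {loss_q:.4f}")
```

Comparing the two printed losses gives a feel for how much quantization perturbs the student features; a quantization-aware training scheme would keep that gap small by exposing the model to the rounding during training.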

Code Implementations: 1 repo
