CV · Mar 22, 2025

Visual Variational Autoencoder Prompt Tuning

arXiv:2503.17650v1 · 8 citations · h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the need for more adaptive and efficient fine-tuning of large vision transformers on downstream tasks. It is an incremental advance in parameter-efficient fine-tuning methods.

The paper tackles the problem of static, domain-specific prompts in visual prompt tuning by introducing V$^2$APT, a framework that uses a variational autoencoder to generate dynamic, input-dependent prompts. It reports a +3.2% improvement over VPT-Deep on HTA and an average gain of +2.0% across the FGVC, HTA, and VTAB-1k benchmarks.

Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational costs of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V$^2$APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding them into customized prompts, V$^2$APT adapts to the unique visual characteristics of each input. Extensive experiments on FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V$^2$APT achieves +3.2% improvement over VPT-Deep on HTA, with an average performance gain of +2.0% across all three datasets.
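
The idea in the abstract (encode image-specific features into a latent variable, then decode that latent into prompts) can be illustrated with a small sketch. This is not the authors' implementation. The class `VAEPromptGenerator`, the helper `prepend_prompts`, the mean pooling of patch tokens, the single linear encoder and decoder, and all dimensions are assumptions chosen for illustration. The weights are random and untrained, so the sketch only shows the data flow and the standard VAE reparameterization and KL term.

```python
import math
import random


def linear(weights, bias, x):
    """Return weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_linear(rng, out_dim, in_dim, scale=0.1):
    """Random (untrained) linear layer; stands in for learned parameters."""
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class VAEPromptGenerator:
    """Hypothetical sketch: patch tokens -> latent z -> input-dependent prompts."""

    def __init__(self, embed_dim, latent_dim, num_prompts, seed=0):
        self.rng = random.Random(seed)
        self.embed_dim = embed_dim
        self.num_prompts = num_prompts
        self.enc_mu = init_linear(self.rng, latent_dim, embed_dim)
        self.enc_logvar = init_linear(self.rng, latent_dim, embed_dim)
        self.dec = init_linear(self.rng, num_prompts * embed_dim, latent_dim)

    def __call__(self, tokens):
        # Pool patch tokens into one feature vector (mean pooling is an assumption).
        n = len(tokens)
        pooled = [sum(t[d] for t in tokens) / n for d in range(self.embed_dim)]
        mu = linear(*self.enc_mu, pooled)
        logvar = linear(*self.enc_logvar, pooled)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).
        z = [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, logvar)]
        # Decode the latent into num_prompts vectors of size embed_dim.
        flat = linear(*self.dec, z)
        prompts = [flat[i * self.embed_dim:(i + 1) * self.embed_dim]
                   for i in range(self.num_prompts)]
        # KL(N(mu, sigma^2) || N(0, I)), the usual VAE regularizer.
        kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
        return prompts, kl


def prepend_prompts(prompts, tokens):
    """Place generated prompts in front of the patch tokens, as in prompt tuning."""
    return prompts + tokens


# Two toy "images", each a sequence of 16 patch tokens of width 8.
data_rng = random.Random(1)
image_a = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
image_b = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]

generator = VAEPromptGenerator(embed_dim=8, latent_dim=4, num_prompts=3)
prompts_a, kl_a = generator(image_a)
prompts_b, kl_b = generator(image_b)
sequence_a = prepend_prompts(prompts_a, image_a)
```

Unlike a static prompt table, the prompts here are a function of the input, so `prompts_a` and `prompts_b` differ. In a trained system the KL term would typically be added to the task loss with some weight. Whether V$^2$APT does so, and at which transformer layers it inserts prompts, is not stated in the abstract.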

