CV | Sep 4, 2024

iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation

Hayeon Jo, Hyesong Choi, Minhee Cho, Dongbo Min

arXiv:2409.02838v2 | 6.53 citations | h-index: 7

Originality: Incremental advance

AI Analysis

This paper addresses the efficient and flexible adaptation of large pre-trained models to diverse downstream computer vision tasks, offering an incremental improvement over existing PEFT methods.

The paper tackles the inflexibility of parameter-efficient fine-tuning (PEFT) adapters by proposing iConFormer, a dynamic adapter conditioned on input instances, which achieves performance comparable to full fine-tuning in tasks like depth estimation and segmentation while tuning only 1.6% to 2.8% of parameters.

Transfer learning based on full fine-tuning (FFT) of the pre-trained encoder and task-specific decoder becomes increasingly complex as deep models grow exponentially. Parameter-efficient fine-tuning (PEFT) approaches using adapters consisting of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability of learning task-specific information in diverse downstream tasks. In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer, that leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability on input instances in various downstream tasks, we introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation. Specifically, iCoN generates channel-wise convolutional kernels for each feature and transforms it using an adaptive convolution process to effectively capture task-specific, fine-grained details tailored to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. The proposed method also consistently outperforms recent PEFT methods on all of the tasks mentioned above.
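
The idea of an input-conditioned adapter can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the bottleneck layout (down-projection, dynamic depthwise convolution, ReLU, up-projection, residual), the kernel generator (global average pooling followed by one linear layer), and all names (`generate_kernels`, `depthwise_conv`, `dynamic_adapter`, `w_down`, `w_gen`, `w_up`) are assumptions made for illustration. Only the general mechanism, channel-wise kernels generated per input and applied by convolution, comes from the abstract.

```python
import numpy as np


def generate_kernels(feat, w_gen, k):
    """Produce one k x k kernel per channel, conditioned on this input.

    feat:  (C, H, W) feature map of a single instance.
    w_gen: (C * k * k, C) weights of a hypothetical linear generator.
    Assumption: the generator sees only a globally pooled descriptor.
    """
    c = feat.shape[0]
    pooled = feat.mean(axis=(1, 2))            # (C,) global average pooling
    return (w_gen @ pooled).reshape(c, k, k)   # (C, k, k)


def depthwise_conv(feat, kernels):
    """Apply a different kernel to each channel (zero padding, stride 1)."""
    _, h, w = feat.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(feat)
    for dy in range(k):
        for dx in range(k):
            # Weight each shifted view by the matching kernel tap per channel.
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def dynamic_adapter(x, w_down, w_gen, w_up, k=3):
    """Bottleneck adapter whose spatial filter depends on the input x.

    x: (C, H, W); w_down: (r, C); w_gen: (r * k * k, r); w_up: (C, r).
    """
    z = np.einsum("rc,chw->rhw", w_down, x)        # project to r channels
    kernels = generate_kernels(z, w_gen, k)        # instance-specific kernels
    z = np.maximum(depthwise_conv(z, kernels), 0)  # adaptive conv + ReLU
    return x + np.einsum("cr,rhw->chw", w_up, z)   # project back, add residual


# Usage with random weights; shapes are arbitrary illustrative choices.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))
out = dynamic_adapter(
    x,
    w_down=rng.standard_normal((2, 8)),
    w_gen=rng.standard_normal((2 * 3 * 3, 2)),
    w_up=rng.standard_normal((8, 2)),
)
```

In this sketch only `w_down`, `w_gen`, and `w_up` would be trained while the backbone stays frozen, which is how an adapter keeps the tuned fraction of parameters small. Unlike a static adapter, the spatial filter changes with every input because `kernels` is recomputed from `z`.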
