M2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension
This work addresses efficiency challenges faced by researchers and practitioners working on vision-language tasks, though its contribution is incremental, since it builds on existing parameter-efficient transfer learning methods.
The paper tackles the high computational cost of fine-tuning pre-trained vision-language models for referring expression comprehension. It proposes M2IST, a parameter-efficient method that remains competitive with full fine-tuning while using only 2.11% of the tunable parameters, 39.61% of the GPU memory, and 63.46% of the training time.
Referring expression comprehension (REC) is a vision-language task that locates a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M2IST: Multi-Modal Interactive Side-Tuning with M3ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we fix the pre-trained uni-modal encoders and update M3ISAs to enable efficient vision-language alignment for REC. Empirical results reveal that M2IST achieves a better performance-efficiency trade-off than full fine-tuning and other PETL methods, requiring only 2.11% tunable parameters, 39.61% GPU memory, and 63.46% training time while maintaining competitive performance. Our code is released at https://github.com/xuyang-liu16/M2IST.
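To make the idea of a small tunable-parameter fraction concrete, here is a rough sketch that counts the parameters of one frozen Transformer encoder layer and of a generic bottleneck adapter placed beside it. It is not the paper's M3ISA design. The function names and all sizes (hidden size 768, feed-forward size 3072, bottleneck width 64) are illustrative assumptions, so the resulting fraction should not be expected to match the reported 2.11%.

```python
def transformer_layer_params(d_model: int, d_ffn: int) -> int:
    """Weights and biases of one standard Transformer encoder layer."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    ffn = (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, each with scale and shift
    return attention + ffn + layer_norms


def bottleneck_adapter_params(d_model: int, d_bottleneck: int) -> int:
    """Weights and biases of a generic down-project / up-project adapter."""
    down = d_model * d_bottleneck + d_bottleneck
    up = d_bottleneck * d_model + d_model
    return down + up


def tunable_fraction(frozen: int, tunable: int) -> float:
    """Share of all parameters that actually receive gradient updates."""
    return tunable / (frozen + tunable)


# Hypothetical sizes, chosen only for illustration.
frozen = transformer_layer_params(d_model=768, d_ffn=3072)
tunable = bottleneck_adapter_params(d_model=768, d_bottleneck=64)
print(f"frozen={frozen}, tunable={tunable}, "
      f"fraction={tunable_fraction(frozen, tunable):.2%}")
```

Because the encoder layer is frozen, only the adapter's parameters need gradients and optimizer state. In a side-tuning arrangement the adapters additionally sit outside the backbone, so gradients need not pass through the heavy encoders, which is the memory saving the abstract refers to.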