WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval
This work addresses overfitting when fine-tuning vision-language models for composed image retrieval, an incremental improvement for limited-data scenarios.
The paper tackles overfitting in composed image retrieval (CIR) with limited triplet data by introducing WRF4CIR, a weight-regularized fine-tuning network that applies adversarial perturbations to the model weights during fine-tuning. The authors report that this significantly narrows the generalization gap and yields substantial improvements over existing methods on benchmark datasets.
The Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained (VLP) models. However, we find that these approaches commonly suffer from severe overfitting, which poses a challenge for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during fine-tuning, we apply adversarial perturbations to the model weights for regularization; these perturbations are generated in the direction opposite to gradient descent, that is, the direction that increases the training loss. Intuitively, WRF4CIR makes the training data harder to fit, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
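The abstract does not spell out the exact perturbation rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a common adversarial-weight-perturbation recipe: perturb the weights along the loss gradient (the ascent direction), rescale the perturbation relative to the weight norm, compute the gradient at the perturbed weights, and apply that gradient to the original weights. A toy linear regression stands in for the VLP model, and all names (`gamma`, `adversarial_perturbation`, `perturbed_step`) are hypothetical.

```python
import math


def mse_loss(w, xs, ys):
    """Mean squared error of the linear model w . x on a toy dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(xs)


def mse_grad(w, xs, ys):
    """Gradient of mse_loss with respect to the weights w."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * residual * xi / len(xs)
    return grad


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def adversarial_perturbation(w, grad, gamma, eps=1e-12):
    """Perturbation along the gradient (loss-increasing) direction.

    Assumption: the perturbation norm is gamma * ||w||; the abstract
    does not state how the perturbation is scaled.
    """
    scale = gamma * (norm(w) + eps) / (norm(grad) + eps)
    return [scale * g for g in grad]


def perturbed_step(w, xs, ys, lr=0.05, gamma=0.01):
    """One descent step using the gradient taken at the perturbed weights."""
    v = adversarial_perturbation(w, mse_grad(w, xs, ys), gamma)
    w_adv = [wi + vi for wi, vi in zip(w, v)]   # harder-to-fit weights
    grad_adv = mse_grad(w_adv, xs, ys)          # gradient at perturbed point
    # The update is applied to the original weights, so the perturbation
    # itself is discarded after the step.
    return [wi - lr * gi for wi, gi in zip(w, grad_adv)]


# Toy usage: targets follow y = 2*x1 - x2.
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [2.0, -1.0, 1.0, 3.0]
w = [0.0, 0.0]
for _ in range(200):
    w = perturbed_step(w, xs, ys)
print(w, mse_loss(w, xs, ys))
```

In a real fine-tuning setup the same two-pass pattern (one backward pass to build the perturbation, one to compute the update) would roughly double the cost per step; whether WRF4CIR perturbs all weights or only a subset is not stated in the abstract.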