LG | CV | Dec 28, 2025

Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models

arXiv:2512.23073v1 | h-index: 13 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently adapting VLMs for downstream tasks, offering a novel approach that leverages existing model structures, though it appears incremental as it builds on prior MFT work for language models.

The paper tackles the problem of fine-tuning Vision-Language Models (VLMs) by proposing Mask Fine-Tuning (MFT), which uses learnable gating scores to reorganize internal subnetworks instead of updating weights, resulting in consistent performance improvements over methods like LoRA and full fine-tuning.

Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model's existing knowledge. Code available at: https://github.com/Ming-K9/MFT-VLM
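The gating idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hard threshold on the scores and a straight-through gradient estimator (the threshold is treated as the identity when computing gradients), neither of which the abstract specifies. All function names, the learning rate, and the toy data are hypothetical. It shows only the core mechanism: the weights stay frozen while per-weight scores are trained, and the resulting binary mask selects a subnetwork.

```python
def binary_mask(scores, threshold=0.0):
    """Keep a weight (mask = 1) only if its score exceeds the threshold."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]


def masked_forward(weights, scores, x, threshold=0.0):
    """Compute y = (W * M) x, where M is the binary mask derived from the scores."""
    mask = binary_mask(scores, threshold)
    return [
        sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


def train_scores(weights, scores, data, lr=0.5, epochs=5, threshold=0.0):
    """Train only the scores with per-sample SGD on a squared-error loss.

    Assumption: straight-through estimator, i.e. dM/dS is treated as 1, so
    dL/dS[i][j] = (y[i] - target[i]) * W[i][j] * x[j].
    The weights are read but never modified.
    """
    scores = [row[:] for row in scores]  # copy so the caller's scores are untouched
    for _ in range(epochs):
        for x, target in data:
            y = masked_forward(weights, scores, x, threshold)
            for i, (yi, ti) in enumerate(zip(y, target)):
                err = yi - ti
                for j, xj in enumerate(x):
                    scores[i][j] -= lr * err * weights[i][j] * xj
    return scores


# Toy task: the frozen layer computes x0 + x1, but the target is just x0.
frozen_weights = [[1.0, 1.0]]
initial_scores = [[1.0, 1.0]]  # every weight starts unmasked
data = [
    ([1.0, 0.0], [1.0]),
    ([0.0, 1.0], [0.0]),
    ([1.0, 1.0], [1.0]),
]

learned_scores = train_scores(frozen_weights, initial_scores, data)
learned_mask = binary_mask(learned_scores)
```

In this toy setting the learned mask drops the second weight, so the layer fits the target without any change to `frozen_weights`. A real VLM setup would apply the same idea per layer to the language and projector components, typically with an autograd framework rather than hand-written gradients.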

