LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
LinguDistill addresses modality-specific degradation in VLMs, offering researchers and practitioners an efficient fix that adds no architectural complexity. The contribution is incremental, since it builds on existing distillation techniques.
The paper tackles the problem of linguistic capability degradation in vision-language models (VLMs) due to multimodal adaptation, proposing LinguDistill, an adapter-free distillation method that recovers about 10% of lost performance on language and knowledge benchmarks while maintaining performance on vision-heavy tasks.
Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
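To make the idea of selective distillation more concrete, the following is a minimal sketch of a per-example loss that applies teacher supervision only on language-intensive data. It is not the paper's implementation: the abstract does not give the loss formulation, so the function names, the `is_language_intensive` flag, the mixing weight `alpha`, and the use of a forward KL term are all assumptions made for illustration. Layer-wise KV-cache sharing is not modelled here; the teacher logits are simply taken as given.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, target, eps=1e-12):
    """Negative log-likelihood of the gold token index."""
    return -math.log(probs[target] + eps)


def selective_distill_loss(student_logits, teacher_logits, target,
                           is_language_intensive, alpha=0.5, temperature=1.0):
    """Hypothetical per-token loss (assumed form, not taken from the paper).

    Vision-heavy examples use the plain task loss, so the student's visual
    grounding is not pulled toward the text-only teacher. Language-intensive
    examples mix the task loss with a KL term toward the frozen teacher.
    """
    task_loss = cross_entropy(softmax(student_logits), target)
    if not is_language_intensive:
        return task_loss
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    distill_loss = kl_divergence(teacher_probs, student_probs)
    return (1.0 - alpha) * task_loss + alpha * distill_loss


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teacher = [0.5, 2.0, -1.0]
    print(selective_distill_loss(student, teacher, 1, True))
    print(selective_distill_loss(student, teacher, 1, False))
```

In this sketch the gating is a hard per-example switch; a soft weighting by how language-heavy each example is would be an equally plausible reading of "selectively distill".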