CV Mar 3, 2025

SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces

Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu

arXiv:2503.01754v38.43 citations h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the problem of improving visual reasoning for VLMs, representing an incremental advancement through novel architectural and training enhancements.

The paper tackles the challenge of enhancing reasoning capabilities in Vision-Language Models (VLMs) by proposing a self-distillation framework with diverse reasoning traces, resulting in significant performance improvements across five VQA datasets.

Reasoning is increasingly crucial for various tasks. While chain-of-thought prompting enables large language models to leverage reasoning effectively, harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains challenging. To solve this problem, we propose a novel self-distillation framework that enhances the reasoning capabilities of the model. The proposed framework introduces several key innovations. We start by employing a prompt library tailored to visual reasoning tasks to generate diverse in-context questions and utilize a two-step reasoning procedure to derive reasoning-guided responses. These responses are then used for self-distillation, enabling the model to internalize the reasoning process. Additionally, we improve the model architecture with several innovative components, including an intervention adapter for efficient parameter updates, a cross-modal skip connection to facilitate information exchange between modalities, and an ensemble learning algorithm to integrate diverse reasoning from multiple in-context questions. Extensive experiments show that our method significantly improves the baseline performance across five VQA datasets.
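The data-generation side of the abstract (prompt library, two-step reasoning, collecting responses for self-distillation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompts in `PROMPT_LIBRARY`, the `model(image, text)` callable, and all function names are hypothetical, and the simple majority vote stands in for the paper's ensemble learning algorithm, which the abstract does not specify.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical prompt library; the paper's actual prompts are not listed in the abstract.
PROMPT_LIBRARY: List[str] = [
    "Describe the objects relevant to the question.",
    "Explain the spatial relations relevant to the question.",
    "List the visual evidence step by step.",
]

# Assumed interface: a VLM is any callable taking (image, text) and returning text.
Model = Callable[[Any, str], str]


def two_step_answer(model: Model, image: Any, question: str, prompt: str) -> Tuple[str, str]:
    """Step 1 elicits a reasoning trace; step 2 answers conditioned on that trace."""
    trace = model(image, f"{prompt}\nQuestion: {question}")
    answer = model(image, f"Question: {question}\nReasoning: {trace}\nAnswer:")
    return trace, answer


def build_distillation_examples(model: Model, image: Any, question: str) -> List[Dict[str, str]]:
    """Produce one reasoning-guided response per prompt in the library.

    Each record pairs the plain question with the reasoning-guided answer,
    so the same model can later be fine-tuned to give that answer directly.
    """
    examples = []
    for prompt in PROMPT_LIBRARY:
        trace, answer = two_step_answer(model, image, question, prompt)
        examples.append(
            {"question": question, "prompt": prompt, "trace": trace, "target": answer}
        )
    return examples


def majority_vote(examples: List[Dict[str, str]]) -> str:
    """Stand-in for the ensemble step: pick the most frequent target answer."""
    counts = Counter(example["target"] for example in examples)
    return counts.most_common(1)[0][0]
```

In this sketch, the `target` field of each record would serve as the supervision signal for fine-tuning the same model on the plain `question`, which is the sense in which the distillation is "self" distillation. The intervention adapter and the cross-modal skip connection are architectural changes and are not represented here.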
