SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces
This work addresses visual reasoning in VLMs and represents an incremental advance that combines architectural changes with a new training procedure.
The paper enhances the reasoning capabilities of Vision-Language Models (VLMs) with a self-distillation framework built on diverse reasoning traces, and reports significant improvements over the baseline on five VQA datasets.
Reasoning is increasingly crucial for various tasks. While chain-of-thought prompting enables large language models to leverage reasoning effectively, harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains challenging. To address this, we propose a novel self-distillation framework that enhances the reasoning capabilities of the model. We first employ a prompt library tailored to visual reasoning tasks to generate diverse in-context questions, then use a two-step reasoning procedure to derive reasoning-guided responses. These responses are used for self-distillation, enabling the model to internalize the reasoning process. We also extend the model architecture with three components: an intervention adapter for efficient parameter updates, a cross-modal skip connection to facilitate information exchange between modalities, and an ensemble learning algorithm to integrate diverse reasoning from multiple in-context questions. Extensive experiments show that our method significantly improves on baseline performance across five VQA datasets.
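The data-generation side of the framework (prompt library, two-step reasoning, self-distillation pairs) might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation: the templates in PROMPT_LIBRARY, all function names, and toy_model are hypothetical, and a plain majority vote stands in for the paper's ensemble learning algorithm, whose details the abstract does not give. The intervention adapter and the cross-modal skip connection are not modeled.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical templates; the paper's actual prompt library is not given here.
PROMPT_LIBRARY = [
    "Describe the objects relevant to this question: {question}",
    "Explain step by step what must be checked to answer: {question}",
    "List the visual evidence needed to answer: {question}",
]

# Stand-in for a VLM call: (image identifier, text prompt) -> generated text.
Model = Callable[[str, str], str]


def two_step_answer(model: Model, image: str, question: str,
                    template: str) -> Tuple[str, str]:
    """Step 1 elicits a reasoning trace; step 2 answers conditioned on it."""
    rationale = model(image, template.format(question=question))
    guided_prompt = f"Reasoning: {rationale}\nQuestion: {question}\nAnswer:"
    return rationale, model(image, guided_prompt)


def build_distillation_set(model: Model, image: str,
                           question: str) -> List[Tuple[str, str]]:
    """Pair the plain question with each reasoning-guided answer.

    Fine-tuning on these pairs is what would let the model produce the
    guided answer without the explicit reasoning step at inference time.
    """
    return [
        (question, two_step_answer(model, image, question, template)[1])
        for template in PROMPT_LIBRARY
    ]


def ensemble_answer(answers: List[str]) -> str:
    """Majority vote (an assumed stand-in for the paper's ensemble method)."""
    return Counter(answers).most_common(1)[0][0]


def toy_model(image: str, prompt: str) -> str:
    """Deterministic fake VLM so the sketch runs without any model weights."""
    if prompt.startswith("Reasoning:"):
        return "yes" if ("objects" in prompt or "evidence" in prompt) else "no"
    return prompt  # echo the prompt back as a fake reasoning trace


pairs = build_distillation_set(toy_model, "img_001", "Is there a cat?")
final = ensemble_answer([answer for _, answer in pairs])
```

With a real VLM in place of toy_model, the pairs would serve as self-distillation training targets, and the vote would be replaced by whatever learned integration the full method uses.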