AI · CV · Oct 21, 2024

Improve Vision Language Model Chain-of-thought Reasoning

Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang

CMU

arXiv:2410.16198v138.5144 citations · h-index: 25 · Has Code · ACL

Originality: Incremental advance

AI Analysis

This work addresses the need for more interpretable and trustworthy vision language models. It is incremental, however, because it builds on existing methods such as distillation and reinforcement learning.

The paper tackles poor chain-of-thought (CoT) reasoning in vision language models, which it attributes to training on short annotations. It distills rationales from GPT-4o and then applies reinforcement learning with Direct Preference Optimization (DPO). The result is a significant improvement in CoT reasoning on benchmark datasets and better generalization to direct answer prediction.

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
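
The pair-construction step and the DPO objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The `Answer:` marker, the exact-match comparison, the helper names (`extract_answer`, `build_preference_pairs`, `dpo_loss`), and the value `beta=0.1` are all assumptions made for illustration; the paper's own parsing and matching rules may differ. The loss itself is the standard per-pair DPO form, -log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]).

```python
import math
from itertools import product


def extract_answer(chain: str) -> str:
    """Hypothetical parser: return the text after the last 'Answer:' marker."""
    marker = "Answer:"
    idx = chain.rfind(marker)
    if idx == -1:
        return ""  # no final answer found; treated as incorrect below
    return chain[idx + len(marker):].strip().lower().rstrip(".")


def build_preference_pairs(chains, gold_answer, max_pairs=None):
    """Pair every correct chain (chosen) with every incorrect one (rejected).

    Correctness is judged by exact match between the parsed prediction and
    the annotated short answer (an assumed, simplified matching rule).
    """
    gold = gold_answer.strip().lower().rstrip(".")
    correct = [c for c in chains if extract_answer(c) == gold]
    incorrect = [c for c in chains if extract_answer(c) != gold]
    pairs = [{"chosen": c, "rejected": r} for c, r in product(correct, incorrect)]
    return pairs if max_pairs is None else pairs[:max_pairs]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    The first two arguments come from the policy being trained, the last two
    from the frozen reference model. beta=0.1 is an assumed value.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    sampled = [
        "The chart has bars for A, B and C. Answer: 3",
        "I count four bars in the chart. Answer: 4",
        "Three categories appear on the x-axis. Answer: 3.",
    ]
    pairs = build_preference_pairs(sampled, gold_answer="3")
    print(len(pairs), "preference pairs")
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

In practice the log-probabilities would be summed over the tokens of each reasoning chain given the image and question, and the loss would be averaged over a batch of pairs. A question whose sampled chains are all correct or all incorrect yields no pairs under this scheme.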
