CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
CoT4Det addresses the weak performance of large vision-language models (LVLMs) on perception tasks, which matters for applications requiring dense scene understanding, though it is an incremental improvement over existing methods.
The paper tackles the poor performance of large vision-language models on perception-oriented tasks such as object detection. It introduces CoT4Det, a chain-of-thought framework that reformulates these tasks into three interpretable steps (classification, counting, and grounding). With Qwen2.5-VL-7B-Instruct, this raises mAP on COCO2017 val from 19.0% to 33.0% and yields gains on other perception benchmarks.
Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19.0% mAP on COCO2017 val, struggling particularly with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on the RefCOCO series and +19% on Flickr30k Entities.
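The three-step decomposition described above (classification, then counting, then grounding) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `ask(image, prompt)` interface, the `cot_detect` function, and the `fake_ask` stub are all hypothetical names chosen for illustration, and a real system would substitute an actual LVLM call and more robust answer parsing.

```python
import re
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, ...]  # (x1, y1, x2, y2) in pixel coordinates (assumed format)

# Hypothetical interface: takes an image and a text prompt, returns the model's text answer.
AskFn = Callable[[object, str], str]


def cot_detect(image: object, ask: AskFn, vocabulary: List[str]) -> Dict[str, List[Box]]:
    """Sketch of a classify -> count -> ground pipeline (assumed prompts)."""
    # Step 1: classification. Ask which categories are present at all.
    answer = ask(
        image,
        "Which of these categories appear in the image? "
        + ", ".join(vocabulary)
        + ". Answer with a comma-separated list.",
    )
    present = [c.strip() for c in answer.split(",") if c.strip() in vocabulary]

    detections: Dict[str, List[Box]] = {}
    for category in present:
        # Step 2: counting. Ask how many instances of this category exist.
        reply = ask(image, f"How many instances of '{category}' are in the image?")
        match = re.search(r"\d+", reply)
        count = int(match.group()) if match else 0
        if count == 0:
            continue

        # Step 3: grounding. Ask for exactly `count` boxes, one per line.
        raw = ask(
            image,
            f"Give the {count} bounding boxes for '{category}' "
            "as x1,y1,x2,y2 with one box per line.",
        )
        boxes = [
            tuple(float(v) for v in line.split(","))
            for line in raw.strip().splitlines()
            if line.strip()
        ]
        # Keep at most `count` boxes so grounding stays consistent with counting.
        detections[category] = boxes[:count]
    return detections


def fake_ask(image: object, prompt: str) -> str:
    """Stand-in for an LVLM with canned answers, used only to exercise the sketch."""
    if prompt.startswith("Which"):
        return "cat, dog"
    if prompt.startswith("How many"):
        return "2" if "'cat'" in prompt else "1"
    if "'cat'" in prompt:
        return "0,0,10,10\n5,5,20,20"
    return "1,2,3,4"


result = cot_detect(None, fake_ask, ["cat", "dog", "car"])
print(result)
```

In this sketch the count from step 2 conditions the grounding prompt in step 3, which is one plausible way the intermediate steps could help recall in dense scenes. How the actual method chains the steps is not specified in the text above.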