Top-Down Semantic Refinement for Image Captioning
This work addresses a fundamental limitation of large vision-language models in tasks that require multi-step, complex scene description. It offers a plug-and-play module that improves fine-grained description, compositional generalization, and hallucination suppression.
The paper addresses the difficulty of maintaining global narrative coherence while capturing rich detail in image captioning. It recasts captioning as a goal-oriented hierarchical refinement planning problem and proposes the Top-Down Semantic Refinement (TDSR) framework, whose efficient Monte Carlo Tree Search algorithm reduces VLM calls by an order of magnitude. TDSR achieves state-of-the-art or competitive results on benchmarks such as DetailCaps, COMPOSITIONCAP, and POPE.
Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step, complex scene description. To overcome this challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem and propose a novel framework, Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating visual-guided parallel expansion and a lightweight value network, TDSR reduces the number of calls to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that TDSR, as a plug-and-play module, significantly enhances the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL), achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
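
The planning loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic MCTS over a toy MDP in which a state is the tuple of refinement steps chosen so far. The names `plan`, `propose_actions`, `value_fn`, and `SALIENCE`, the region labels, and all numeric settings are hypothetical. A cheap scoring function stands in for the lightweight value network, adding every candidate child in one batch stands in for parallel expansion, and a simple "most-visited root action stopped changing" rule stands in for adaptive early stopping. No VLM is called.

```python
import math


class Node:
    """One partial caption plan: the refinement steps chosen so far."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.value_sum = 0.0


def uct_child(node, c):
    """Return the child with the highest UCT score."""
    def score(child):
        if child.visits == 0:
            return float("inf")  # try every child at least once
        mean = child.value_sum / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=score)


def plan(root_state, propose_actions, value_fn, max_depth,
         max_iters=200, patience=20, c=0.5):
    """Run MCTS; return (most-visited root action, iterations used)."""
    root = Node(root_state)
    best, stable, iters = None, 0, 0
    for iters in range(1, max_iters + 1):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = uct_child(node, c)
        # 2. Evaluation: one cheap value estimate instead of a rollout.
        value = value_fn(node.state)
        # 3. Expansion: add all candidate refinements in a single batch.
        if len(node.state) < max_depth:
            for action in propose_actions(node.state):
                node.children[action] = Node(node.state + (action,), node)
        # 4. Backup: propagate the estimate to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
        # 5. Early stopping: quit once the preferred root action is stable.
        current = max(root.children,
                      key=lambda a: root.children[a].visits, default=None)
        if current == best:
            stable += 1
        else:
            best, stable = current, 1
        if stable >= patience:
            break
    return best, iters


# Toy problem (hypothetical): regions of an image with made-up salience.
SALIENCE = {"dog": 0.5, "frisbee": 0.3, "park": 0.1}


def propose_actions(state):
    """Candidate next regions to describe: those not yet covered."""
    return [region for region in SALIENCE if region not in state]


def value_fn(state):
    """Stand-in for a learned value network: total salience covered."""
    return sum(SALIENCE[region] for region in state)


best, iters = plan((), propose_actions, value_fn, max_depth=len(SALIENCE))
print(best, iters)
```

In this toy setting the search settles on describing the most salient region first and stops well before `max_iters`, which is the behaviour the early stopping rule is meant to show: a simple scene consumes fewer iterations than the budget allows.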