CV | Apr 18

Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

Xudong Li, Jiaxi Tan, Ziyin Zhou, Yan Zhong, Zihao Huang, Jingyuan Zheng, Yan Zhang, Xiawu Zheng, Rongrong Ji

arXiv:2604.16858 | 96.0 | h-index: 6

AI Analysis

For researchers and practitioners working on image quality assessment and generative models, Q-DeepSight offers a more reliable IQA model whose localized, actionable feedback can guide iterative image refinement in in-the-loop applications.

Q-DeepSight introduces a think-with-image framework for IQA that uses interleaved multimodal chain-of-thought reasoning with tool-augmented evidence acquisition, achieving state-of-the-art performance across natural, restored, and AI-generated benchmarks. It also enables Perceptual-in-Generation (PiG), a training-free iterative refinement framework that closes the loop between assessment and enhancement.

Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight's diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
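The assess-then-refine loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Diagnosis`, `crop`, and `refine_loop`, the stopping threshold, and the round limit are all hypothetical, and the critic and enhancer are passed in as plain callables because the abstract does not specify their interfaces. The sketch only shows the control flow of a training-free loop in which a localized diagnosis (where quality degrades and why) drives the next enhancement step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Toy stand-ins (assumptions): a grayscale image as rows of floats,
# and a box as (top, left, bottom, right) with exclusive bottom/right.
Image = List[List[float]]
Box = Tuple[int, int, int, int]


@dataclass
class Diagnosis:
    """Hypothetical critic output: a score plus localized evidence."""
    score: float   # higher means better perceived quality
    region: Box    # where quality degrades
    reason: str    # why it degrades


def crop(image: Image, box: Box) -> Image:
    """Crop a region so a critic could inspect it at higher zoom."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def refine_loop(
    image: Image,
    assess: Callable[[Image], Diagnosis],
    enhance: Callable[[Image, Diagnosis], Image],
    target: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[Image, List[Diagnosis]]:
    """Alternate assessment and enhancement without any training step."""
    history: List[Diagnosis] = []
    for _ in range(max_rounds):
        diagnosis = assess(image)
        history.append(diagnosis)
        if diagnosis.score >= target:
            break  # quality is good enough, stop refining
        # The diagnosis tells the enhancer where to act and why.
        image = enhance(image, diagnosis)
    return image, history
```

Any critic that returns a score with a region and a reason, and any enhancer that accepts that diagnosis, can be plugged in. The returned history keeps one diagnosis per round, which makes it easy to check whether the score actually improves across iterations.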
