CV · Apr 6

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

arXiv:2604.04780 · 80.71 citations
AI Analysis

This work addresses a critical issue for real-world applications of multimodal AI by enhancing model robustness to degraded images, though it is incremental as it builds on existing unified architectures.

The paper tackles image degradation, which undermines multimodal understanding in unified models. It proposes CLEAR, a framework that connects generative and reasoning capabilities through supervised fine-tuning, a latent representation bridge, and reinforcement learning. The result is substantially improved robustness on degraded inputs with clean-image performance preserved.

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
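As a rough illustration of what optimizing "under answer-correctness rewards" can look like, the sketch below shows the group-relative advantage normalization used in standard GRPO: several responses are sampled for one question, each is scored, and rewards are standardized within the group. The abstract does not specify how Interleaved GRPO computes rewards or advantages, so the binary exact-match reward, the function names, and the sample answers here are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def correctness_reward(predicted: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 on a case-insensitive exact match."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question whose reference answer is "B"
# (made-up values, not data from the paper).
answers = ["A", "b", "B ", "C"]
rewards = [correctness_reward(a, "B") for a in answers]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(answers, advantages):
    print(f"{answer!r}: advantage {adv:+.3f}")
```

In this sketch, correct answers receive positive advantages and incorrect ones negative. A group in which every answer earns the same reward yields all-zero advantages and therefore no gradient signal.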

