Dingming Zhang

CV
5 papers
12 citations
Novelty 66%
AI Score 55

5 Papers

85.0 CV May 14
RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects

Chunming He, Rihan Zhang, Dingming Zhang et al.

Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE, comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
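The decoupling intuition in this abstract can be illustrated with a toy calculation. The sketch below assumes the standard Retinex model I = R * L (so log I = log R + log L) and uses made-up scalar reflectance and illumination values per region. It is not the paper's theorem or its learned decomposition module; it only shows the triangle inequality that makes the component-wise gap at least as large as the composite gap.

```python
import math

def log_gaps(fg, bg):
    """Return (composite_gap, component_gap) in the log domain.

    fg and bg are hypothetical (reflectance, illumination) pairs for a
    foreground and a background region, under the Retinex model I = R * L.
    """
    d_reflect = math.log(fg[0]) - math.log(bg[0])
    d_illum = math.log(fg[1]) - math.log(bg[1])
    composite_gap = abs(d_reflect + d_illum)       # what an RGB-only view sees
    component_gap = abs(d_reflect) + abs(d_illum)  # what the two factors expose
    return composite_gap, component_gap

# Anti-correlated case: higher reflectance is offset by dimmer illumination.
camouflaged = log_gaps(fg=(0.8, 0.5), bg=(0.5, 0.8))
# Correlated case: both factors differ in the same direction.
aligned = log_gaps(fg=(0.8, 0.8), bg=(0.5, 0.5))
```

In the anti-correlated case the composite gap vanishes while the component gap stays at about 0.94; in the aligned case the two gaps are equal, so decomposing loses nothing.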

12.8 CV Apr 22
StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Liqi Jing, Dingming Zhang, Peinian Li et al.

We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content-style-target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.
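The group-relative part of GRPO can be sketched in a few lines: each sampled output is scored against the mean and spread of its own group rather than against a learned value function. The reward values below are made up, the use of the population standard deviation is an assumption, and the per-action normalization weighting mentioned in the abstract is not shown because its form is not given here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical perceptual rewards for four stylized samples of one input pair.
advantages = group_relative_advantages([0.62, 0.70, 0.55, 0.73])
```

Samples that score above the group mean receive positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and therefore no gradient signal.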

CV Mar 9
QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration

Fengyang Xiao, Jingjia Feng, Peng Hu et al.

Real-world image restoration (RWIR) is a highly challenging task due to the absence of clean ground-truth images. Many recent methods resort to pseudo-label (PL) supervision, often within a Mean-Teacher (MT) framework. However, these methods face a critical paradox: unconditionally trusting the often imperfect, low-quality PLs forces the student model to learn undesirable artifacts, while discarding them severely limits data diversity and impairs model generalization. In this paper, we propose QualiTeacher, a novel framework that transforms pseudo-label quality from a noisy liability into a conditional supervisory signal. Instead of filtering, QualiTeacher explicitly conditions the student model on the quality of the PLs, estimated by an ensemble of complementary no-reference image quality assessment (NR-IQA) models spanning low-level distortion and semantic-level assessment. This strategy teaches the student network to learn a quality-graded restoration manifold, enabling it to understand what constitutes different quality levels. Consequently, it can not only avoid mimicking artifacts from low-quality labels but also extrapolate to generate results of higher quality than the teacher itself. To ensure the robustness and accuracy of this quality-driven learning, we further enhance the process with a multi-augmentation scheme to diversify the PL quality spectrum, a score-based preference optimization strategy inspired by Direct Preference Optimization (DPO) to enforce a monotonically ordered quality separation, and a cropped consistency loss to prevent adversarial over-optimization (reward hacking) of the IQA models. Experiments on standard RWIR benchmarks demonstrate that QualiTeacher can serve as a plug-and-play strategy to improve the quality of existing pseudo-labeling frameworks, establishing a new paradigm for learning from imperfect supervision. Code will be released.
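For readers unfamiliar with the Mean-Teacher setup mentioned above, the teacher is usually an exponential moving average (EMA) of the student's weights. The sketch below shows that standard update on plain Python dicts; the weight names and decay value are hypothetical, and the abstract does not state which update rule QualiTeacher itself uses.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Toy scalar "weights" standing in for full parameter tensors.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is why its pseudo-labels are smoother than the student's own predictions but can still carry artifacts.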

CV Feb 1
Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion

Chunming He, Rihan Zhang, Fengyang Xiao et al.

Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context-Entangled Content Segmentation (CECS), a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual-phase learning framework that unifies curriculum and anti-curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard-but-informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti-Curriculum Promotion phase, we design Spectral-Blindness Fine-Tuning, which suppresses high-frequency components to enforce dependence on low-frequency structural and contextual cues and thus strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context-aware segmentation. Code will be released.
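Suppressing high-frequency components, as described for Spectral-Blindness Fine-Tuning, can be illustrated with a simple Fourier-domain low-pass filter. This is a minimal sketch that assumes an ideal circular mask and a hypothetical `keep_ratio` parameter; the abstract does not specify the filter CurriSeg actually uses.

```python
import numpy as np

def low_pass(image, keep_ratio=0.25):
    """Zero out frequencies outside a centered disc, keeping the low ones.

    image: 2D float array. keep_ratio: disc radius as a fraction of half the
    shorter side (a hypothetical knob for this sketch).
    """
    h, w = image.shape
    # Shift the spectrum so the zero-frequency term sits at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    mask = dist <= keep_ratio * min(h, w) / 2
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

# A checkerboard is almost pure high frequency, so it flattens to its mean.
checker = np.indices((8, 8)).sum(axis=0) % 2
smooth = low_pass(checker.astype(float))
```

Training on inputs filtered this way removes fine texture, so a model has to rely on coarse structure and context, which is the dependence the abstract describes.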

CV Nov 22, 2025
Nested Unfolding Network for Real-World Concealed Object Segmentation

Chunming He, Rihan Zhang, Dingming Zhang et al.

Deep unfolding networks (DUNs) have recently advanced concealed object segmentation (COS) by modeling segmentation as iterative foreground-background separation. However, existing DUN-based methods (e.g., RUN) inherently couple background estimation with image restoration, leading to conflicting objectives and requiring pre-defined degradation types, which are unrealistic in real-world scenarios. To address this, we propose the nested unfolding network (NUN), a unified framework for real-world COS. NUN adopts a DUN-in-DUN design, embedding a degradation-resistant unfolding network (DeRUN) within each stage of a segmentation-oriented unfolding network (SODUN). This design decouples restoration from segmentation while allowing mutual refinement. Guided by a vision-language model (VLM), DeRUN dynamically infers degradation semantics and restores high-quality images without explicit priors, whereas SODUN performs reversible estimation to refine foreground and background. Leveraging the multi-stage nature of unfolding, NUN employs image-quality assessment to select the best DeRUN outputs for subsequent stages, naturally introducing a self-consistency loss that enhances robustness. Extensive experiments show that NUN achieves leading performance on both clean and degraded benchmarks. Code will be released.