CV · May 12

Revisiting Shadow Detection from a Vision-Language Perspective

arXiv:2605.11771 · 73.8
Predicted impact: top 37% in CV · last 90 days
Originality: Highly original
AI Analysis

For computer vision researchers, this work addresses the long-standing ambiguity in shadow detection by integrating language cues, offering a new paradigm that could generalize to other dense prediction tasks.

Shadow detection is reformulated as a vision-language task to resolve visual ambiguities where shadows and dark surfaces appear similar. The proposed SVL framework uses language as a semantic reference, achieving strong performance with less than 1% of its parameters trainable and improved robustness on hard cases.

Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.
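
The abstract names three ingredients: a scene-level shadow ratio, patch-level predictions scored against text embeddings, and a coupling term that keeps the two consistent. The sketch below illustrates one plausible reading of those ideas in plain Python. It is not the authors' implementation: the abstract does not give the loss formulas, so the softmax-over-cosine-similarity scoring, the squared-error coupling term, the temperature value, and all function names (`shadow_ratio`, `patch_shadow_probs`, `coupling_loss`) are assumptions made for illustration.

```python
import math


def shadow_ratio(mask):
    """Fraction of shadow pixels in a binary mask (list of rows of 0/1).

    Assumed here to be the regression target for the scene-level objective.
    """
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def patch_shadow_probs(patch_embs, shadow_text, background_text, temperature=0.1):
    """Per-patch shadow probability from similarity to two text embeddings.

    Hypothetical scoring rule: a two-way softmax over temperature-scaled
    cosine similarities to a "shadow" and a "non-shadow" text embedding.
    """
    probs = []
    for patch in patch_embs:
        s = cosine(patch, shadow_text) / temperature
        b = cosine(patch, background_text) / temperature
        m = max(s, b)  # subtract the max for numerical stability
        es, eb = math.exp(s - m), math.exp(b - m)
        probs.append(es / (es + eb))
    return probs


def coupling_loss(patch_probs, global_ratio):
    """Squared gap between the mean patch probability and the image-level ratio.

    Hypothetical stand-in for a global-to-local consistency term.
    """
    mean_prob = sum(patch_probs) / len(patch_probs)
    return (mean_prob - global_ratio) ** 2


# Toy example: a 2x4 mask with three shadow pixels, and two 2-D patch embeddings.
mask = [[1, 1, 0, 0], [1, 0, 0, 0]]
target_ratio = shadow_ratio(mask)
probs = patch_shadow_probs([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], [0.0, 1.0])
loss = coupling_loss(probs, target_ratio)
```

In this toy setup the first patch aligns with the "shadow" embedding and the second with the "non-shadow" embedding, so the mean patch probability is close to 0.5 while the mask ratio is 0.375. The coupling term penalizes that gap, which is the kind of image-level guidance the abstract describes being transferred to patch-level predictions.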

