CVOct 5, 2022Code
Making Your First Choice: To Address Cold Start Problem in Vision Active LearningLiangyu Chen, Yutong Bai, Siyu Huang et al. · berkeley, harvard
Active learning promises to improve annotation efficiency by iteratively selecting the most important data to be annotated first. However, we uncover a striking contradiction to this promise: active learning fails to select data as efficiently as random selection at the first few choices. We identify this as the cold start problem in vision active learning, caused by a biased and outlier initial query. This paper seeks to address the cold start problem by exploiting the three advantages of contrastive learning: (1) no annotation is required; (2) label diversity is ensured by pseudo-labels to mitigate bias; (3) typical data is determined by contrastive features to reduce outliers. Experiments are conducted on CIFAR-10-LT and three medical imaging datasets (i.e. Colon Pathology, Abdominal CT, and Blood Cell Microscope). Our initial query not only significantly outperforms existing active querying strategies but also surpasses random selection by a large margin. We foresee our solution to the cold start problem as a simple yet strong baseline to choose the initial query for vision active learning. Code is available: https://github.com/c-liangyu/CSVAL
CVNov 23, 2023Code
SinSR: Diffusion-Based Image Super-Resolution in a Single StepYufei Wang, Wenhan Yang, Xinyuan Chen et al.
While super-resolution (SR) methods based on diffusion models exhibit promising results, their practical application is hindered by the substantial number of required inference steps. Recent methods utilize degraded images in the initial state, thereby shortening the Markov chain. Nevertheless, these solutions either rely on a precise formulation of the degradation process or still necessitate a relatively lengthy generation path (e.g., 15 iterations). To enhance inference speed, we propose a simple yet effective method for achieving single-step SR generation, named SinSR. Specifically, we first derive a deterministic sampling process from the most recent state-of-the-art (SOTA) method for accelerating diffusion-based SR. This allows the mapping between the input random noise and the generated high-resolution image to be obtained in a reduced and acceptable number of inference steps during training. We show that this deterministic mapping can be distilled into a student model that performs SR within only one inference step. Additionally, we propose a novel consistency-preserving loss to simultaneously leverage the ground-truth image during the distillation process, ensuring that the performance of the student model is not solely bound by the feature manifold of the teacher model, resulting in further performance improvement. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model, in just one sampling step, resulting in a remarkable up to x10 speedup for inference. Our code will be released at https://github.com/wyf0912/SinSR
IVMar 3, 2023Code
Benchmarking White Blood Cell Classification Under Domain ShiftSatoshi Tsutsui, Zhengyang Su, Bihan Wen
Recognizing the types of white blood cells (WBCs) in microscopic images of human blood smears is a fundamental task in the fields of pathology and hematology. Although previous studies have made significant contributions to the development of methods and datasets, few papers have investigated benchmarks or baselines that others can easily refer to. For instance, we observed notable variations in the reported accuracies of the same Convolutional Neural Network (CNN) model across different studies, yet no public implementation exists to reproduce these results. In this paper, we establish a benchmark for WBC recognition. Our results indicate that CNN-based models achieve high accuracy when trained and tested under similar imaging conditions. However, their performance drops significantly when tested under different conditions. Moreover, the ResNet classifier, which has been widely employed in previous work, exhibits an unreasonably poor generalization ability under domain shifts due to batch normalization. We investigate this issue and suggest some alternative normalization techniques that can mitigate it. We make fully-reproducible code publicly available\footnote{\url{https://github.com/apple2373/wbc-benchmark}}.
CVDec 9, 2022
ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow RemovalLanqing Guo, Chong Wang, Wenhan Yang et al.
Recent deep learning methods have achieved promising results in image shadow removal. However, their restored images still suffer from unsatisfactory boundary artifacts, due to the lack of degradation prior embedding and the deficiency in modeling capacity. Our work addresses these issues by proposing a unified diffusion framework that integrates both the image and degradation priors for highly effective shadow removal. In detail, we first propose a shadow degradation model, which inspires us to build a novel unrolling diffusion model, dubbed ShandowDiffusion. It remarkably improves the model's capacity in shadow removal via progressively refining the desired output with both degradation prior and diffusive generative prior, which by nature can serve as a new strong baseline for image restoration. Furthermore, ShadowDiffusion progressively refines the estimated shadow mask as an auxiliary task of the diffusion generator, which leads to more accurate and robust shadow-free image generation. We conduct extensive experiments on three popular public datasets, including ISTD, ISTD+, and SRD, to validate our method's effectiveness. Compared to the state-of-the-art methods, our model achieves a significant improvement in terms of PSNR, increasing from 31.69dB to 34.73dB over SRD dataset.
56.6CVApr 12
NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and ResultsXin Li, Yeying Jin, Suhang Yao et al.
This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset~\cite{jin2024raindrop}. For this edition, we adjust the dataset with 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams have registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.
CVDec 20, 2022
Temporal Output Discrepancy for Loss Estimation-based Active LearningSiyu Huang, Tianyang Wang, Haoyi Xiong et al. · harvard
While deep learning succeeds in a wide range of tasks, it highly depends on the massive collection of annotated data which is expensive and time-consuming. To lower the cost of data annotation, active learning has been proposed to interactively query an oracle to annotate a small proportion of informative samples in an unlabeled dataset. Inspired by the fact that the samples with higher loss are usually more informative to the model than the samples with lower loss, in this paper we present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss. The core of our approach is a measurement Temporal Output Discrepancy (TOD) that estimates the sample loss by evaluating the discrepancy of outputs given by models at different optimization steps. Our theoretical investigation shows that TOD lower-bounds the accumulated sample loss thus it can be used to select informative unlabeled samples. On basis of TOD, we further develop an effective unlabeled data sampling strategy as well as an unsupervised learning criterion for active learning. Due to the simplicity of TOD, our methods are efficient, flexible, and task-agnostic. Extensive experimental results demonstrate that our approach achieves superior performances than the state-of-the-art active learning methods on image classification and semantic segmentation tasks. In addition, we show that TOD can be utilized to select the best model of potentially the highest testing accuracy from a pool of candidate models.
CVJul 15, 2023
ExposureDiffusion: Learning to Expose for Low-light Image EnhancementYufei Wang, Yi Yu, Wenhan Yang et al.
Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. The proposed framework can work with both real-paired datasets, SOTA noise models, and different backbone networks. Note that, the proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feedforward neural model when few parameters are adopted.
LGSep 13, 2022
Learning to Solve Multiple-TSP with Time Window and Rejections via Deep Reinforcement LearningRongkai Zhang, Cong Zhang, Zhiguang Cao et al.
We propose a manager-worker framework based on deep reinforcement learning to tackle a hard yet nontrivial variant of Travelling Salesman Problem (TSP), \ie~multiple-vehicle TSP with time window and rejections (mTSPTWR), where customers who cannot be served before the deadline are subject to rejections. Particularly, in the proposed framework, a manager agent learns to divide mTSPTWR into sub-routing tasks by assigning customers to each vehicle via a Graph Isomorphism Network (GIN) based policy network. A worker agent learns to solve sub-routing tasks by minimizing the cost in terms of both tour length and rejection rate for each vehicle, the maximum of which is then fed back to the manager agent to learn better assignments. Experimental results demonstrate that the proposed framework outperforms strong baselines in terms of higher solution quality and shorter computation time. More importantly, the trained agents also achieve competitive performance for solving unseen larger instances.
CVFeb 25, 2023
Raw Image Reconstruction with Learned Compact MetadataYufei Wang, Yi Yu, Wenhan Yang et al.
While raw images exhibit advantages over sRGB images (e.g., linearity and fine-grained quantization level), they are not widely used by common users due to the large storage requirements. Very recent works propose to compress raw images by designing the sampling masks in the raw image pixel space, leading to suboptimal image representations and redundant metadata. In this paper, we propose a novel framework to learn a compact representation in the latent space serving as the metadata in an end-to-end manner. Furthermore, we propose a novel sRGB-guided context model with improved entropy estimation strategies, which leads to better reconstruction quality, smaller size of metadata, and faster speed. We illustrate how the proposed raw image compression scheme can adaptively allocate more bits to image regions that are important from a global perspective. The experimental results show that the proposed method can achieve superior raw image reconstruction results using a smaller size of the metadata on both uncompressed sRGB images and JPEG images.
CVFeb 3, 2023
ShadowFormer: Global Context Helps Image Shadow RemovalLanqing Guo, Siyu Huang, Ding Liu et al.
Recent deep learning methods have achieved promising results in image shadow removal. However, most of the existing approaches focus on working locally within shadow and non-shadow regions, resulting in severe artifacts around the shadow boundaries as well as inconsistent illumination between shadow and non-shadow regions. It is still challenging for the deep shadow removal model to exploit the global contextual correlation between shadow and non-shadow regions. In this work, we first propose a Retinex-based shadow model, from which we derive a novel transformer-based network, dubbed ShandowFormer, to exploit non-shadow regions to help shadow region restoration. A multi-scale channel attention framework is employed to hierarchically capture the global information. Based on that, we propose a Shadow-Interaction Module (SIM) with Shadow-Interaction Attention (SIA) in the bottleneck stage to effectively model the context correlation between shadow and non-shadow regions. We conduct extensive experiments on three popular public datasets, including ISTD, ISTD+, and SRD, to evaluate the proposed method. Our method achieves state-of-the-art performance by using up to 150X fewer model parameters.
AIJun 10, 2022
ABCDE: An Agent-Based Cognitive Development EnvironmentJieyi Ye, Jiafei Duan, Samson Yu et al. · uw
Children's cognitive abilities are sometimes cited as AI benchmarks. How can the most common 1,000 concepts (89\% of everyday use) be learnt in a naturalistic children's setting? Cognitive development in children is about quality, and new concepts can be conveyed via simple examples. Our approach of knowledge scaffolding uses simple objects and actions to convey concepts, like how children are taught. We introduce ABCDE, an interactive 3D environment modeled after a typical playroom for children. It comes with 300+ unique 3D object assets (mostly toys), and a large action space for child and parent agents to interact with objects and each other. ABCDE is the first environment aimed at mimicking a naturalistic setting for cognitive development in children; no other environment focuses on high-level concept learning through learner-teacher interactions. The simulator can be found at https://pypi.org/project/ABCDESim/1.0.0/
CVJun 23, 2023
WBCAtt: A White Blood Cell Dataset Annotated with Detailed Morphological AttributesSatoshi Tsutsui, Winnie Pang, Bihan Wen
The examination of blood samples at a microscopic level plays a fundamental role in clinical diagnostics, influencing a wide range of medical conditions. For instance, an in-depth study of White Blood Cells (WBCs), a crucial component of our blood, is essential for diagnosing blood-related diseases such as leukemia and anemia. While multiple datasets containing WBC images have been proposed, they mostly focus on cell categorization, often lacking the necessary morphological details to explain such categorizations, despite the importance of explainable artificial intelligence (XAI) in medical domains. This paper seeks to address this limitation by introducing comprehensive annotations for WBC images. Through collaboration with pathologists, a thorough literature review, and manual inspection of microscopic images, we have identified 11 morphological attributes associated with the cell and its components (nucleus, cytoplasm, and granules). We then annotated ten thousand WBC images with these attributes. Moreover, we conduct experiments to predict these attributes from images, providing insights beyond basic WBC classification. As the first public dataset to offer such extensive annotations, we also illustrate specific applications that can benefit from our attribute annotations. Overall, our dataset paves the way for interpreting WBC recognition models, further advancing XAI in the fields of pathology and hematology.
CVJun 21, 2023
Beyond Learned Metadata-based Raw Image ReconstructionYufei Wang, Yi Yu, Wenhan Yang et al.
While raw images have distinct advantages over sRGB images, e.g., linearity and fine-grained quantization levels, they are not widely adopted by general users due to their substantial storage requirements. Very recent studies propose to compress raw images by designing sampling masks within the pixel space of the raw image. However, these approaches often leave space for pursuing more effective image representations and compact metadata. In this work, we propose a novel framework that learns a compact representation in the latent space, serving as metadata, in an end-to-end manner. Compared with lossy image compression, we analyze the intrinsic difference of the raw image reconstruction task caused by rich information from the sRGB image. Based on the analysis, a novel backbone design with asymmetric and hybrid spatial feature resolutions is proposed, which significantly improves the rate-distortion performance. Besides, we propose a novel design of the context model, which can better predict the order masks of encoding/decoding based on both the sRGB image and the masks of already processed features. Benefited from the better modeling of the correlation between order masks, the already processed information can be better utilized. Moreover, a novel sRGB-guided adaptive quantization precision strategy, which dynamically assigns varying levels of quantization precision to different regions, further enhances the representation ability of the model. Finally, based on the iterative properties of the proposed context model, we propose a novel strategy to achieve variable bit rates using a single model. This strategy allows for the continuous convergence of a wide range of bit rates. Extensive experimental results demonstrate that the proposed method can achieve better reconstruction quality with a smaller metadata size.
91.7CVMar 30Code
GEditBench v2: A Human-Aligned Benchmark for General Image EditingZhangqi Jiang, Zheng Sun, Xianfang Zeng et al.
Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. Besides, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models, and providing a reliable foundation for advancing precise image editing.
CVFeb 11, 2023
Removing Image Artifacts From Scratched Lens ProtectorsYufei Wang, Renjie Wan, Wenhan Yang et al.
A protector is placed in front of the camera lens for mobile devices to avoid damage, while the protector itself can be easily scratched accidentally, especially for plastic ones. The artifacts appear in a wide variety of patterns, making it difficult to see through them clearly. Removing image artifacts from the scratched lens protector is inherently challenging due to the occasional flare artifacts and the co-occurring interference within mixed artifacts. Though different methods have been proposed for some specific distortions, they seldom consider such inherent challenges. In our work, we consider the inherent challenges in a unified framework with two cooperative modules, which facilitate the performance boost of each other. We also collect a new dataset from the real world to facilitate training and evaluation purposes. The experimental results demonstrate that our method outperforms the baselines qualitatively and quantitatively. The code and datasets will be released after acceptance.
LGMar 31, 2023
Towards Adversarially Robust Continual LearningTao Bai, Chen Chen, Lingjuan Lyu et al.
Recent studies show that models trained by continual learning can achieve the comparable performances as the standard supervised learning and the learning flexibility of continual learning models enables their wide applications in the real world. Deep learning models, however, are shown to be vulnerable to adversarial attacks. Though there are many studies on the model robustness in the context of standard supervised learning, protecting continual learning from adversarial attacks has not yet been investigated. To fill in this research gap, we are the first to study adversarial robustness in continual learning and propose a novel method called \textbf{T}ask-\textbf{A}ware \textbf{B}oundary \textbf{A}ugmentation (TABA) to boost the robustness of continual learning models. With extensive experiments on CIFAR-10 and CIFAR-100, we show the efficacy of adversarial training and TABA in defending adversarial attacks.
CVJul 18, 2024
SegPoint: Segment Any Point Cloud via Large Language ModelShuting He, Henghui Ding, Xudong Jiang et al.
Despite significant progress in 3D point cloud segmentation, existing methods primarily address specific tasks and depend on explicit instructions to identify targets, lacking the capability to infer and understand implicit user intentions in a unified framework. In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance 3D instruction research, we introduce a new benchmark, Instruct3D, designed to evaluate segmentation performance from complex and implicit instructional texts, featuring 2,565 point cloud-instruction pairs. Our experimental results demonstrate that SegPoint achieves competitive performance on established benchmarks such as ScanRefer for referring segmentation and ScanNet for semantic segmentation, while delivering outstanding outcomes on the Instruct3D dataset. To our knowledge, SegPoint is the first model to address these varied segmentation tasks within a single framework, achieving satisfactory performance.
CVJul 9, 2023
Enhancing Low-Light Images Using Infrared-Encoded ImagesShulin Tian, Yufei Wang, Renjie Wan et al.
Low-light image enhancement task is essential yet challenging as it is ill-posed intrinsically. Previous arts mainly focus on the low-light images captured in the visible spectrum using pixel-wise loss, which limits the capacity of recovering the brightness, contrast, and texture details due to the small number of income photons. In this work, we propose a novel approach to increase the visibility of images captured under low-light environments by removing the in-camera infrared (IR) cut-off filter, which allows for the capture of more photons and results in improved signal-to-noise ratio due to the inclusion of information from the IR spectrum. To verify the proposed strategy, we collect a paired dataset of low-light images captured without the IR cut-off filter, with corresponding long-exposure reference images with an external filter. The experimental results on the proposed dataset demonstrate the effectiveness of the proposed method, showing better performance quantitatively and qualitatively. The dataset and code are publicly available at https://wyf0912.github.io/ELIEI/
IVMar 3, 2023
Unsupervised Deep Digital Staining For Microscopic Cell Images Via Knowledge DistillationZiwang Xu, Lanqing Guo, Shuyan Zhang et al.
Staining is critical to cell imaging and medical diagnosis, which is expensive, time-consuming, labor-intensive, and causes irreversible changes to cell tissues. Recent advances in deep learning enabled digital staining via supervised model training. However, it is difficult to obtain large-scale stained/unstained cell image pairs in practice, which need to be perfectly aligned with the supervision. In this work, we propose a novel unsupervised deep learning framework for the digital staining of cell images using knowledge distillation and generative adversarial networks (GANs). A teacher model is first trained mainly for the colorization of bright-field images. After that,a student GAN for staining is obtained by knowledge distillation with hybrid non-reference losses. We show that the proposed unsupervised deep staining method can generate stained images with more accurate positions and shapes of the cell targets. Compared with other unsupervised deep generative models for staining, our method achieves much more promising results both qualitatively and quantitatively.
SINov 12, 2025Code
Conformal Prediction for Multi-Source Detection on a NetworkXingchao Jian, Purui Zhang, Lan Tian et al.
Detecting the origin of information or infection spread in networks is a fundamental challenge with applications in misinformation tracking, epidemiology, and beyond. We study the multi-source detection problem: given snapshot observations of node infection status on a graph, estimate the set of source nodes that initiated the propagation. Existing methods either lack statistical guarantees or are limited to specific diffusion models and assumptions. We propose a novel conformal prediction framework that provides statistically valid recall guarantees for source set detection, independent of the underlying diffusion process or data distribution. Our approach introduces principled score functions to quantify the alignment between predicted probabilities and true sources, and leverages a calibration set to construct prediction sets with user-specified recall and coverage levels. The method is applicable to both single- and multi-source scenarios, supports general network diffusion dynamics, and is computationally efficient for large graphs. Empirical results demonstrate that our method achieves rigorous coverage with competitive accuracy, outperforming existing baselines in both reliability and scalability.The code is available online.
CVJul 25, 2024Code
Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile PhotographyKailai Zhou, Lijing Cai, Yibo Wang et al.
The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spectral decomposition model guided enhancement framework, which consists of two steps: joint decomposition and prior-guided enhancement. Firstly, we leverage the complementarity between RGB and Low-resolution Multi-Spectral Images (Lr-MSI) to predict shading, reflectance, and material semantic priors. Subsequently, these priors are seamlessly integrated into the established HDRNet to promote dynamic range enhancement, color mapping, and grid expert learning, respectively. Additionally, we construct a high-quality Mobile-Spec dataset to support our research, and our experiments validate the effectiveness of Lr-MSI in the tone enhancement task. This work aims to establish a solid foundation for advancing spectral vision in mobile photography. The code is available at \url{https://github.com/CalayZhou/JDM-HDRNet}.
IVJul 25, 2022
REPNP: Plug-and-Play with Deep Reinforcement Learning Prior for Robust Image RestorationChong Wang, Rongkai Zhang, Saiprasad Ravishankar et al.
Image restoration schemes based on the pre-trained deep models have received great attention due to their unique flexibility for solving various inverse problems. In particular, the Plug-and-Play (PnP) framework is a popular and powerful tool that can integrate an off-the-shelf deep denoiser for different image restoration tasks with known observation models. However, obtaining the observation model that exactly matches the actual one can be challenging in practice. Thus, the PnP schemes with conventional deep denoisers may fail to generate satisfying results in some real-world image restoration tasks. We argue that the robustness of the PnP framework is largely limited by using the off-the-shelf deep denoisers that are trained by deterministic optimization. To this end, we propose a novel deep reinforcement learning (DRL) based PnP framework, dubbed RePNP, by leveraging a light-weight DRL-based denoiser for robust image restoration tasks. Experimental results demonstrate that the proposed RePNP is robust to the observation model used in the PnP scheme deviating from the actual one. Thus, RePNP can generate more reliable restoration results for image deblurring and super resolution tasks. Compared with several state-of-the-art deep image restoration baselines, RePNP achieves better results subjective to model deviation with fewer model parameters.
CVMar 18, 2023
Confidence Attention and Generalization Enhanced Distillation for Continuous Video Domain AdaptationXiyu Wang, Yuecong Xu, Jianfei Yang et al.
Continuous Video Domain Adaptation (CVDA) is a scenario where a source model is required to adapt to a series of individually available changing target domains continuously without source data or target supervision. It has wide applications, such as robotic vision and autonomous driving. The main underlying challenge of CVDA is to learn helpful information only from the unsupervised target data while avoiding forgetting previously learned knowledge catastrophically, which is out of the capability of previous Video-based Unsupervised Domain Adaptation methods. Therefore, we propose a Confidence-Attentive network with geneRalization enhanced self-knowledge disTillation (CART) to address the challenge in CVDA. Firstly, to learn from unsupervised domains, we propose to learn from pseudo labels. However, in continuous adaptation, prediction errors can accumulate rapidly in pseudo labels, and CART effectively tackles this problem with two key modules. Specifically, The first module generates refined pseudo labels using model predictions and deploys a novel attentive learning strategy. The second module compares the outputs of augmented data from the current model to the outputs of weakly augmented data from the source model, forming a novel consistency regularization on the model to alleviate the accumulation of prediction errors. Extensive experiments suggest that the CVDA performance of CART outperforms existing methods by a considerable margin.
56.6CVMay 13Code
Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive ImagingWudi Chen, Zhiyuan Zha, Xin Yuan et al.
Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral super-resolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial-frequency-channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation. Our code and more results are available at: https://github.com/PaiDii/Phy-CoSF.git.
SPSep 12, 2023
Spectral Convergence of Complexon Shift OperatorsPurui Zhang, Xingchao Jian, Feng Ji et al.
Topological Signal Processing (TSP) utilizes simplicial complexes to model structures with higher order than vertices and edges. In this paper, we study the transferability of TSP via a generalized higher-order version of graphon, known as complexon. We recall the notion of a complexon as the limit of a simplicial complex sequence [1]. Inspired by the graphon shift operator and message-passing neural network, we construct a marginal complexon and complexon shift operator (CSO) according to components of all possible dimensions from the complexon. We investigate the CSO's eigenvalues and eigenvectors and relate them to a new family of weighted adjacency matrices. We prove that when a simplicial complex signal sequence converges to a complexon signal, the eigenvalues, eigenspaces, and Fourier transform of the corresponding CSOs converge to that of the limit complexon signal. This conclusion is further verified by two numerical experiments. These results hint at learning transferability on large simplicial complexes or simplicial complex sequences, which generalize the graphon signal processing framework.
CVJul 11, 2024
Single-Image Shadow Removal Using Deep Learning: A Comprehensive SurveyLaniqng Guo, Chong Wang, Yufei Wang et al.
Shadow removal aims at restoring the image content within shadow regions, pursuing a uniform distribution of illumination that is consistent between shadow and non-shadow regions. {Comparing to other image restoration tasks, there are two unique challenges in shadow removal:} 1) The patterns of shadows are arbitrary, varied, and often have highly complex trace structures, making ``trace-less'' image recovery difficult. 2) The degradation caused by shadows is spatially non-uniform, resulting in inconsistencies in illumination and color between shadow and non-shadow areas. Recent developments in this field are primarily driven by deep learning-based solutions, employing a variety of learning strategies, network architectures, loss functions, and training data. Nevertheless, a thorough and insightful review of deep learning-based shadow removal techniques is still lacking. In this paper, we are the first to provide a comprehensive survey to cover various aspects ranging from technical details to applications. We highlight the major advancements in deep learning-based single-image shadow removal methods, thoroughly review previous research across various categories, and provide insights into the historical progression of these developments. Additionally, we summarize performance comparisons both quantitatively and qualitatively. Beyond the technical aspects of shadow removal methods, we also explore potential future directions for this field.
CVSep 17, 2024
Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image DenoisersZixuan Fu, Lanqing Guo, Chong Wang et al.
Recent advancements in deep learning have shown impressive results in image and video denoising, leveraging extensive pairs of noisy and noise-free data for supervision. However, the challenge of acquiring paired videos for dynamic scenes hampers the practical deployment of deep video denoising techniques. In contrast, this obstacle is less pronounced in image denoising, where paired data is more readily available. Thus, a well-trained image denoiser could serve as a reliable spatial prior for video denoising. In this paper, we propose a novel unsupervised video denoising framework, named ``Temporal As a Plugin'' (TAP), which integrates tunable temporal modules into a pre-trained image denoiser. By incorporating temporal modules, our method can harness temporal information across noisy frames, complementing its power of spatial denoising. Furthermore, we introduce a progressive fine-tuning strategy that refines each temporal module using the generated pseudo clean video frames, progressively enhancing the network's denoising performance. Compared to other unsupervised video denoising methods, our framework demonstrates superior performance on both sRGB and raw video denoising datasets.
CVSep 11, 2024
SoftShadow: Leveraging Soft Masks for Penumbra-Aware Shadow RemovalXinrui Wang, Lanqing Guo, Xiyu Wang et al.
Recent advancements in deep learning have yielded promising results for the image shadow removal task. However, most existing methods rely on binary pre-generated shadow masks. The binary nature of such masks could potentially lead to artifacts near the boundary between shadow and non-shadow areas. In view of this, inspired by the physical model of shadow formation, we introduce novel soft shadow masks specifically designed for shadow removal. To achieve such soft masks, we propose a SoftShadow framework by leveraging the prior knowledge of pretrained SAM and integrating physical constraints. Specifically, we jointly tune the SAM and the subsequent shadow removal network using penumbra formation constraint loss, mask reconstruction loss, and shadow removal loss. This framework enables accurate predictions of penumbra (partially shaded) and umbra (fully shaded) areas while simultaneously facilitating end-to-end shadow removal. Through extensive experiments on popular datasets, we found that our SoftShadow framework, which generates soft masks, can better restore boundary artifacts, achieve state-of-the-art performance, and demonstrate superior generalizability.
92.9CVMay 21
EvoVid: Temporal-Centric Self-Evolution for Video Large Language ModelsShiqi Huang, Ziyue Wang, Zhongrong Zuo et al.
Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose $\textbf{EvoVid}$, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.
20.1CVMay 19
WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell ImagesSatoshi Tsutsui, Winnie Pang, Shuting He et al.
The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. \revision{The dataset and code are publicly available\footnote{https://doi.org/10.57967/hf/8143}}.
CVMay 1, 2024Code
EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language ModelDeng Li, Xin Liu, Bohao Xing et al.
Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have been more focused on short sequential video emotion analysis while overlooking long sequential video. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address the aforementioned limitations, in this paper, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD by collecting and processing the sequences of athletes' post-match interviews. In addition to providing annotations of the overall emotional state of each video, we also provide the Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate the Multimodal Large Language Models (MLLMs) with de-identification signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable, even better performance than the supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.
CVDec 17, 2024Code
ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance SegmentationShiqi Huang, Shuting He, Bihan Wen
Instance segmentation algorithms in remote sensing are typically based on conventional methods, limiting their application to seen scenarios and closed-set predictions. In this work, we propose a novel task called zero-shot remote sensing instance segmentation, aimed at identifying aerial objects that are absent from training data. Challenges arise when classifying aerial categories with high inter-class similarity and intra-class variance. Besides, the domain gap between vision-language models' pretraining datasets and remote sensing datasets hinders the zero-shot capabilities of the pretrained model when it is directly applied to remote sensing images. To address these challenges, we propose a $\textbf{Z}$ero-Sh$\textbf{o}$t $\textbf{R}$emote Sensing $\textbf{I}$nstance Segmentation framework, dubbed $\textbf{ZoRI}$. Our approach features a discrimination-enhanced classifier that uses refined textual embeddings to increase the awareness of class disparities. Instead of direct fine-tuning, we propose a knowledge-maintained adaptation strategy that decouples semantic-related information to preserve the pretrained vision-language alignment while adjusting features to capture remote sensing domain-specific visual cues. Additionally, we introduce a prior-injected prediction with cache bank of aerial visual prototypes to supplement the semantic richness of text embeddings and seamlessly integrate aerial representations, adapting to the remote sensing domain. We establish new experimental protocols and benchmarks, and extensive experiments convincingly demonstrate that ZoRI achieves the state-of-art performance on the zero-shot remote sensing instance segmentation task. Our code is available at https://github.com/HuangShiqi128/ZoRI.
CVNov 14, 2025
RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian SplattingRuocheng Wu, Haolan He, Yufei Wang et al.
3D Gaussian Splatting (3DGS) has recently gained great attention in the 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making it challenging for direct supervision in the optimization process. To address this problem, we introduce an unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.
CVJul 9, 2024
Integrating Clinical Knowledge into Concept Bottleneck ModelsWinnie Pang, Xueyi Ke, Satoshi Tsutsui et al.
Concept bottleneck models (CBMs), which predict human-interpretable concepts (e.g., nucleus shapes in cell images) before predicting the final output (e.g., cell type), provide insights into the decision-making processes of the model. However, training CBMs solely in a data-driven manner can introduce undesirable biases, which may compromise prediction performance, especially when the trained models are evaluated on out-of-domain images (e.g., those acquired using different devices). To mitigate this challenge, we propose integrating clinical knowledge to refine CBMs, better aligning them with clinicians' decision-making processes. Specifically, we guide the model to prioritize the concepts that clinicians also prioritize. We validate our approach on two datasets of medical images: white blood cell and skin images. Empirical validation demonstrates that incorporating medical guidance enhances the model's classification performance on unseen datasets with varying preparation methods, thereby increasing its real-world applicability.
CVApr 27, 2025Code
PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover ClassificationHuiling Zheng, Xian Zhong, Bin Liu et al.
The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and underexploited spectral complementarity. Existing approaches often fail to decouple shared structural features from modality-complementary radiometric attributes, resulting in feature conflicts and information loss. To address this, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-complementary) components in the Fourier domain. This design reinforces shared structures while preserving complementary characteristics, thereby enhancing fusion quality. Unlike previous methods that overlook the distinct physical properties encoded in frequency spectra, PAD explicitly introduces amplitude-phase decoupling for multi-modal fusion. Specifically, PAD comprises two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features via convolution-guided scaling to improve geometric consistency; and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high- and low-frequency patterns using frequency-adaptive multilayer perceptrons, effectively exploiting SAR's morphological sensitivity and RGB's spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK demonstrate state-of-the-art performance. This work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at https://github.com/RanFeng2/PAD.
CVJan 4Code
Improving Flexible Image Tokenizers for Autoregressive Image GenerationZixuan Fu, Lanqing Guo, Chong Wang et al.
Flexible image tokenizers aim to represent an image using an ordered 1D variable-length token sequence. This flexible tokenization is typically achieved through nested dropout, where a portion of trailing tokens is randomly truncated during training, and the image is reconstructed using the remaining preceding sequence. However, this tail-truncation strategy inherently concentrates the image information in the early tokens, limiting the effectiveness of downstream AutoRegressive (AR) image generation as the token length increases. To overcome these limitations, we propose \textbf{ReToK}, a flexible tokenizer with \underline{Re}dundant \underline{Tok}en Padding and Hierarchical Semantic Regularization, designed to fully exploit all tokens for enhanced latent modeling. Specifically, we introduce \textbf{Redundant Token Padding} to activate tail tokens more frequently, thereby alleviating information over-concentration in the early tokens. In addition, we apply \textbf{Hierarchical Semantic Regularization} to align the decoding features of earlier tokens with those from a pre-trained vision foundation model, while progressively reducing the regularization strength toward the tail to allow finer low-level detail reconstruction. Extensive experiments demonstrate the effectiveness of ReTok: on ImageNet 256$\times$256, our method achieves superior generation performance compared with both flexible and fixed-length tokenizers. Code will be available at: \href{https://github.com/zfu006/ReTok}{https://github.com/zfu006/ReTok}
CVJul 22, 2025Code
M-SpecGene: Generalized Foundation Model for RGBT Multispectral VisionKailai Zhou, Fuqiang Yang, Shixian Wang et al.
RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.
CVJul 17, 2025Code
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance SegmentationShiqi Huang, Shuting He, Huaiyuan Qin et al.
Most existing remote sensing instance segmentation approaches are designed for close-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose $\textbf{SCORE}$ ($\textbf{S}$cene $\textbf{C}$ontext matters in $\textbf{O}$pen-vocabulary $\textbf{RE}$mote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that, our proposed method achieves SOTA performance, which provides a robust solution for large-scale, real-world geospatial analysis. Our code is available at https://github.com/HuangShiqi128/SCORE.
CVMay 15, 2023Code
Denoising Diffusion Models for Plug-and-Play Image RestorationYuanzhi Zhu, Kai Zhang, Jingyun Liang et al.
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible and interpretable method for solving various inverse problems by utilizing any off-the-shelf denoiser as the implicit image prior. However, most existing methods focus on discriminative Gaussian denoisers. Although diffusion models have shown impressive performance for high-quality image synthesis, their potential to serve as a generative denoiser prior to the plug-and-play IR methods remains to be further explored. While several other attempts have been made to adopt diffusion models for image restoration, they either fail to achieve satisfactory results or typically require an unacceptable number of Neural Function Evaluations (NFEs) during inference. This paper proposes DiffPIR, which integrates the traditional plug-and-play method into the diffusion sampling framework. Compared to plug-and-play IR methods that rely on discriminative Gaussian denoisers, DiffPIR is expected to inherit the generative ability of diffusion models. Experimental results on three representative IR tasks, including super-resolution, image deblurring, and inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and perceptual quality with no more than 100 NFEs. The source code is available at {\url{https://github.com/yuanzhi-zhu/DiffPIR}}
CVJun 24, 2020Code
Exploiting Non-Local Priors via Self-Convolution For Highly-Efficient Image RestorationLanqing Guo, Zhiyuan Zha, Saiprasad Ravishankar et al.
Constructing effective image priors is critical to solving ill-posed inverse problems in image processing and imaging. Recent works proposed to exploit image non-local similarity for inverse problems by grouping similar patches and demonstrated state-of-the-art results in many applications. However, compared to classic methods based on filtering or sparsity, most of the non-local algorithms are time-consuming, mainly due to the highly inefficient and redundant block matching step, where the distance between each pair of overlapping patches needs to be computed. In this work, we propose a novel Self-Convolution operator to exploit image non-local similarity in a self-supervised way. The proposed Self-Convolution can generalize the commonly-used block matching step and produce equivalent results with much cheaper computation. Furthermore, by applying Self-Convolution, we propose an effective multi-modality image restoration scheme, which is much more efficient than conventional block matching for non-local modeling. Experimental results demonstrate that (1) Self-Convolution can significantly speed up most of the popular non-local image restoration algorithms, with two-fold to nine-fold faster block matching, and (2) the proposed multi-modality image restoration scheme achieves superior denoising results in both efficiency and effectiveness on RGB-NIR images. The code is publicly available at \href{https://github.com/GuoLanqing/Self-Convolution}.
CVSep 6, 2018Code
Connecting Image Denoising and High-Level Vision Tasks via Deep LearningDing Liu, Bihan Wen, Jianbo Jiao et al.
Image denoising and high-level vision tasks are usually handled independently in the conventional practice of computer vision, and their connection is fragile. In this paper, we cope with the two jointly and explore the mutual influence between them with the focus on two questions, namely (1) how image denoising can help improving high-level vision tasks, and (2) how the semantic information from high-level vision tasks can be used to guide image denoising. First for image denoising we propose a convolutional neural network in which convolutions are conducted in various spatial resolutions via downsampling and upsampling operations in order to fuse and exploit contextual information on different scales. Second we propose a deep neural network solution that cascades two modules for image denoising and various high-level tasks, respectively, and use the joint loss for updating only the denoising network via back-propagation. We experimentally show that on one hand, the proposed denoiser has the generality to overcome the performance degradation of different high-level vision tasks. On the other hand, with the guidance of high-level vision information, the denoising network produces more visually appealing results. Extensive experiments demonstrate the benefit of exploiting image semantics simultaneously for image denoising and high-level vision tasks via deep learning. The code is available online: https://github.com/Ding-Liu/DeepDenoising
CVJun 14, 2017Code
When Image Denoising Meets High-Level Vision Tasks: A Deep Learning ApproachDing Liu, Bihan Wen, Xianming Liu et al.
Conventionally, image denoising and high-level vision tasks are handled separately in computer vision. In this paper, we cope with the two jointly and explore the mutual influence between them. First we propose a convolutional neural network for image denoising which achieves the state-of-the-art performance. Second we propose a deep neural network solution that cascades two modules for image denoising and various high-level tasks, respectively, and use the joint loss for updating only the denoising network via back-propagation. We demonstrate that on one hand, the proposed denoiser has the generality to overcome the performance degradation of different high-level vision tasks. On the other hand, with the guidance of high-level vision information, the denoising network can generate more visually appealing results. To the best of our knowledge, this is the first work investigating the benefit of exploiting image semantics simultaneously for image denoising and high-level vision tasks via deep learning. The code is available online https://github.com/Ding-Liu/DeepDenoising.
CVFeb 16, 2024
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution AdaptationLanqing Guo, Yingqing He, Haoxin Chen et al.
Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just $10k$ steps, with virtually no additional inference time.
CVJan 29
RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial ReasoningShiqi Huang, Shuting He, Bihan Wen
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed \textbf{RSGround-R1}, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
CVApr 17, 2025
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and ResultsXin Li, Yeying Jin, Xin Jin et al.
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
LGDec 22, 2025
Learning Through Little Eyes: Attribute Discrimination Beyond ObjectsPatrick Batsell, Tsutsui Satoshi, Bihan Wen
Infants learn to recognize not only object categories but also fine grained attributes such as color, size, and texture within their first two years of life. Prior work explores Childs View for Contrastive Learning (CVCL), a CLIP style model trained on infant egocentric video as a computational model of early infant learning, but it focuses only on class level recognition. This leaves it unclear whether infant scale learning also supports attribute discrimination. To address this, we introduce a benchmark that systematically varies color, size, and texture, allowing controlled tests of within class attribute recognition. Comparing CVCL with CLIP shows clear differences. CVCL is better at size discrimination, while CLIP achieves higher accuracy on color discrimination. Both models represent texture in image embeddings but fail to ground texture linguistically, suggesting a gap between visual and language spaces.
CVMay 20, 2025
Sparc3D: Sparse Representation and Construction for High-Resolution 3D Shapes ModelingZhihao Li, Yufei Wang, Heliang Zheng et al.
High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines-compressing meshes with a VAE (using either 2D or 3D supervision), followed by latent diffusion sampling-often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced in VAE. We introduce Sparc3D, a unified framework that combines a sparse deformable marching cubes representation Sparcubes with a novel encoder Sparconv-VAE. Sparcubes converts raw meshes into high-resolution ($1024^3$) surfaces with arbitrary topology by scattering signed distance and deformation fields onto a sparse cube, allowing differentiable optimization. Sparconv-VAE is the first modality-consistent variational autoencoder built entirely upon sparse convolutional networks, enabling efficient and near-lossless 3D reconstruction suitable for high-resolution generative modeling through latent diffusion. Sparc3D achieves state-of-the-art reconstruction fidelity on challenging inputs, including open surfaces, disconnected components, and intricate geometry. It preserves fine-grained shape details, reduces training and inference cost, and integrates naturally with latent diffusion models for scalable, high-resolution 3D generation.
IVMar 15, 2024
Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRIChong Wang, Lanqing Guo, Yufei Wang et al.
Deep unfolding networks (DUN) have emerged as a popular iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction. However, conventional DUN aims to reconstruct all the missing information within the entire null space in each iteration. Thus it could be challenging when dealing with highly ill-posed degradation, usually leading to unsatisfactory reconstruction. In this work, we propose a Progressive Divide-And-Conquer (PDAC) strategy, aiming to break down the subsampling process in the actual severe degradation and thus perform reconstruction sequentially. Starting from decomposing the original maximum-a-posteriori problem of accelerated MRI, we present a rigorous derivation of the proposed PDAC framework, which could be further unfolded into an end-to-end trainable network. Specifically, each iterative stage in PDAC focuses on recovering a distinct moderate degradation according to the decomposition. Furthermore, as part of the PDAC iteration, such decomposition is adaptively learned as an auxiliary task through a degradation predictor which provides an estimation of the decomposed sampling mask. Following this prediction, the sampling mask is further integrated via a severity conditioning module to ensure awareness of the degradation severity at each stage. Extensive experiments demonstrate that our proposed method achieves superior performance on the publicly available fastMRI and Stanford2D FSE datasets in both multi-coil and single-coil settings.
CVMar 15, 2024
Benchmarking Adversarial Robustness of Image Shadow Removal with Shadow-adaptive AttacksChong Wang, Yi Yu, Lanqing Guo et al.
Shadow removal is a task aimed at erasing regional shadows present in images and reinstating visually pleasing natural scenes with consistent illumination. While recent deep learning techniques have demonstrated impressive performance in image shadow removal, their robustness against adversarial attacks remains largely unexplored. Furthermore, many existing attack frameworks typically allocate a uniform budget for perturbations across the entire input image, which may not be suitable for attacking shadow images. This is primarily due to the unique characteristic of spatially varying illumination within shadow images. In this paper, we propose a novel approach, called shadow-adaptive adversarial attack. Different from standard adversarial attacks, our attack budget is adjusted based on the pixel intensity in different regions of shadow images. Consequently, the optimized adversarial noise in the shadowed regions becomes visually less perceptible while permitting a greater tolerance for perturbations in non-shadow regions. The proposed shadow-adaptive attacks naturally align with the varying illumination distribution in shadow images, resulting in perturbations that are less conspicuous. Building on this, we conduct a comprehensive empirical evaluation of existing shadow removal methods, subjecting them to various levels of attack on publicly available datasets.
CVMay 20, 2024
Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion ModelsXiyu Wang, Yufei Wang, Satoshi Tsutsui et al.
Diffusion-based models for story visualization have shown promise in generating content-coherent images for storytelling tasks. However, how to effectively integrate new characters into existing narratives while maintaining character consistency remains an open problem, particularly with limited data. Two major limitations hinder the progress: (1) the absence of a suitable benchmark due to potential character leakage and inconsistent text labeling, and (2) the challenge of distinguishing between new and old characters, leading to ambiguous results. To address these challenges, we introduce the NewEpisode benchmark, comprising refined datasets designed to evaluate generative models' adaptability in generating new stories with fresh characters using just a single example story. The refined dataset involves refined text prompts and eliminates character leakage. Additionally, to mitigate the character confusion of generated results, we propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters seamlessly integrating them into established character dynamics. EpicEvo introduces a novel adversarial character alignment module to align the generated images progressively in the diffusive process, with exemplar images of new characters, while applying knowledge distillation to prevent forgetting of characters and background details. Our evaluation quantitatively demonstrates that EpicEvo outperforms existing baselines on the NewEpisode benchmark, and qualitative studies confirm its superior customization of visual story generation in diffusion models. In summary, EpicEvo provides an effective way to incorporate new characters using only one example story, unlocking new possibilities for applications such as serialized cartoons.