CVNov 27, 2023Code
Single-Model and Any-Modality for Video Object TrackingZongwei Wu, Jilai Zheng, Xiangxuan Ren et al.
In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement the RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations, the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a Unified Tracker of a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture. Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific counterparts, validating our effectiveness and practicality. The source code is publicly available at https://github.com/Zongwei97/UnTrack.
CVApr 13Code
The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and ResultsXingyu Qiu, Yuqian Fu, Jiawei Geng et al.
Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.
CVApr 11, 2023Code
CamDiff: Camouflage Image Augmentation via Diffusion ModelXue-Jing Luo, Shuo Wang, Zongwei Wu et al.
The burgeoning field of camouflaged object detection (COD) seeks to identify objects that blend into their surroundings. Despite the impressive performance of recent models, we have identified a limitation in their robustness, where existing methods may misclassify salient objects as camouflaged ones, despite these two characteristics being contradictory. This limitation may stem from lacking multi-pattern training images, leading to less saliency robustness. To address this issue, we introduce CamDiff, a novel approach inspired by AI-Generated Content (AIGC) that overcomes the scarcity of multi-pattern training images. Specifically, we leverage the latent diffusion model to synthesize salient objects in camouflaged scenes, while using the zero-shot image classification ability of the Contrastive Language-Image Pre-training (CLIP) model to prevent synthesis failures and ensure the synthesized object aligns with the input prompt. Consequently, the synthesized image retains its original camouflage label while incorporating salient objects, yielding camouflage samples with richer characteristics. The results of user studies show that the salient objects in the scenes synthesized by our framework attract the user's attention more; thus, such samples pose a greater challenge to the existing COD models. Our approach enables flexible editing and efficient large-scale dataset generation at a low cost. It significantly enhances COD baselines' training and testing phases, emphasizing robustness across diverse domains. Our newly-generated datasets and source code are available at https://github.com/drlxj/CamDiff.
CVApr 28, 2023Code
Event-Free Moving Object Segmentation from Moving Ego VehicleZhuyun Zhou, Zongwei Wu, Danda Pani Paudel et al.
Moving object segmentation (MOS) in dynamic scenes is an important, challenging, but under-explored research topic for autonomous driving, especially for sequences obtained from moving ego vehicles. Most segmentation methods leverage motion cues obtained from optical flow maps. However, since these methods are often based on optical flows that are pre-computed from successive RGB frames, this neglects the temporal consideration of events occurring within the inter-frame, consequently constraining its ability to discern objects exhibiting relative staticity but genuinely in motion. To address these limitations, we propose to exploit event cameras for better video understanding, which provide rich motion cues without relying on optical flow. To foster research in this area, we first introduce a novel large-scale dataset called DSEC-MOS for moving object segmentation from moving ego vehicles, which is the first of its kind. For benchmarking, we select various mainstream methods and rigorously evaluate them on our dataset. Subsequently, we devise EmoFormer, a novel network able to exploit the event data. For this purpose, we fuse the event temporal prior with spatial semantic maps to distinguish genuinely moving objects from the static background, adding another level of dense supervision around our object of interest. Our proposed network relies only on event data for training but does not require event input during inference, making it directly comparable to frame-only methods in terms of efficiency and more widely usable in many application cases. The exhaustive comparison highlights a significant performance improvement of our method over all other methods. The source code and dataset are publicly available at: https://github.com/ZZY-Zhou/DSEC-MOS.
CVSep 17, 2022Code
RGB-Event Fusion for Moving Object Detection in Autonomous DrivingZhuyun Zhou, Zongwei Wu, Rémi Boutteau et al.
Moving Object Detection (MOD) is a critical vision task for successfully achieving safe autonomous driving. Despite plausible results of deep learning methods, most existing approaches are only frame-based and may fail to reach reasonable performance when dealing with dynamic traffic participants. Recent advances in sensor technologies, especially the Event camera, can naturally complement the conventional camera approach to better model moving objects. However, event-based works often adopt a pre-defined time window for event representation, and simply integrate it to estimate image intensities from events, neglecting much of the rich temporal information from the available asynchronous events. Therefore, from a new perspective, we propose RENet, a novel RGB-Event fusion Network, that jointly exploits the two complementary modalities to achieve more robust MOD under challenging scenarios for autonomous driving. Specifically, we first design a temporal multi-scale aggregation module to fully leverage event frames from both the RGB exposure time and larger intervals. Then we introduce a bi-directional fusion module to attentively calibrate and fuse multi-modal features. To evaluate the performance of our network, we carefully select and annotate a sub-MOD dataset from the commonly used DSEC dataset. Extensive experiments demonstrate that our proposed method performs significantly better than the state-of-the-art RGB-Event fusion alternatives. The source code and dataset are publicly available at: https://github.com/ZZY-Zhou/RENet.
CVDec 10, 2022
Source-free Depth for Object Pop-outZongwei Wu, Danda Pani Paudel, Deng-Ping Fan et al.
Depth cues are known to be useful for visual perception. However, direct measurement of depth is often impracticable. Fortunately, though, modern learning-based methods offer promising depth maps by inference in the wild. In this work, we adapt such depth inference models for object segmentation using the objects' "pop-out" prior in 3D. The "pop-out" is a simple composition prior that assumes objects reside on the background surface. Such compositional prior allows us to reason about objects in the 3D space. More specifically, we adapt the inferred depth maps such that objects can be localized using only 3D information. Such separation, however, requires knowledge about contact surface which we learn using the weak supervision of the segmentation mask. Our intermediate representation of contact surface, and thereby reasoning about objects purely in 3D, allows us to better transfer the depth knowledge into semantics. The proposed adaptation method uses only the depth model without needing the source data used for training, making the learning process efficient and practical. Our experiments on eight datasets of two challenging tasks, namely camouflaged object detection and salient object detection, consistently demonstrate the benefit of our method in terms of both performance and generalizability.
CVApr 12
NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and ResultsXin Li, Yeying Jin, Suhang Yao et al.
This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset~\cite{jin2024raindrop}. For this edition, we adjust the dataset with 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams have registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.
CVApr 8
NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and ResultsWenbin Zou, Tianyi Li, Kejun Wu et al.
This paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR). The challenge aims to advance research on recovering visually coherent videos from corrupted bitstreams, whose decoding often produces severe spatial-temporal artifacts and content distortion. Built upon recent progress in bitstream-corrupted video recovery, the challenge provides a common benchmark for evaluating restoration methods under realistic corruption settings. We describe the dataset, evaluation protocol, and participating methods, and summarize the final results and main technical trends. The challenge highlights the difficulty of this emerging task and provides useful insights for future research on robust video restoration under practical bitstream corruption.
CVJan 18, 2023
HiDAnet: RGB-D Salient Object Detection via Hierarchical Depth AwarenessZongwei Wu, Guillaume Allibert, Fabrice Meriaudeau et al.
RGB-D saliency detection aims to fuse multi-modal cues to accurately localize salient regions. Existing works often adopt attention modules for feature modeling, with few methods explicitly leveraging fine-grained details to merge with semantic cues. Thus, despite the auxiliary depth information, it is still challenging for existing models to distinguish objects with similar appearances but at distinct camera distances. In this paper, from a new perspective, we propose a novel Hierarchical Depth Awareness network (HiDAnet) for RGB-D saliency detection. Our motivation comes from the observation that the multi-granularity properties of geometric priors correlate well with the neural network hierarchies. To realize multi-modal and multi-level fusion, we first use a granularity-based attention scheme to strengthen the discriminatory power of RGB and depth features separately. Then we introduce a unified cross dual-attention module for multi-modal and multi-level fusion in a coarse-to-fine manner. The encoded multi-modal features are gradually aggregated into a shared decoder. Further, we exploit a multi-scale loss to take full advantage of the hierarchical information. Extensive experiments on challenging benchmark datasets demonstrate that our HiDAnet performs favorably over the state-of-the-art methods by large margins.
CVAug 23, 2023Code
NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic VideosZiyu Yang, Sucheng Ren, Zongwei Wu et al.
Non-photorealistic videos are in demand with the wave of the metaverse, but lack of sufficient research studies. This work aims to take a step forward to understand how humans perceive non-photorealistic videos with eye fixation (\ie, saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill in the gap of missing a suitable dataset for this research line, we present NPF-200, the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. Our dataset has three characteristics: 1) it contains soundtracks that are essential according to vision and psychological studies; 2) it includes diverse semantic content and videos are of high-quality; 3) it has rich motions across and within videos. We conduct a series of analyses to gain deeper insights into this task and compare several state-of-the-art methods to explore the gap between natural images and non-photorealistic data. Additionally, as the human attention system tends to extract visual and audio features with different frequencies, we propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet, demonstrating the state-of-the-art performance of our task. The results uncover strengths and weaknesses of multi-modal network design and multi-domain training, opening up promising directions for future works. {Our dataset and code can be found at \url{https://github.com/Yangziyu/NPF200}}.
CVSep 14, 2023
Temporal-aware Hierarchical Mask Classification for Video Semantic SegmentationZhaochong An, Guolei Sun, Zongwei Wu et al.
Modern approaches have proved the huge potential of addressing semantic segmentation as a mask classification task which is widely used in instance-level segmentation. This paradigm trains models by assigning part of object queries to ground truths via conventional one-to-one matching. However, we observe that the popular video semantic segmentation (VSS) dataset has limited categories per video, meaning less than 10% of queries could be matched to receive meaningful gradient updates during VSS training. This inefficiency limits the full expressive potential of all queries.Thus, we present a novel solution THE-Mask for VSS, which introduces temporal-aware hierarchical object queries for the first time. Specifically, we propose to use a simple two-round matching mechanism to involve more queries matched with minimal cost during training while without any extra cost during inference. To support our more-to-one assignment, in terms of the matching results, we further design a hierarchical loss to train queries with their corresponding hierarchy of primary or secondary. Moreover, to effectively capture temporal information across frames, we propose a temporal aggregation decoder that fits seamlessly into the mask-classification paradigm for VSS. Utilizing temporal-sensitive multi-level queries, our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
CVMay 29
The Regularizing Power of Language-Training Deepfake DetectorsBenedikt Hopf, Zongwei Wu, Radu Timofte
Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.
CVApr 18
Adverse-to-the-eXtreme Panoptic Segmentation: URVIS 2026 Study and BenchmarkYiting Wang, Nolwenn Peyratout, Tim Brodermann et al.
This paper presents the report of the URVIS 2026 challenge on adverse-to-extreme panoptic segmentation. As the first challenge of its kind, it attracted 17 registered participants and 47 submissions, with 4 teams reaching the final phase. The challenge is based on the MUSES dataset, a multi-sensor benchmark for panoptic segmentation in adverse-to-extreme weather, including RGB frame camera, LiDAR, radar, and event camera data. Weighted Panoptic Quality (wPQ) is designed and adopted as the official ranking metric for fair evaluation across weather conditions. In this report, we summarise the challenge setting and benchmark results, analyse the performance of the submitted methods, and discuss current progress and remaining challenges for robust multimodal panoptic segmentation. Link: https://urvis-workshop.github.io/challenge-Muses.html
CVMar 18
Video Understanding: From Geometry and Semantics to Unified ModelsZhaochong An, Zirui Li, Mingqiao Ye et al. · cambridge
Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.
CVAug 2, 2022
Robust RGB-D Fusion for Saliency DetectionZongwei Wu, Shriarulmozhivarman Gobichettipalayam, Brahim Tamadazte et al.
Efficiently exploiting multi-modal inputs for accurate RGB-D saliency detection is a topic of high interest. Most existing works leverage cross-modal interactions to fuse the two streams of RGB-D for intermediate features' enhancement. In this process, a practical aspect of the low quality of the available depths has not been fully considered yet. In this work, we aim for RGB-D saliency detection that is robust to the low-quality depths which primarily appear in two forms: inaccuracy due to noise and the misalignment to RGB. To this end, we propose a robust RGB-D fusion method that benefits from (1) layer-wise, and (2) trident spatial, attention mechanisms. On the one hand, layer-wise attention (LWA) learns the trade-off between early and late fusion of RGB and depth features, depending upon the depth accuracy. On the other hand, trident spatial attention (TSA) aggregates the features from a wider spatial context to address the depth misalignment problem. The proposed LWA and TSA mechanisms allow us to efficiently exploit the multi-modal inputs for saliency detection while being robust against low-quality depths. Our experiments on five benchmark datasets demonstrate that the proposed fusion method performs consistently better than the state-of-the-art fusion alternatives.
CVSep 28, 2024
Steering Prediction via a Multi-Sensor System for Autonomous RacingZhuyun Zhou, Zongwei Wu, Florian Bolli et al.
Autonomous racing has rapidly gained research attention. Traditionally, racing cars rely on 2D LiDAR as their primary visual system. In this work, we explore the integration of an event camera with the existing system to provide enhanced temporal information. Our goal is to fuse the 2D LiDAR data with event data in an end-to-end learning framework for steering prediction, which is crucial for autonomous racing. To the best of our knowledge, this is the first study addressing this challenging research topic. We start by creating a multisensor dataset specifically for steering prediction. Using this dataset, we establish a benchmark by evaluating various SOTA fusion methods. Our observations reveal that existing methods often incur substantial computational costs. To address this, we apply low-rank techniques to propose a novel, efficient, and effective fusion design. We introduce a new fusion learning policy to guide the fusion process, enhancing robustness against misalignment. Our fusion architecture provides better steering prediction than LiDAR alone, significantly reducing the RMSE from 7.72 to 1.28. Compared to the second-best fusion method, our work represents only 11% of the learnable parameters while achieving better accuracy. The source code, dataset, and benchmark will be released to promote future research.
CVJul 18, 2024
Restore Anything Model via Efficient Degradation AdaptationBin Ren, Eduard Zamfir, Zongwei Wu et al.
With the proliferation of mobile devices, the need for an efficient model to restore any degraded image has become increasingly significant and impactful. Traditional approaches typically involve training dedicated models for each specific degradation, resulting in inefficiency and redundancy. More recent solutions either introduce additional modules to learn visual prompts significantly increasing model size or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed RAM, takes a unified path that leverages inherent similarities across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism without scaling up the model or relying on large multimodal models. Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them in a gated manner. This intrinsic degradation awareness is further combined with contextualized attention in an X-shaped framework, enhancing local-global interactions. Extensive benchmarking in an all-in-one restoration setting confirms RAM's SOTA performance, reducing model complexity by approximately 82% in trainable parameters and 85% in FLOPs. Our code and models will be publicly available.
CVMay 6
The First Controllable Bokeh Rendering Challenge at NTIRE 2026Tim Seizinger, Florin-Alexandru Vasluianu, Jeffrey Chen et al.
This study presents the outcomes of the first Controllable Bokeh Rendering Challenge at NTIRE and highlights the most effective submitted methodologies. In total, 44 participants registered for the competition, of which 8 teams submitted valid solutions after the conclusion of the final test phase. All submissions were evaluated on unseen images, focusing on portraits and intricate subjects with complex and visually appealing bokeh phenomena. In addition to the first track focusing on established quantitative fidelity metrics, we conducted a qualitative user study with a panel of experts for a second track focusing on perceptual assessment. As this was the inaugural challenge on this topic, most of the participants focused on refining and extending the Bokehlicious baseline method.
CVJun 8, 2022
Depth-Adapted CNNs for RGB-D Semantic SegmentationZongwei Wu, Guillaume Allibert, Christophe Stolz et al.
Recent RGB-D semantic segmentation has motivated research interest thanks to the accessibility of complementary modalities from the input side. Existing works often adopt a two-stream architecture that processes photometric and geometric information in parallel, with few methods explicitly leveraging the contribution of depth cues to adjust the sampling position on RGB images. In this paper, we propose a novel framework to incorporate the depth information in the RGB convolutional neural network (CNN), termed Z-ACN (Depth-Adapted CNN). Specifically, our Z-ACN generates a 2D depth-adapted offset which is fully constrained by low-level features to guide the feature extraction on RGB images. With the generated offset, we introduce two intuitive and effective operations to replace basic CNN operators: depth-adapted convolution and depth-adapted average pooling. Extensive experiments on both indoor and outdoor semantic segmentation tasks demonstrate the effectiveness of our approach.
IVFeb 5, 2024Code
See More Details: Efficient Image Super-Resolution by Experts MiningEduard Zamfir, Zongwei Wu, Nancy Mehta et al.
Reconstructing high-resolution (HR) images from low-resolution (LR) inputs poses a significant challenge in image super-resolution (SR). While recent approaches have demonstrated the efficacy of intricate operations customized for various objectives, the straightforward stacking of these disparate operations can result in a substantial computational burden, hampering their practical utility. In response, we introduce SeemoRe, an efficient SR model employing expert mining. Our approach strategically incorporates experts at different levels, adopting a collaborative methodology. At the macro scale, our experts address rank-wise and spatial-wise informative features, providing a holistic understanding. Subsequently, the model delves into the subtleties of rank choice by leveraging a mixture of low-rank experts. By tapping into experts specialized in distinct key factors crucial for accurate SR, our model excels in uncovering intricate intra-feature details. This collaborative approach is reminiscent of the concept of "see more", allowing our model to achieve an optimal performance with minimal computational costs in efficient settings. The source will be publicly made available at https://github.com/eduardzamfir/seemoredetails
CVMar 1, 2024Code
Rethinking Few-shot 3D Point Cloud Semantic SegmentationZhaochong An, Guolei Sun, Yun Liu et al.
This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS), with a focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution. The former arises from non-uniform point sampling, allowing models to distinguish the density disparities between foreground and background for easier segmentation. The latter results from sampling only 2,048 points, limiting semantic information and deviating from the real-world practice. To address these issues, we introduce a standardized FS-PCS setting, upon which a new benchmark is built. Moreover, we propose a novel FS-PCS model. While previous methods are based on feature optimization by mainly refining support features to enhance prototypes, our method is based on correlation optimization, referred to as Correlation Optimization Segmentation (COSeg). Specifically, we compute Class-specific Multi-prototypical Correlation (CMC) for each query point, representing its correlations to category prototypes. Then, we propose the Hyper Correlation Augmentation (HCA) module to enhance CMC. Furthermore, tackling the inherent property of few-shot training to incur base susceptibility for models, we propose to learn non-parametric prototypes for the base classes during training. The learned base prototypes are used to calibrate correlations for the background class through a Base Prototypes Calibration (BPC) module. Experiments on popular datasets demonstrate the superiority of COSeg over existing methods. The code is available at: https://github.com/ZhaochongAn/COSeg
CVMar 27, 2024Code
Towards Image Ambient Lighting NormalizationFlorin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu et al.
Lighting normalization is a crucial but underexplored restoration task with broad applications. However, existing works often simplify this task within the context of shadow removal, limiting the light sources to one and oversimplifying the scene, thus excluding complex self-shadows and restricting surface classes to smooth ones. Although promising, such simplifications hinder generalizability to more realistic settings encountered in daily use. In this paper, we propose a new challenging task termed Ambient Lighting Normalization (ALN), which enables the study of interactions between shadows, unifying image restoration and shadow removal in a broader context. To address the lack of appropriate datasets for ALN, we introduce the large-scale high-resolution dataset Ambient6K, comprising samples obtained from multiple light sources and including self-shadows resulting from complex geometries, which is the first of its kind. For benchmarking, we select various mainstream methods and rigorously evaluate them on Ambient6K. Additionally, we propose IFBlend, a novel strong baseline that maximizes Image-Frequency joint entropy to selectively restore local areas under different lighting conditions, without relying on shadow localization priors. Experiments show that IFBlend achieves SOTA scores on Ambient6K and exhibits competitive performance on conventional shadow removal benchmarks compared to shadow-specific models with mask priors. The dataset, benchmark, and code are available at https://github.com/fvasluianu97/IFBlend.
CVMar 20, 2025Code
Bokehlicious: Photorealistic Bokeh Rendering with Controllable AperturesTim Seizinger, Florin-Alexandru Vasluianu, Marcos V. Conde et al.
Bokeh rendering methods play a key role in creating the visually appealing, softly blurred backgrounds seen in professional photography. While recent learning-based approaches show promising results, generating realistic Bokeh with variable strength remains challenging. Existing methods require additional inputs and suffer from unrealistic Bokeh reproduction due to reliance on synthetic data. In this work, we propose Bokehlicious, a highly efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, mimicking the physical lens aperture. To further address the lack of high-quality real-world data, we present RealBokeh, a novel dataset featuring 23,000 high-resolution (24-MP) images captured by professional photographers, covering diverse scenes with varied aperture and focal length settings. Evaluations on both our new RealBokeh and established Bokeh rendering benchmarks show that Bokehlicious consistently outperforms SOTA methods while significantly reducing computational cost and exhibiting strong zero-shot generalization. Our method and dataset further extend to defocus deblurring, achieving competitive results on the RealDOF benchmark. Our code and data can be found at https://github.com/TimSeizinger/Bokehlicious
CVApr 19, 2025Code
Any Image Restoration via Efficient Spatial-Frequency Degradation AdaptationBin Ren, Eduard Zamfir, Zongwei Wu et al.
Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing model size, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models.Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To fuse the intrinsic degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed for enhancing spatial-aware local-global interactions and enriching the restoration details from the frequency perspective. Extensive benchmarking in the all-in-one restoration setting confirms AnyIR's SOTA performance, reducing model complexity by around 82\% in parameters and 85\% in FLOPs. Our code will be available at our Project page (https://amazingren.github.io/AnyIR/)
CVAug 4, 2025Code
After the Party: Navigating the Mapping From Color to Ambient LightingFlorin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu et al.
Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts, such as illumination inconsistencies, texture leakage, and color distortion, primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity-luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. The benchmark, codes, and models are available at www.github.com/fvasluianu97/RLN2.
CVJul 8, 2025Code
What You Have is What You Track: Adaptive and Robust Multimodal TrackingYuedong Tan, Jiawei Shao, Eduard Zamfir et al.
Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.
CVMay 17, 2023Code
Object Segmentation by Mining Cross-Modal SemanticsZongwei Wu, Jingjing Wang, Zhuyun Zhou et al.
Multi-sensor clues have shown promise for object segmentation, but inherent noise in each sensor, as well as the calibration error in practice, may bias the segmentation accuracy. In this paper, we propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features, with the aim of controlling the modal contribution based on relative entropy. We explore semantics among the multimodal inputs in two aspects: the modality-shared consistency and the modality-specific variation. Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision. On the one hand, the AF block explicitly dissociates the shared and specific representation and learns to weight the modal contribution by adjusting the \textit{proportion, region,} and \textit{pattern}, depending upon the quality. On the other hand, our CFD initially decodes the shared feature and then refines the output through specificity-aware querying. Further, we enforce semantic consistency across the decoding layers to enable interaction across network hierarchies, improving feature discriminability. Exhaustive comparison on eleven datasets with depth or thermal clues, and on two challenging tasks, namely salient and camouflage object segmentation, validate our effectiveness in terms of both performance and robustness. The source code is publicly available at https://github.com/Zongwei97/XMSNet.
CVApr 15, 2024
NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and ResultsZheng Chen, Zongwei Wu, Eduard Zamfir et al.
This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.
CVApr 3
The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge ReportBin Ren, Hang Guo, Yan Shu et al.
This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of this challenge is to devise a network that reduces one or several aspects, such as runtime, parameters, and FLOPs, while maintaining PSNR of around 26.90 dB on the DIV2K_LSDIR_valid dataset, and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge had 95 registered participants, and 15 teams made valid submissions. They gauge the state-of-the-art results for efficient single-image super-resolution.
CVApr 22, 2024
NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and ResultsXiaoning Liu, Zongwei Wu, Ao Li et al.
This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlighting, extreme darkness, and night scenes. A notable total of 428 participants registered for the challenge, with 22 teams ultimately making valid submissions. This paper meticulously evaluates the state-of-the-art advancements in enhancing low-light images, reflecting the significant progress and creativity in this field.
CVMay 4
NTIRE 2026 Challenge on Efficient Low Light Image Enhancement: Methods and ResultsJiebin Yan, Chenyu Tu, Weixia Zhang et al.
This paper presents a comprehensive review of the NITRE 2026 Efficient Low Light Image Enhancement (E-LLIE) Challenge, highlighting the proposed solutions and final outcomes. This challenge focuses on mobile image enhancement under low-light conditions, aiming to design lightweight networks that improve enhancement quality while ensuring practical deployability under limited computational resources. A total of 207 participants registered, 27 teams submitted valid entries, and 17 teams ultimately provided valid factsheet. Based on these submissions, this paper provides a systematic evaluation of recent methods for E-LLIE, offering a comprehensive overview of state-of-the-art progress and demonstrating significant improvements in both performance and efficiency.
CVApr 14, 2025
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge ReportBin Ren, Hang Guo, Lei Sun et al.
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
CVMay 25, 2025
NTIRE 2025 Challenge on Video Quality Enhancement for Video Conferencing: Datasets, Methods and ResultsVarun Jain, Zongwei Wu, Quan Zou et al.
This paper presents a comprehensive review of the 1st Challenge on Video Quality Enhancement for Video Conferencing held at the NTIRE workshop at CVPR 2025, and highlights the problem statement, datasets, proposed solutions, and results. The aim of this challenge was to design a Video Quality Enhancement (VQE) model to enhance video quality in video conferencing scenarios by (a) improving lighting, (b) enhancing colors, (c) reducing noise, and (d) enhancing sharpness - giving a professional studio-like effect. Participants were given a differentiable Video Quality Assessment (VQA) model, training, and test videos. A total of 91 participants registered for the challenge. We received 10 valid submissions that were evaluated in a crowdsourced framework.
CVOct 15, 2025
NTIRE 2025 Challenge on Low Light Image Enhancement: Methods and ResultsXiaoning Liu, Zongwei Wu, Florin-Alexandru Vasluianu et al.
This paper presents a comprehensive review of the NTIRE 2025 Low-Light Image Enhancement (LLIE) Challenge, highlighting the proposed solutions and final outcomes. The objective of the challenge is to identify effective networks capable of producing brighter, clearer, and visually compelling images under diverse and challenging conditions. A remarkable total of 762 participants registered for the competition, with 28 teams ultimately submitting valid entries. This paper thoroughly evaluates the state-of-the-art advancements in LLIE, showcasing the significant progress.
CVJun 18, 2025
NTIRE 2025 Image Shadow Removal Challenge ReportFlorin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou et al.
This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
CVApr 17, 2025
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and ResultsXin Li, Yeying Jin, Xin Jin et al.
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
CVApr 17, 2024
Event-Based Eye Tracking. AIS 2024 Challenge SurveyZuowen Wang, Chang Gao, Zongwei Wu et al.
This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The task of the challenge focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve good task accuracy and efficiency trade-off. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research.
CVApr 20, 2025
NTIRE 2025 Challenge on Real-World Face Restoration: Methods and ResultsZheng Chen, Jingkai Wang, Kai Liu et al.
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
CVNov 27, 2024
Complexity Experts are Task-Discriminative Learners for Any Image RestorationEduard Zamfir, Zongwei Wu, Nancy Mehta et al.
Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This hinders leveraging MoEs' computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce ``complexity experts" -- flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source code and models are publicly available at \href{https://eduardzamfir.github.io/moceir/}{\texttt{eduardzamfir.github.io/MoCE-IR/}}
CVMay 24, 2024
Efficient Degradation-aware Any Image RestorationEduard Zamfir, Zongwei Wu, Nancy Mehta et al.
Reconstructing missing details from degraded low-quality inputs poses a significant challenge. Recent progress in image restoration has demonstrated the efficacy of learning large models capable of addressing various degradations simultaneously. Nonetheless, these approaches introduce considerable computational overhead and complex learning paradigms, limiting their practical utility. In response, we propose \textit{DaAIR}, an efficient All-in-One image restorer employing a Degradation-aware Learner (DaLe) in the low-rank regime to collaboratively mine shared aspects and subtle nuances across diverse degradations, generating a degradation-aware embedding. By dynamically allocating model capacity to input degradations, we realize an efficient restorer integrating holistic and specific learning within a unified model. Furthermore, DaAIR introduces a cost-efficient parameter update mechanism that enhances degradation awareness while maintaining computational efficiency. Extensive comparisons across five image degradations demonstrate that our DaAIR outperforms both state-of-the-art All-in-One models and degradation-specific counterparts, affirming our efficacy and practicality. The source will be publicly made available at https://eduardzamfir.github.io/daair/
CVApr 25, 2025
Event-Based Eye Tracking. 2025 Event-based Vision WorkshopQinyu Chen, Chang Gao, Min Liu et al.
This survey serves as a review for the 2025 Event-Based Eye Tracking Challenge organized as part of the 2025 CVPR event-based vision workshop. This challenge focuses on the task of predicting the pupil center by processing event camera recorded eye movement. We review and summarize the innovative methods from teams rank the top in the challenge to advance future event-based eye tracking research. In each method, accuracy, model size, and number of operations are reported. In this survey, we also discuss event-based eye tracking from the perspective of hardware design.
CVDec 10, 2024
ReCap: Better Gaussian Relighting with Cross-Environment CapturesJingzhi Li, Zongwei Wu, Eduard Zamfir et al.
Accurate 3D objects relighting in diverse unseen environments is crucial for realistic virtual object placement. Due to the albedo-lighting ambiguity, existing methods often fall short in producing faithful relights. Without proper constraints, observed training views can be explained by numerous combinations of lighting and material attributes, lacking physical correspondence with the actual environment maps used for relighting. In this work, we present ReCap, treating cross-environment captures as multi-task target to provide the missing supervision that cuts through the entanglement. Specifically, ReCap jointly optimizes multiple lighting representations that share a common set of material attributes. This naturally harmonizes a coherent set of lighting representations around the mutual material attributes, exploiting commonalities and differences across varied object appearances. Such coherence enables physically sound lighting reconstruction and robust material estimation - both essential for accurate relighting. Together with a streamlined shading function and effective post-processing, ReCap outperforms all leading competitors on an expanded relighting benchmark.
CVMar 15, 2024
PASTA: Towards Flexible and Efficient HDR Imaging Via Progressively Aggregated Spatio-Temporal AlignmentXiaoning Liu, Ao Li, Zongwei Wu et al.
Leveraging Transformer attention has led to great advancements in HDR deghosting. However, the intricate nature of self-attention introduces practical challenges, as existing state-of-the-art methods often demand high-end GPUs or exhibit slow inference speeds, especially for high-resolution images like 2K. Striking an optimal balance between performance and latency remains a critical concern. In response, this work presents PASTA, a novel Progressively Aggregated Spatio-Temporal Alignment framework for HDR deghosting. Our approach achieves effectiveness and efficiency by harnessing hierarchical representation during feature distanglement. Through the utilization of diverse granularities within the hierarchical structure, our method substantially boosts computational speed and optimizes the HDR imaging workflow. In addition, we explore within-scale feature modeling with local and global attention, gradually merging and refining them in a coarse-to-fine fashion. Experimental results showcase PASTA's superiority over current SOTA methods in both visual quality and performance metrics, accompanied by a substantial 3-fold (x3) increase in inference speed.
CVApr 30, 2024
MIPI 2024 Challenge on Nighttime Flare Removal: Methods and ResultsYuekun Dai, Dafeng Zhang, Xiaoming Li et al.
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.
CVJan 31, 2025
ContextFormer: Redefining Efficiency in Semantic SegmentationMian Muhammad Naeem Abid, Nancy Mehta, Zongwei Wu et al.
Semantic segmentation assigns labels to pixels in images, a critical yet challenging task in computer vision. Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands, especially for high-resolution inputs. Most research optimizes the encoder architecture, leaving the bottleneck underexplored - a key area for enhancing performance and efficiency. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation. The framework's efficiency is driven by three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi-scale representation, the Transformer and Branched DepthwiseConv (Trans-BDC) block for dynamic scale-aware feature modeling, and the Feature Merging Module (FMM) for robust integration with enhanced spatial and contextual consistency. Extensive experiments on ADE20K, Pascal Context, CityScapes, and COCO-Stuff datasets show ContextFormer significantly outperforms existing models, achieving state-of-the-art mIoU scores, setting a new benchmark for efficiency and performance. The codes will be made publicly available upon acceptance.
CVAug 11, 2025
DiTVR: Zero-Shot Diffusion Transformer for Video RestorationSicheng Gao, Nancy Mehta, Zongwei Wu et al.
Video restoration aims to reconstruct high quality video sequences from low quality inputs, addressing tasks such as super resolution, denoising, and deblurring. Traditional regression based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero shot video restoration framework that couples a diffusion transformer with trajectory aware attention and a wavelet guided, flow consistent sampler. Unlike prior 3D convolutional or frame wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow guided sampler injects data consistency only into low-frequency bands, preserving high frequency priors while accelerating convergence. DiTVR establishes a new zero shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.
CVApr 20, 2025
NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and ResultsZheng Chen, Kai Liu, Jue Gong et al.
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
IVDec 5, 2024
INRetouch: Context Aware Implicit Neural Representation for Photography RetouchingOmar Elezabi, Marcos V. Conde, Zongwei Wu et al.
Professional photo editing remains challenging, requiring extensive knowledge of imaging pipelines and significant expertise. While recent deep learning approaches, particularly style transfer methods, have attempted to automate this process, they often struggle with output fidelity, editing control, and complex retouching capabilities. We propose a novel retouch transfer approach that learns from professional edits through before-after image pairs, enabling precise replication of complex editing operations. We develop a context-aware Implicit Neural Representation that learns to apply edits adaptively based on image content and context, and is capable of learning from a single example. Our method extracts implicit transformations from reference edits and adaptively applies them to new images. To facilitate this research direction, we introduce a comprehensive Photo Retouching Dataset comprising 100,000 high-quality images edited using over 170 professional Adobe Lightroom presets. Through extensive evaluation, we demonstrate that our approach not only surpasses existing methods in photo retouching but also enhances performance in related image reconstruction tasks like Gamut Mapping and Raw Reconstruction. By bridging the gap between professional editing capabilities and automated solutions, our work presents a significant step toward making sophisticated photo editing more accessible while maintaining high-fidelity results. Check the Project Page at https://omaralezaby.github.io/inretouch for more Results and information about Code and Dataset availability.
CVNov 27, 2025
Layover or Direct Flight: Rethinking Audio-Guided Image SegmentationJoel Alberto Santos, Zongwei Wu, Xavier Alameda-Pineda et al.
Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.
CVSep 8, 2025
MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of RestorationGeorge Ciubotariu, Zhuyun Zhou, Zongwei Wu et al.
We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks. Designed with high-frame-rate (1000 FPS) acquisition and professional-grade optics, our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects. By adaptively averaging frames based on computed optical flow metrics, MIORe generates consistent motion blur, and preserves sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark to offer explicit control over motion amplitude. We provide high-resolution, scalable ground truths that challenge existing algorithms under both controlled and adverse conditions, paving the way for next-generation research of various image and video restoration tasks.