CV · Apr 4, 2023 · Code
DCANet: Dual Convolutional Neural Network with Attention for Image Blind Denoising
Wencong Wu, Guannan Lv, Yingying Duan et al.
Noise removal is an essential preprocessing step for many computer vision tasks. Currently, many denoising models based on deep neural networks perform well at removing noise with a known distribution (e.g., additive white Gaussian noise). However, eliminating real noise is still a very challenging task, since real-world noise often does not follow a single type of distribution and may vary spatially. In this paper, we present a new dual convolutional neural network (CNN) with attention for image blind denoising, named DCANet. To the best of our knowledge, the proposed DCANet is the first work that integrates both a dual CNN and an attention mechanism for image denoising. The DCANet is composed of a noise estimation network, a spatial and channel attention module (SCAM), and a CNN with a dual structure. The noise estimation network is used to estimate the spatial distribution and the level of noise in an image. The noisy image and its estimated noise are combined as the input of the SCAM, and a dual CNN containing two different branches is designed to learn complementary features and produce the denoised image. Experimental results verify that the proposed DCANet can suppress both synthetic and real noise effectively. The code of DCANet is available at https://github.com/WenCongWu/DCANet.
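The input construction described in this abstract can be illustrated with a small sketch. It is not taken from the DCANet repository: the helper names `add_spatially_variant_noise` and `stack_channels` are hypothetical, the known noise level map stands in for the output of a noise estimation network, and "combined" is assumed here to mean channel-wise concatenation, which the abstract does not state.

```python
import random

def add_spatially_variant_noise(image, sigma_map, seed=0):
    """Add zero-mean Gaussian noise whose standard deviation differs per pixel."""
    rng = random.Random(seed)
    return [
        [pixel + rng.gauss(0.0, sigma) for pixel, sigma in zip(img_row, sig_row)]
        for img_row, sig_row in zip(image, sigma_map)
    ]

def stack_channels(noisy, noise_map):
    """Assumed combination step: channel-wise concatenation, giving [2][H][W]."""
    return [noisy, noise_map]

height, width = 4, 6
clean = [[0.5] * width for _ in range(height)]

# Hypothetical noise level map: sigma grows from left to right,
# and the first column is noise-free (sigma = 0).
sigma_map = [[0.05 * col for col in range(width)] for _ in range(height)]

noisy = add_spatially_variant_noise(clean, sigma_map)

# In a real model the second channel would be the *estimated* map;
# here the ground-truth map is used as a stand-in.
net_input = stack_channels(noisy, sigma_map)
print(len(net_input), len(net_input[0]), len(net_input[0][0]))
```

The point of the second channel is that downstream layers can see where the noise is strong and where it is weak, which a single global noise level cannot express.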
24.9 · CV · May 27
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
Guannan Lv, Ren Nie, Hongjian Dou
Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.
CV · Apr 4, 2023 · Code
Image Blind Denoising Using Dual Convolutional Neural Network with Skip Connection
Wencong Wu, Shicheng Liao, Guannan Lv et al.
In recent years, deep convolutional neural networks have shown impressive performance in image denoising. However, deeper network architectures are often accompanied by large numbers of model parameters, leading to high training cost and long inference time, which limits their application in practical denoising tasks. In this paper, we propose a novel dual convolutional blind denoising network with skip connection (DCBDNet), which achieves a desirable balance between denoising effect and network complexity. The proposed DCBDNet consists of a noise estimation network and a dual convolutional neural network (CNN). The noise estimation network is used to estimate the noise level map, which improves the flexibility of the proposed model. The dual CNN contains two branches: a U-shaped sub-network is designed for the upper branch, and the lower branch is composed of dilated convolution layers. Skip connections between layers are used in both the upper and lower branches. The proposed DCBDNet was evaluated on several synthetic and real-world image denoising benchmark datasets. Experimental results demonstrate that the proposed DCBDNet can effectively remove Gaussian noise over a wide range of noise levels, spatially variant noise, and real noise. With a simple model structure, the proposed DCBDNet can still obtain competitive denoising performance compared to state-of-the-art image denoising models with complex architectures. That is, a favorable trade-off between denoising performance and model complexity is achieved. Code is available at https://github.com/WenCongWu/DCBDNet.
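One reason a branch of dilated convolutions suits a lightweight design is that dilation enlarges the receptive field without adding parameters. The sketch below computes the receptive field of a stack of stride-1 convolutions; the function name `receptive_field` and the dilation schedules are illustrative assumptions, not the configuration used in DCBDNet.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    Each layer with kernel size k and dilation d extends the field by (k - 1) * d.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field

# Three 3x3 layers each: same parameter count, different coverage.
plain = receptive_field(3, [1, 1, 1])    # ordinary convolutions
dilated = receptive_field(3, [1, 2, 4])  # hypothetical dilation schedule
print(plain, dilated)  # 7 15
```

With the same three layers, the dilated stack covers a 15-pixel span instead of 7, which is the kind of trade-off between context and model size that the abstract emphasizes.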
AI · Mar 1 · Code
DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
Haowen Gao, Zhenyu Zhang, Liang Pang et al.
Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
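The advantage-vanishing problem mentioned in this abstract can be seen in a few lines. The sketch below shows the standard group-relative normalization used in GRPO-style training, not the DIVA-GRPO method itself; the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]

# A problem of moderate difficulty: some sampled responses succeed, some fail.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# An overly easy (or overly hard) problem: every response gets the same reward.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)        # positive for correct responses, negative for incorrect ones
print(all_correct)  # [0.0, 0.0, 0.0, 0.0]
print(all_wrong)    # [0.0, 0.0, 0.0, 0.0]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is the situation that difficulty-adaptive sampling is meant to avoid.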
CV · Jan 5, 2024 · Code
Two-stage Progressive Residual Dense Attention Network for Image Denoising
Wencong Wu, An Ge, Guannan Lv et al.
Deep convolutional neural networks (CNNs) for image denoising can effectively exploit rich hierarchical features and have achieved great success. However, many deep CNN-based denoising models use the hierarchical features of noisy images equally, without paying attention to the more important and useful features, leading to relatively low performance. To address this issue, we design a new Two-stage Progressive Residual Dense Attention Network (TSP-RDANet) for image denoising, which divides the whole denoising process into two sub-tasks to remove noise progressively. Two different attention mechanism-based denoising networks are designed for the two sequential sub-tasks: the residual dense attention module (RDAM) is designed for the first stage, and the hybrid dilated residual dense attention module (HDRDAM) is proposed for the second stage. The proposed attention modules are able to learn appropriate local features through dense connections between different convolutional layers, while suppressing irrelevant features. The two sub-networks are then connected by a long skip connection to retain shallow features and enhance the denoising performance. Experiments on seven benchmark datasets verify that, compared with many state-of-the-art methods, the proposed TSP-RDANet obtains favorable results on both synthetic and real noisy images. The code of our TSP-RDANet is available at https://github.com/WenCongWu/TSP-RDANet.
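The effect of dense connections on layer width can be sketched with simple channel bookkeeping. The example assumes DenseNet-style connectivity, where each layer receives the concatenation of the block input and all earlier layer outputs; the function names and the channel counts (64 input channels, growth rate 32, 4 layers) are hypothetical and are not taken from the TSP-RDANet code.

```python
def dense_block_input_channels(in_channels, growth_rate, num_layers):
    """Channels seen by each layer when all earlier outputs are concatenated."""
    return [in_channels + layer * growth_rate for layer in range(num_layers)]

def dense_block_output_channels(in_channels, growth_rate, num_layers):
    """Channels after concatenating the block input with every layer's output."""
    return in_channels + num_layers * growth_rate

per_layer = dense_block_input_channels(64, 32, 4)
total = dense_block_output_channels(64, 32, 4)
print(per_layer)  # [64, 96, 128, 160]
print(total)      # 192
```

Because later layers see every earlier feature map directly, an attention mechanism placed on top of such a block has a wide set of hierarchical features to reweight, which matches the motivation given in the abstract.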