Xiaodong Xie

h-index27

24papers

2,629citations

Novelty51%

AI Score46

Ranked #60,758 of 205,806 authors (top 30%)#22,032 in CV (top 37%)

24 Papers

MMDec 24, 2022Code

Towards Blind Watermarking: Combining Invertible and Non-invertible Mechanisms

Rui Ma, Mengxi Guo, Yi Hou et al.

Blind watermarking provides powerful evidence for copyright protection, image authentication, and tampering identification. However, it remains a challenge to design a watermarking model with high imperceptibility and robustness against strong noise attacks. To resolve this issue, we present a framework Combining the Invertible and Non-invertible (CIN) mechanisms. The CIN is composed of the invertible part to achieve high imperceptibility and the non-invertible part to strengthen the robustness against strong noise attacks. For the invertible part, we develop a diffusion and extraction module (DEM) and a fusion and split module (FSM) to embed and extract watermarks symmetrically in an invertible way. For the non-invertible part, we introduce a non-invertible attention-based module (NIAM) and the noise-specific selection module (NSM) to solve the asymmetric extraction under a strong noise attack. Extensive experiments demonstrate that our framework outperforms the current state-of-the-art methods of imperceptibility and robustness significantly. Our framework can achieve an average of 99.99% accuracy and 67.66 dB PSNR under noise-free conditions, while 96.64% and 39.28 dB combined strong noise attacks. The code will be available in https://github.com/rmpku/CIN.

CVApr 3, 2023

Open-Vocabulary Point-Cloud Object Detection without 3D Annotation

Yuheng Lu, Chenfeng Xu, Xiaobao Wei et al.

The goal of open-vocabulary detection is to identify novel objects based on arbitrary textual descriptions. In this paper, we address open-vocabulary 3D point-cloud detection by a dividing-and-conquering strategy, which involves: 1) developing a point-cloud detector that can learn a general representation for localizing various objects, and 2) connecting textual and point-cloud representations to enable the detector to classify novel object categories based on text prompting. Specifically, we resort to rich image pre-trained models, by which the point-cloud detector learns localizing objects under the supervision of predicted 2D bounding boxes from 2D pre-trained detectors. Moreover, we propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text, thereby enabling the point-cloud detector to benefit from vision-language pre-trained models,i.e.,CLIP. The novel use of image and vision-language pre-trained models for point-cloud detectors allows for open-vocabulary 3D object detection without the need for 3D annotations. Experiments demonstrate that the proposed method improves at least 3.03 points and 7.47 points over a wide range of baselines on the ScanNet and SUN RGB-D datasets, respectively. Furthermore, we provide a comprehensive analysis to explain why our approach works.

CVJul 5, 2022

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Yuheng Lu, Chenfeng Xu, Xiaobao Wei et al.

Current point-cloud detection methods have difficulty detecting the open-vocabulary objects in the real world, due to their limited generalization capability. Moreover, it is extremely laborious and expensive to collect and fully annotate a point-cloud detection dataset with numerous classes of objects, leading to the limited classes of existing point-cloud datasets and hindering the model to learn general representations to achieve open-vocabulary point-cloud detection. As far as we know, we are the first to study the problem of open-vocabulary 3D point-cloud detection. Instead of seeking a point-cloud dataset with full labels, we resort to ImageNet1K to broaden the vocabulary of the point-cloud detector. We propose OV-3DETIC, an Open-Vocabulary 3D DETector using Image-level Class supervision. Specifically, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo labels for unseen classes. Then we propose a novel debiased cross-modal contrastive learning method to transfer the knowledge from image modality to point-cloud modality during training. Without hurting the latency during inference, OV-3DETIC makes the point-cloud detector capable of achieving open-vocabulary detection. Extensive experiments demonstrate that the proposed OV-3DETIC achieves at least 10.77 % mAP improvement (absolute value) and 9.56 % mAP improvement (absolute value) by a wide range of baselines on the SUN-RGBD dataset and ScanNet dataset, respectively. Besides, we conduct sufficient experiments to shed light on why the proposed OV-3DETIC works.

LGOct 17, 2022

Multi-Agent Automated Machine Learning

Zhaozhi Wang, Kefan Su, Jian Zhang et al.

In this paper, we propose multi-agent automated machine learning (MA2ML) with the aim to effectively handle joint optimization of modules in automated machine learning (AutoML). MA2ML takes each machine learning module, such as data augmentation (AUG), neural architecture search (NAS), or hyper-parameters (HPO), as an agent and the final performance as the reward, to formulate a multi-agent reinforcement learning problem. MA2ML explicitly assigns credit to each agent according to its marginal contribution to enhance cooperation among modules, and incorporates off-policy learning to improve search efficiency. Theoretically, MA2ML guarantees monotonic improvement of joint optimization. Extensive experiments show that MA2ML yields the state-of-the-art top-1 accuracy on ImageNet under constraints of computational cost, e.g., $79.7\%/80.5\%$ with FLOPs fewer than 600M/800M. Extensive ablation studies verify the benefits of credit assignment and off-policy learning of MA2ML.

CVJul 1, 2023

PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers

Peidong Jia, Jiaming Liu, Senqiao Yang et al.

The Transformer-based detectors (i.e., DETR) have demonstrated impressive performance on end-to-end object detection. However, transferring DETR to different data distributions may lead to a significant performance degradation. Existing adaptation techniques focus on model-based approaches, which aim to leverage feature alignment to narrow the distribution shift between different domains. In this study, we propose a hierarchical Prompt Domain Memory (PDM) for adapting detection transformers to different distributions. PDM comprehensively leverages the prompt memory to extract domain-specific knowledge and explicitly constructs a long-term memory space for the data distribution, which represents better domain diversity compared to existing methods. Specifically, each prompt and its corresponding distribution value are paired in the memory space, and we inject top M distribution-similar prompts into the input and multi-level embeddings of DETR. Additionally, we introduce the Prompt Memory Alignment (PMA) to reduce the discrepancy between the source and target domains by fully leveraging the domain-specific knowledge extracted from the prompt domain memory. Extensive experiments demonstrate that our method outperforms state-of-the-art domain adaptive object detection methods on three benchmarks, including scene, synthetic to real, and weather adaptation. Codes will be released.

CVNov 28, 2023

COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design

Peidong Jia, Chenxuan Li, Yuhui Yuan et al.

Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands design-oriented planning, reasoning, and layer-wise generation. Unlike the recent CanvaGPT, which integrates GPT-4 with existing design templates to build a custom GPT, this paper introduces the COLE system - a hierarchical generation framework designed to comprehensively address these challenges. This COLE system can transform a vague intention prompt into a high-quality multi-layered graphic design, while also supporting flexible editing based on user input. Examples of such input might include directives like ``design a poster for Hisaishi's concert.'' The key insight is to dissect the complex task of text-to-design generation into a hierarchy of simpler sub-tasks, each addressed by specialized models working collaboratively. The results from these models are then consolidated to produce a cohesive final output. Our hierarchical task decomposition can streamline the complex process and significantly enhance generation reliability. Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, and the task of generating images and text. Furthermore, we construct the DESIGNINTENTION benchmark to demonstrate the superiority of our COLE system over existing methods in generating high-quality graphic designs from user intent. Last, we present a Canva-like multi-layered image editing tool to support flexible editing of the generated multi-layered graphic design images. We perceive our COLE system as an important step towards addressing more complex and multi-layered graphic design generation tasks in the future.

CVDec 22, 2023Code

FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

Dongmei Zhang, Chang Li, Ray Zhang et al.

The superior performances of pre-trained foundation models in various visual tasks underscore their potential to enhance the 2D models' open-vocabulary ability. Existing methods explore analogous applications in the 3D space. However, most of them only center around knowledge extraction from singular foundation models, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained visual language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary without facing constraints from original 3D datasets. Specifically, to learn the open-vocabulary 3D localization ability, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition ability, We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.

MMDec 18, 2025

A Tri-Dynamic Preprocessing Framework for UGC Video Compression

Fei Zhao, Mengxi Guo, Shijie Zhao et al.

In recent years, user generated content (UGC) has become the dominant force in internet traffic. However, UGC videos exhibit a higher degree of variability and diverse characteristics compared to traditional encoding test videos. This variance challenges the effectiveness of data-driven machine learning algorithms for optimizing encoding in the broader context of UGC scenarios. To address this issue, we propose a Tri-Dynamic Preprocessing framework for UGC. Firstly, we employ an adaptive factor to regulate preprocessing intensity. Secondly, an adaptive quantization level is employed to fine-tune the codec simulator. Thirdly, we utilize an adaptive lambda tradeoff to adjust the rate-distortion loss function. Experimental results on large-scale test sets demonstrate that our method attains exceptional performance.

MMDec 17, 2025

A Preprocessing Framework for Video Machine Vision under Compression

Fei Zhao, Mengxi Guo, Shijie Zhao et al.

There has been a growing trend in compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization method focus on minimizing distortion according to human perceptual metrics, overlooking the heightened demands posed by machine vision systems. In this paper, we propose a video preprocessing framework tailored for machine vision tasks to address this challenge. The proposed method incorporates a neural preprocessor which retaining crucial information for subsequent tasks, resulting in the boosting of rate-accuracy performance. We further introduce a differentiable virtual codec to provide constraints on rate and distortion during the training stage. We directly apply widely used standard codecs for testing. Therefore, our solution can be easily applied to real-world scenarios. We conducted extensive experiments evaluating our compression method on two typical downstream tasks with various backbone networks. The experimental results indicate that our approach can save over 15% of bitrate compared to using only the standard codec anchor version.

CVJan 4, 2024

A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models

Rui Ma, Qiang Zhou, Yizhu Jin et al.

Copyright law confers upon creators the exclusive rights to reproduce, distribute, and monetize their creative works. However, recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement. These technologies enable the unauthorized learning and replication of copyrighted content, artistic creations, and likenesses, leading to the proliferation of unregulated content. Notably, models like stable diffusion, which excel in text-to-image synthesis, heighten the risk of copyright infringement and unauthorized distribution.Machine unlearning, which seeks to eradicate the influence of specific data or concepts from machine learning models, emerges as a promising solution by eliminating the \enquote{copyright memories} ingrained in diffusion models. Yet, the absence of comprehensive large-scale datasets and standardized benchmarks for evaluating the efficacy of unlearning techniques in the copyright protection scenarios impedes the development of more effective unlearning methods. To address this gap, we introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset. This dataset encompasses anchor images, associated prompts, and images synthesized by text-to-image models. Additionally, we have developed a mixed metric based on semantic and style information, validated through both human and artist assessments, to gauge the effectiveness of unlearning approaches. Our dataset, benchmark library, and evaluation metrics will be made publicly available to foster future research and practical applications (https://rmpku.github.io/CPDM-page/, website / http://149.104.22.83/unlearning.tar.gz, dataset).

CVJan 22, 2022

Enhancing and Dissecting Crowd Counting By Synthetic Data

Yi Hou, Chengyang Li, Yuheng Lu et al.

In this article, we propose a simulated crowd counting dataset CrowdX, which has a large scale, accurate labeling, parameterized realization, and high fidelity. The experimental results of using this dataset as data enhancement show that the performance of the proposed streamlined and efficient benchmark network ESA-Net can be improved by 8.4\%. The other two classic heterogeneous architectures MCNN and CSRNet pre-trained on CrowdX also show significant performance improvements. Considering many influencing factors determine performance, such as background, camera angle, human density, and resolution. Although these factors are important, there is still a lack of research on how they affect crowd counting. Thanks to the CrowdX dataset with rich annotation information, we conduct a large number of data-driven comparative experiments to analyze these factors. Our research provides a reference for a deeper understanding of the crowd counting problem and puts forward some useful suggestions in the actual deployment of the algorithm.

CVJan 22, 2022

BBA-net: A bi-branch attention network for crowd counting

Yi Hou, Chengyang Li, Fan Yang et al.

In the field of crowd counting, the current mainstream CNN-based regression methods simply extract the density information of pedestrians without finding the position of each person. This makes the output of the network often found to contain incorrect responses, which may erroneously estimate the total number and not conducive to the interpretation of the algorithm. To this end, we propose a Bi-Branch Attention Network (BBA-NET) for crowd counting, which has three innovation points. i) A two-branch architecture is used to estimate the density information and location information separately. ii) Attention mechanism is used to facilitate feature extraction, which can reduce false responses. iii) A new density map generation method combining geometric adaptation and Voronoi split is introduced. Our method can integrate the pedestrian's head and body information to enhance the feature expression ability of the density map. Extensive experiments performed on two public datasets show that our method achieves a lower crowd counting error compared to other state-of-the-art methods.

HCOct 8, 2021

The Layout Generation Algorithm of Graphic Design Based on Transformer-CVAE

Mengxi Guo, Dangqing Huang, Xiaodong Xie

Graphic design is ubiquitous in people's daily lives. For graphic design, the most time-consuming task is laying out various components in the interface. Repetitive manual layout design will waste a lot of time for professional graphic designers. Existing templates are usually rudimentary and not suitable for most designs, reducing efficiency and limiting creativity. This paper implemented the Transformer model and conditional variational autoencoder (CVAE) to the graphic design layout generation task. It proposed an end-to-end graphic design layout generation model named LayoutT-CVAE. We also proposed element disentanglement and feature-based disentanglement strategies and introduce new graphic design principles and similarity metrics into the model, which significantly increased the controllability and interpretability of the deep model. Compared with the existing state-of-art models, the layout generated by ours performs better on many metrics.

CVMay 4, 2020

Correlating Edge, Pose with Parsing

Ziwei Zhang, Chi Su, Liang Zheng et al.

According to existing studies, human body edge and pose are two beneficial factors to human parsing. The effectiveness of each of the high-level features (edge and pose) is confirmed through the concatenation of their features with the parsing features. Driven by the insights, this paper studies how human semantic boundaries and keypoint locations can jointly improve human parsing. Compared with the existing practice of feature concatenation, we find that uncovering the correlation among the three factors is a superior way of leveraging the pivotal contextual cues provided by edges and poses. To capture such correlations, we propose a Correlation Parsing Machine (CorrPM) employing a heterogeneous non-local block to discover the spatial affinity among feature maps from the edge, pose and parsing. The proposed CorrPM allows us to report new state-of-the-art accuracy on three human parsing datasets. Importantly, comparative studies confirm the advantages of feature correlation over the concatenation.

CVNov 18, 2019

FFA-Net: Feature Fusion Attention Network for Single Image Dehazing

Xu Qin, Zhilin Wang, Yuanchao Bai et al.

In this paper, we propose an end-to-end feature fusion at-tention network (FFA-Net) to directly restore the haze-free image. The FFA-Net architecture consists of three key components: 1) A novel Feature Attention (FA) module combines Channel Attention with Pixel Attention mechanism, considering that different channel-wise features contain totally different weighted information and haze distribution is uneven on the different image pixels. FA treats different features and pixels unequally, which provides additional flexibility in dealing with different types of information, expanding the representational ability of CNNs. 2) A basic block structure consists of Local Residual Learning and Feature Attention, Local Residual Learning allowing the less important information such as thin haze region or low-frequency to be bypassed through multiple local residual connections, let main network architecture focus on more effective information. 3) An Attention-based different levels Feature Fusion (FFA) structure, the feature weights are adaptively learned from the Feature Attention (FA) module, giving more weight to important features. This structure can also retain the information of shallow layers and pass it into deep layers. The experimental results demonstrate that our proposed FFA-Net surpasses previous state-of-the-art single image dehazing methods by a very large margin both quantitatively and qualitatively, boosting the best published PSNR metric from 30.23db to 36.39db on the SOTS indoor test dataset. Code has been made available at GitHub.

CVJun 11, 2019

Single Image Blind Deblurring Using Multi-Scale Latent Structure Prior

Yuanchao Bai, Huizhu Jia, Ming Jiang et al.

Blind image deblurring is a challenging problem in computer vision, which aims to restore both the blur kernel and the latent sharp image from only a blurry observation. Inspired by the prevalent self-example prior in image super-resolution, in this paper, we observe that a coarse enough image down-sampled from a blurry observation is approximately a low-resolution version of the latent sharp image. We prove this phenomenon theoretically and define the coarse enough image as a latent structure prior of the unknown sharp image. Starting from this prior, we propose to restore sharp images from the coarsest scale to the finest scale on a blurry image pyramid, and progressively update the prior image using the newly restored sharp image. These coarse-to-fine priors are referred to as \textit{Multi-Scale Latent Structures} (MSLS). Leveraging the MSLS prior, our algorithm comprises two phases: 1) we first preliminarily restore sharp images in the coarse scales; 2) we then apply a refinement process in the finest scale to obtain the final deblurred image. In each scale, to achieve lower computational complexity, we alternately perform a sharp image reconstruction with fast local self-example matching, an accelerated kernel estimation with error compensation, and a fast non-blind image deblurring, instead of computing any computationally expensive non-convex priors. We further extend the proposed algorithm to solve more challenging non-uniform blind image deblurring problem. Extensive experiments demonstrate that our algorithm achieves competitive results against the state-of-the-art methods with much faster running speed.

CVOct 13, 2018

Attention Driven Person Re-identification

Fan Yang, Ke Yan, Shijian Lu et al.

Person re-identification (ReID) is a challenging task due to arbitrary human pose variations, background clutters, etc. It has been studied extensively in recent years, but the multifarious local and global features are still not fully exploited by either ignoring the interplay between whole-body images and body-part images or missing in-depth examination of specific body-part images. In this paper, we propose a novel attention-driven multi-branch network that learns robust and discriminative human representation from global whole-body images and local body-part images simultaneously. Within each branch, an intra-attention network is designed to search for informative and discriminative regions within the whole-body or body-part images, where attention is elegantly decomposed into spatial-wise attention and channel-wise attention for effective and efficient learning. In addition, a novel inter-attention module is designed which fuses the output of intra-attention networks adaptively for optimal person ReID. The proposed technique has been evaluated over three widely used datasets CUHK03, Market-1501 and DukeMTMC-ReID, and experiments demonstrate its superior robustness and effectiveness as compared with the state of the arts.

CVApr 12, 2018

Trajectory Factory: Tracklet Cleaving and Re-connection by Deep Siamese Bi-GRU for Multiple Object Tracking

Cong Ma, Changshui Yang, Fan Yang et al.

Multi-Object Tracking (MOT) is a challenging task in the complex scene such as surveillance and autonomous driving. In this paper, we propose a novel tracklet processing method to cleave and re-connect tracklets on crowd or long-term occlusion by Siamese Bi-Gated Recurrent Unit (GRU). The tracklet generation utilizes object features extracted by CNN and RNN to create the high-confidence tracklet candidates in sparse scenario. Due to mis-tracking in the generation process, the tracklets from different objects are split into several sub-tracklets by a bidirectional GRU. After that, a Siamese GRU based tracklet re-connection method is applied to link the sub-tracklets which belong to the same object to form a whole trajectory. In addition, we extract the tracklet images from existing MOT datasets and propose a novel dataset to train our networks. The proposed dataset contains more than 95160 pedestrian images. It has 793 different persons in it. On average, there are 120 images for each person with positions and sizes. Experimental results demonstrate the advantages of our model over the state-of-the-art methods on MOT16.

MMMar 15, 2018

Joint Rate Allocation with Both Look-ahead And Feedback Model For High Efficiency Video Coding

Hongfei Fan, Lin Ding, Xiaodong Xie et al.

The objective of joint rate allocation among multiple coded video streams is to share the bandwidth to meet the demands of minimum average distortion (minAVE) or minimum distortion variance (minVAR). In previous works on minVAR problems, bits are directly assigned in proportion to their complexity measures and we call it look-ahead allocation model (LAM), which leads to the fact that the performance will totally depend on the accuracy of the complexity measures. This paper proposes a look-ahead and feedback allocation model (LFAM) for joint rate allocation for High Efficiency Video Coding (HEVC) platform which requires negligible computational cost. We derive the model from the target function of minVAR theoretically. The bits are assigned according to the complexity measures, the distortion and bitrate values fed back by the encoder together. We integrated the proposed allocation model in HEVC reference software HM16.0 and several complexity measures were applied to our allocation model. Results demonstrate that our proposed LFAM performs better than LAM, and an average of 65.94% variance of mean square error (MSE) is saved with different complexity measures.

SPDec 9, 2017

Noise Level Estimation for Overcomplete Dictionary Learning Based on Tight Asymptotic Bounds

Rui Chen, Changshui Yang, Huizhu Jia et al.

In this letter, we address the problem of estimating Gaussian noise level from the trained dictionaries in update stage. We first provide rigorous statistical analysis on the eigenvalue distributions of a sample covariance matrix. Then we propose an interval-bounded estimator for noise variance in high dimensional setting. To this end, an effective estimation method for noise level is devised based on the boundness and asymptotic behavior of noise eigenvalue spectrum. The estimation performance of our method has been guaranteed both theoretically and empirically. The analysis and experiment results have demonstrated that the proposed algorithm can reliably infer true noise levels, and outperforms the relevant existing methods.

CVMay 17, 2017

Bayer Demosaicking Using Optimized Mean Curvature over RGB channels

Rui Chen, Huizhu Jia, Xiange Wen et al.

Color artifacts of demosaicked images are often found at contours due to interpolation across edges and cross-channel aliasing. To tackle this problem, we propose a novel demosaicking method to reliably reconstruct color channels of a Bayer image based on two different optimized mean-curvature (MC) models. The missing pixel values in green (G) channel are first estimated by minimizing a variational MC model. The curvatures of restored G-image surface are approximated as a linear MC model which guides the initial reconstruction of red (R) and blue (B) channels. Then a refinement process is performed to interpolate accurate full-resolution R and B images. Experiments on benchmark images have testified to the superiority of the proposed method in terms of both the objective and subjective quality.

CVApr 4, 2017

Learning a collaborative multiscale dictionary based on robust empirical mode decomposition

Rui Chen, Huizhu Jia, Xiaodong Xie et al.

Dictionary learning is a challenge topic in many image processing areas. The basic goal is to learn a sparse representation from an overcomplete basis set. Due to combining the advantages of generic multiscale representations with learning based adaptivity, multiscale dictionary representation approaches have the power in capturing structural characteristics of natural images. However, existing multiscale learning approaches still suffer from three main weaknesses: inadaptability to diverse scales of image data, sensitivity to noise and outliers, difficulty to determine optimal dictionary structure. In this paper, we present a novel multiscale dictionary learning paradigm for sparse image representations based on an improved empirical mode decomposition. This powerful data-driven analysis tool for multi-dimensional signal can fully adaptively decompose the image into multiscale oscillating components according to intrinsic modes of data self. This treatment can obtain a robust and effective sparse representation, and meanwhile generates a raw base dictionary at multiple geometric scales and spatial frequency bands. This dictionary is refined by selecting optimal oscillating atoms based on frequency clustering. In order to further enhance sparsity and generalization, a tolerance dictionary is learned using a coherence regularized model. A fast proximal scheme is developed to optimize this model. The multiscale dictionary is considered as the product of oscillating dictionary and tolerance dictionary. Experimental results demonstrate that the proposed learning approach has the superior performance in sparse image representations as compared with several competing methods. We also show the promising results in image denoising application.

CVDec 23, 2016

Correlation Preserving Sparse Coding Over Multi-level Dictionaries for Image Denoising

Rui Chen, Huizhu Jia, Xiaodong Xie et al.

In this letter, we propose a novel image denoising method based on correlation preserving sparse coding. Because the instable and unreliable correlations among basis set can limit the performance of the dictionary-driven denoising methods, two effective regularized strategies are employed in the coding process. Specifically, a graph-based regularizer is built for preserving the global similarity correlations, which can adaptively capture both the geometrical structures and discriminative features of textured patches. In particular, edge weights in the graph are obtained by seeking a nonnegative low-rank construction. Besides, a robust locality-constrained coding can automatically preserve not only spatial neighborhood information but also internal consistency present in noisy patches while learning overcomplete dictionary. Experimental results demonstrate that our proposed method achieves state-of-the-art denoising performance in terms of both PSNR and subjective visual quality.

CVDec 23, 2016

Blind restoration for non-uniform aerial images using non-local Retinex model and shearlet-based higher-order regularization

Rui Chen, Huizhu Jia, Xiaodong Xie et al.

Aerial images are often degraded by space-varying motion blur and simultaneous uneven illumination. To recover high-quality aerial image from its non-uniform version, we propose a novel patch-wise restoration approach based on a key observation that the degree of blurring is inevitably affected by the illuminated conditions. A non-local Retinex model is developed to accurately estimate the reflectance component from the degraded aerial image. Thereafter the uneven illumination is corrected well. And then non-uniform coupled blurring in the enhanced reflectance image is alleviated and transformed towards uniform distribution, which will facilitate the subsequent deblurring. For constructing the multi-scale sparsified regularizer, the discrete shearlet transform is improved to better represent anisotropic image features in term of directional sensitivity and selectivity. In addition, a new adaptive variant of total generalized variation is proposed for the structure-preserving regularizer. These complementary regularizers are elegantly integrated into an objective function. The final deblurred image with uniform illumination can be extracted by applying the fast alternating direction scheme to solve the derived function. The experimental results demonstrate that our algorithm can not only remove both the space-varying illumination and motion blur in the aerial image effectively but also recover the abundant details of aerial scenes with top-level objective and subjective quality, and outperforms other state-of-the-art restoration methods.