CV · Nov 26, 2022 · Code
CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion
Zixiang Zhao, Haowen Bai, Jiangshe Zhang et al. · eth-zurich, harvard
Multi-modality (MM) image fusion aims to render fused images that maintain the merits of different modalities, e.g., functional highlight and detailed textures. To tackle the challenge in modeling cross-modality features and decomposing desirable modality-specific and modality-shared features, we propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network. Firstly, CDDFuse uses Restormer blocks to extract cross-modality shallow features. We then introduce a dual-branch Transformer-CNN feature extractor with Lite Transformer (LT) blocks leveraging long-range attention to handle low-frequency global features and Invertible Neural Networks (INN) blocks focusing on extracting high-frequency local information. A correlation-driven loss is further proposed to make the low-frequency features correlated while the high-frequency features uncorrelated based on the embedded information. Then, the LT-based global fusion and INN-based local fusion layers output the fused image. Extensive experiments demonstrate that our CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. We also show that CDDFuse can boost the performance in downstream infrared-visible semantic segmentation and object detection in a unified benchmark. The code is available at https://github.com/Zhaozixiang1228/MMIF-CDDFuse.
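The correlation-driven idea in the CDDFuse abstract can be illustrated with a minimal sketch. It assumes features are flattened into plain lists, uses the Pearson correlation coefficient as the correlation measure, and combines the two terms as a ratio; the function names, the ratio form, and the constant `eps` are illustrative assumptions, not details taken from the abstract.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def decomposition_loss(low_a, low_b, high_a, high_b, eps=1.01):
    """Hypothetical loss: small when low-frequency features of the two
    modalities are correlated and high-frequency features are not.
    eps > 1 keeps the denominator positive, since correlation lies in [-1, 1]."""
    cc_low = pearson(low_a, low_b)
    cc_high = pearson(high_a, high_b)
    return cc_high ** 2 / (cc_low + eps)

# Toy features: one perfectly correlated pair, one uncorrelated pair.
correlated = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
uncorrelated = ([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0])

desired = decomposition_loss(*correlated, *uncorrelated)
undesired = decomposition_loss(*uncorrelated, *correlated)
```

In this toy setting the loss is lower when the correlated pair plays the low-frequency role, which is the behavior the abstract describes.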
CV · Mar 13, 2023 · Code
DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion
Zixiang Zhao, Haowen Bai, Yuanzhi Zhu et al. · eth-zurich
Multi-modality image fusion aims to combine different modalities to produce fused images that retain the complementary features of each modality, such as functional highlights and texture details. To leverage strong generative priors and address challenges such as unstable training and lack of interpretability for GAN-based generative methods, we propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem under the DDPM sampling framework, which is further divided into an unconditional generation subproblem and a maximum likelihood subproblem. The latter is modeled in a hierarchical Bayesian manner with latent variables and inferred by the expectation-maximization (EM) algorithm. By integrating the inference solution into the diffusion sampling iteration, our method can generate high-quality fused images with natural image generative priors and cross-modality information from source images. Note that all we require is an unconditional pre-trained generative model, and no fine-tuning is needed. Our extensive experiments indicate that our approach yields promising fusion results in infrared-visible image fusion and medical image fusion. The code is available at https://github.com/Zhaozixiang1228/MMIF-DDFM.
CV · Mar 15, 2023 · Code
Spherical Space Feature Decomposition for Guided Depth Map Super-Resolution
Zixiang Zhao, Jiangshe Zhang, Xiang Gu et al. · eth-zurich
Guided depth map super-resolution (GDSR), as a hot topic in multi-modal image processing, aims to upsample low-resolution (LR) depth maps with additional information involved in high-resolution (HR) RGB images from the same scene. The critical step of this task is to effectively extract domain-shared and domain-private RGB/depth features. In addition, three detailed issues, namely blurry edges, noisy surfaces, and over-transferred RGB texture, need to be addressed. In this paper, we propose the Spherical Space feature Decomposition Network (SSDNet) to solve the above issues. To better model cross-modality features, Restormer block-based RGB/depth encoders are employed for extracting local-global features. Then, the extracted features are mapped to the spherical space to complete the separation of private features and the alignment of shared features. Shared features of RGB are fused with the depth features to complete the GDSR task. Subsequently, a spherical contrast refinement (SCR) module is proposed to further address the detail issues. Patches that are classified according to imperfect categories are input into the SCR module, where the patch features are pulled closer to the ground truth and pushed away from the corresponding imperfect samples in the spherical feature space via contrastive learning. Extensive experiments demonstrate that our method can achieve state-of-the-art results on four test datasets, as well as successfully generalize to real-world scenes. The code is available at https://github.com/Zhaozixiang1228/GDSR-SSDNet.
CV · Aug 29, 2024 · Code
Enhancing Sound Source Localization via False Negative Elimination
Zengjie Song, Jiangshe Zhang, Yuxi Wang et al.
Sound source localization aims to localize objects emitting the sound in visual scenes. Recent works obtaining impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior art can lead to the false negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. As a result, this misalignment of audio and visual features could yield inferior performance. To address this issue, we propose a novel audio-visual learning framework which is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard SSPL acts as a negative-free method to eliminate false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing reliable visual anchor and audio negatives for contrast. Different from SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state of the art. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: https://github.com/zjsong/SACL.
CV · Sep 7, 2022
MSSPN: Automatic First Arrival Picking using Multi-Stage Segmentation Picking Network
Hongtao Wang, Jiangshe Zhang, Xiaoli Wei et al.
Picking the first arrival times of prestack gathers is called First Arrival Time (FAT) picking. It is an indispensable step in seismic data processing and was mainly done manually in the past. With the increasing density of seismic data collection, the efficiency of manual picking can no longer meet actual needs. Therefore, automatic picking methods have been greatly developed in recent decades, especially those based on deep learning. However, few of the current supervised deep learning-based methods can avoid the dependence on labeled samples. Besides, since gather data is a set of signals that differ greatly from natural images, it is difficult for current methods to solve the FAT picking problem at a low Signal to Noise Ratio (SNR). In this paper, for hard rock seismic gather data, we propose a Multi-Stage Segmentation Pickup Network (MSSPN), which solves the generalization problem across worksites and the picking problem in the case of low SNR. In MSSPN, four sub-models simulate the manual picking process, which is assumed to comprise four stages from coarse to fine. Experiments on seven field datasets with different qualities show that our MSSPN outperforms benchmarks by a large margin. In particular, our method can achieve more than 90% accurate picking across worksites in the case of medium and high SNRs, and a fine-tuned model can even achieve 88% accurate picking on the dataset with low SNR.
ML · Feb 9, 2023
Information Theoretical Importance Sampling Clustering
Jiangshe Zhang, Lizhen Ji, Meng Wang
A current assumption of most clustering methods is that the training data and future data are taken from the same distribution. However, this assumption may not hold in most real-world scenarios. In this paper, we propose an information theoretical importance sampling based approach for clustering problems (ITISC) which minimizes the worst case of expected distortions under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from importance sampling. The objective of the proposed approach is to minimize the loss under maximum degradation; hence the resulting problem is a constrained minimax optimization problem, which can be reformulated as an unconstrained problem using the Lagrange method. The optimization problem can be solved either by an alternative optimization algorithm or by a general optimization routine in commercially available software. Experimental results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed model. Furthermore, we show that fuzzy c-means is a special case of ITISC with the logarithmic distortion, and this observation provides an interesting physical interpretation for fuzzy exponent $m$.
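For readers unfamiliar with the fuzzy exponent $m$ mentioned above, the following minimal sketch shows the textbook fuzzy c-means membership update for a single 1-D point. It is standard fuzzy c-means, not the ITISC algorithm itself; the function name and the 1-D setting are illustrative assumptions.

```python
def fcm_memberships(x, centers, m=2.0):
    """Textbook fuzzy c-means memberships of a scalar point x.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with d_i = |x - c_i|.
    Requires m > 1; memberships become harder as m approaches 1.
    """
    dists = [abs(x - c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in dists) for di in dists]

soft = fcm_memberships(1.0, [0.0, 4.0], m=2.0)   # fairly soft assignment
hard = fcm_memberships(1.0, [0.0, 4.0], m=1.1)   # close to a hard assignment
```

With `m=2.0` the point at 1.0 receives membership 0.9 in the center at 0.0 and 0.1 in the center at 4.0; lowering `m` toward 1 pushes the assignment toward a hard one.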
LG · Jun 9, 2022
Learning Non-Vacuous Generalization Bounds from Optimization
Chengli Tan, Jiangshe Zhang, Junmin Liu
One of the fundamental challenges in the deep learning community is to theoretically understand how well a deep neural network generalizes to unseen data. However, current approaches often yield generalization bounds that are either too loose to be informative of the true generalization error or only valid to the compressed nets. In this study, we present a simple yet non-vacuous generalization bound from the optimization perspective. We achieve this goal by leveraging that the hypothesis set accessed by stochastic gradient algorithms is essentially fractal-like and thus can derive a tighter bound over the algorithm-dependent Rademacher complexity. The main argument rests on modeling the discrete-time recursion process via a continuous-time stochastic differential equation driven by fractional Brownian motion. Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks such as ResNet and Vision Transformer, even when they are trained on a large-scale dataset (e.g. ImageNet-1K).
CV · Feb 3, 2024 · Code
Image Fusion via Vision-Language Model
Zixiang Zhao, Lilun Deng, Haowen Bai et al.
Image fusion integrates essential information from multiple images into a single composite, enhancing structures, textures, and refining imperfections. Existing methods predominantly focus on pixel-level and semantic visual features for recognition, but often overlook the deeper text-level semantic information beyond vision. Therefore, we introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM), for the first time, utilizing explicit textual information from source images to guide the fusion process. Specifically, FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions. These descriptions are fused within the textual domain and guide the visual information fusion, enhancing feature extraction and contextual understanding, directed by textual semantic information via cross-attention. FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for the eight image fusion datasets across four fusion tasks, facilitating future research in vision-language model-based image fusion. Code and dataset are available at https://github.com/Zhaozixiang1228/IF-FILM.
ML · Feb 9, 2023
An information-theoretic learning model based on importance sampling
Jiangshe Zhang, Lizhen Ji, Fei Gao et al.
A crucial assumption underlying the most current theory of machine learning is that the training distribution is identical to the test distribution. However, this assumption may not hold in some real-world applications. In this paper, we develop a learning model based on principles of information theory by minimizing the worst-case loss at prescribed levels of uncertainty. We reformulate the empirical estimation of the risk functional and the distribution deviation constraint based on the importance sampling method. The objective of the proposed approach is to minimize the loss under maximum degradation and hence the resulting problem is a minimax problem which can be converted to an unconstrained minimum problem using the Lagrange method with the Lagrange multiplier $T$. We reveal that the minimization of the objective function under logarithmic transformation is equivalent to the minimization of the p-norm loss with $p=\frac{1}{T}$. We applied the proposed model to the face verification task on Racial Faces in the Wild datasets and showed that the proposed model performs better under large distribution deviations.
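The stated link between the log-transformed objective and a p-norm with $p=\frac{1}{T}$ can be checked numerically under one assumption: that the Lagrangian-relaxed worst-case objective takes the common log-mean-exp form $T \log\big(\frac{1}{n}\sum_i e^{\ell_i/T}\big)$. The abstract does not spell out this form, so the sketch below is an illustration of the identity, not the paper's derivation; all function names are hypothetical.

```python
import math

def soft_worst_case(losses, T):
    """Assumed relaxed objective: T * log(mean(exp(loss / T))), computed stably."""
    n = len(losses)
    m = max(losses)
    return m + T * math.log(sum(math.exp((l - m) / T) for l in losses) / n)

def power_mean(losses, p):
    """Normalized p-norm (power mean): (mean(loss ** p)) ** (1 / p)."""
    n = len(losses)
    return (sum(l ** p for l in losses) / n) ** (1.0 / p)

losses = [0.5, 1.0, 2.0, 4.0]
T = 0.25

# Apply the objective to log-losses and compare with the log of the power mean.
lhs = soft_worst_case([math.log(l) for l in losses], T)
rhs = math.log(power_mean(losses, 1.0 / T))
```

Under this assumption `lhs` and `rhs` agree up to floating-point error, because $T\log\big(\frac{1}{n}\sum_i \ell_i^{1/T}\big) = \log\big(\frac{1}{n}\sum_i \ell_i^{p}\big)^{1/p}$ for $p = 1/T$, and the logarithm is monotone.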
CV · Mar 10, 2025 · Code
Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion
Haowen Bai, Jiangshe Zhang, Zixiang Zhao et al.
Multi-exposure image fusion (MEF) synthesizes multiple, differently exposed images of the same scene into a single, well-exposed composite. Retinex theory, which separates image illumination from scene reflectance, provides a natural framework to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To address this limitation, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components with a shared reflectance component, and effectively models the glare induced by overexposure. The shared reflectance is learned via a bidirectional loss, which enables our approach to effectively mitigate the glare effect. Furthermore, we introduce a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of a fixed exposure level. Extensive experiments on diverse datasets, including underexposure-overexposure fusion, exposure controlled fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model. The code is available at https://github.com/HaowenBai/Retinex-MEF
CV · May 19, 2023 · Code
Equivariant Multi-Modality Image Fusion
Zixiang Zhao, Haowen Bai, Jiangshe Zhang et al.
Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.
CV · Apr 14, 2021 · Code
Discrete Cosine Transform Network for Guided Depth Map Super-Resolution
Zixiang Zhao, Jiangshe Zhang, Shuang Xu et al.
Guided depth super-resolution (GDSR) is an essential topic in multi-modal image processing, which reconstructs high-resolution (HR) depth maps from low-resolution ones collected under suboptimal conditions with the help of HR RGB images of the same scene. To address the challenges of interpreting the working mechanism, extracting cross-modal features, and avoiding over-transfer of RGB texture, we propose a novel Discrete Cosine Transform Network (DCTNet) that alleviates the problems from three aspects. First, the Discrete Cosine Transform (DCT) module reconstructs the multi-channel HR depth features by using DCT to solve the channel-wise optimization problem derived from the image domain. Second, we introduce a semi-coupled feature extraction module that uses shared convolutional kernels to extract common information and private kernels to extract modality-specific information. Third, we employ an edge attention mechanism to highlight the contours informative for guided upsampling. Extensive quantitative and qualitative evaluations demonstrate the effectiveness of our DCTNet, which outperforms previous state-of-the-art methods with a relatively small number of parameters. The code is available at https://github.com/Zhaozixiang1228/GDSR-DCTNet.
CV · Mar 10, 2021 · Code
Deep Convolutional Sparse Coding Network for Pansharpening with Guidance of Side Information
Shuang Xu, Jiangshe Zhang, Kai Sun et al.
Pansharpening is a fundamental issue in the remote sensing field. This paper proposes a side information partially guided convolutional sparse coding (SCSC) model for pansharpening. The key idea is to split the low resolution multispectral image into a feature map related to the panchromatic image and a feature map unrelated to it, where the former is regularized by the side information from panchromatic images. Following the principle of algorithm unrolling, the proposed model is generalized as a deep neural network, called the SCSC pansharpening neural network (SCSC-PNN). Compared with 13 classic and state-of-the-art methods on three satellites, the numerical experiments show that SCSC-PNN is superior to the others. The codes are available at https://github.com/xsxjtu/SCSC-PNN.
CV · Mar 8, 2021 · Code
Deep Gradient Projection Networks for Pan-sharpening
Shuang Xu, Jiangshe Zhang, Zixiang Zhao et al.
Pan-sharpening is an important technique for remote sensing imaging systems to obtain high resolution multispectral images. Recently, deep learning has become the most popular tool for pan-sharpening. This paper develops a model-based deep pan-sharpening approach. Specifically, two optimization problems regularized by the deep prior are formulated, and they are separately responsible for the generative models for panchromatic images and low resolution multispectral images. Then, the two problems are solved by a gradient projection algorithm, and the iterative steps are generalized into two network blocks. By alternatively stacking the two blocks, a novel network, called gradient projection based pan-sharpening neural network, is constructed. The experimental results on different kinds of satellite datasets demonstrate that the new network outperforms state-of-the-art methods both visually and quantitatively. The codes are available at https://github.com/xsxjtu/GPPNN.
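As background for the gradient projection algorithm mentioned above, here is a minimal generic sketch: each iteration takes a gradient step and then projects back onto the feasible set. The toy objective, the box constraint, and all names are illustrative assumptions; this is not the paper's pan-sharpening model or its network blocks.

```python
def project_box(x, lo=0.0, hi=1.0):
    """Project each coordinate onto the interval [lo, hi]."""
    return [min(max(v, lo), hi) for v in x]

def gradient_projection(grad_fn, project, x0, step=0.5, iters=100):
    """Generic gradient projection: gradient step, then projection."""
    x = list(x0)
    for _ in range(iters):
        g = grad_fn(x)
        x = project([xi - step * gi for xi, gi in zip(x, g)])
    return x

# Toy problem: minimize 0.5 * ||x - target||^2 subject to 0 <= x_i <= 1.
target = [1.5, -0.2, 0.3]
grad = lambda x: [xi - ti for xi, ti in zip(x, target)]
solution = gradient_projection(grad, project_box, [0.0, 0.0, 0.0])
```

For this toy problem the iterates converge to the target clipped to the box. In an unrolled network, each such iteration would correspond to one block with learned components.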
CV · Dec 29, 2020 · Code
Towards Reducing Severe Defocus Spread Effects for Multi-Focus Image Fusion via an Optimization Based Strategy
Shuang Xu, Lizhen Ji, Zhe Wang et al.
Multi-focus image fusion (MFF) is a popular technique to generate an all-in-focus image, where all objects in the scene are sharp. However, existing methods pay little attention to the defocus spread effects of real-world multi-focus images. Consequently, most of the methods perform badly in the areas near focus map boundaries. Based on the idea that each local region in the fused image should be similar to the sharpest one among the source images, this paper presents an optimization-based approach to reduce defocus spread effects. Firstly, a new MFF assessment metric is presented by combining the principle of structure similarity and detected focus maps. Then, the MFF problem is cast as maximizing this metric. The optimization is solved by gradient ascent. Experiments conducted on a real-world dataset verify the superiority of the proposed model. The codes are available at https://github.com/xsxjtu/MFF-SSIM.
CV · Dec 13, 2023
ReFusion: Learning Image Fusion from Reconstruction with Learnable Loss via Meta-Learning
Haowen Bai, Zixiang Zhao, Jiangshe Zhang et al.
Image fusion aims to combine information from multiple source images into a single one with more comprehensive informational content. Deep learning-based image fusion algorithms face significant challenges, including the lack of a definitive ground truth and the corresponding distance measurement. Additionally, current manually defined loss functions limit the model's flexibility and generalizability for various fusion tasks. To address these limitations, we propose ReFusion, a unified meta-learning based image fusion framework that dynamically optimizes the fusion loss for various tasks through source image reconstruction. Compared to existing methods, ReFusion employs a parameterized loss function that allows the training framework to be dynamically adapted according to the specific fusion scenario and task. ReFusion consists of three key components: a fusion module, a source reconstruction module, and a loss proposal module. We employ a meta-learning strategy to train the loss proposal module using the reconstruction loss. This strategy forces the fused image to be more conducive to reconstructing the source images, allowing the loss proposal module to generate an adaptive fusion loss that preserves the optimal information from the source images. The update of the fusion module relies on the learnable fusion loss proposed by the loss proposal module. The three modules update alternately, enhancing each other to optimize the fusion loss for different tasks and consistently achieve satisfactory results. Extensive experiments demonstrate that ReFusion is capable of adapting to various tasks, including infrared-visible, medical, multi-focus, and multi-exposure image fusion.
CV · Dec 4, 2024
Task-driven Image Fusion with Learnable Fusion Loss
Haowen Bai, Jiangshe Zhang, Zixiang Zhao et al.
Multi-modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual features compared to single-source images, often improving downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the downstream task loss in a meta-learning manner. The learning objective is to minimize the task loss of fused images after optimizing the fusion module with the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion's training relies entirely on the downstream task loss, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion's performance through fusion experiments conducted on four different datasets, in addition to evaluations on semantic segmentation and object detection tasks.
CV · Feb 3, 2025
Deep Unfolding Multi-modal Image Fusion Network via Attribution Analysis
Haowen Bai, Zixiang Zhao, Jiangshe Zhang et al.
Multi-modal image fusion synthesizes information from multiple sources into a single image, facilitating downstream tasks such as semantic segmentation. Current approaches primarily focus on acquiring informative fusion images at the visual display stratum through intricate mappings. Although some approaches attempt to jointly optimize image fusion and downstream tasks, these efforts often lack direct guidance or interaction, serving only to assist with a predefined fusion loss. To address this, we propose an ``Unfolding Attribution Analysis Fusion network'' (UAAFusion), using attribution analysis to tailor fused images more effectively for semantic segmentation, enhancing the interaction between the fusion and segmentation. Specifically, we utilize attribution analysis techniques to explore the contributions of semantic regions in the source images to task discrimination. At the same time, our fusion algorithm incorporates more beneficial features from the source images, thereby allowing the segmentation to guide the fusion process. Our method constructs a model-driven unfolding network that uses optimization objectives derived from attribution analysis, with an attribution fusion loss calculated from the current state of the segmentation network. We also develop a new pathway function for attribution analysis, specifically tailored to the fusion tasks in our unfolding network. An attribution attention mechanism is integrated at each network stage, allowing the fusion network to prioritize areas and pixels crucial for high-level recognition tasks. Additionally, to mitigate the information loss in traditional unfolding networks, a memory augmentation module is incorporated into our network to improve the information flow across various network layers. Extensive experiments demonstrate our method's superiority in image fusion and applicability to semantic segmentation.
LG · Jan 14, 2024
Stabilizing Sharpness-aware Minimization Through A Simple Renormalization Strategy
Chengli Tan, Jiangshe Zhang, Junmin Liu et al.
Recently, sharpness-aware minimization (SAM) has attracted much attention because of its surprising effectiveness in improving generalization performance. However, compared to stochastic gradient descent (SGD), it is more prone to getting stuck at saddle points, which may lead to performance degradation. To address this issue, we propose a simple renormalization strategy, dubbed Stable SAM (SSAM), so that the gradient norm of the descent step remains the same as that of the ascent step. Our strategy is easy to implement and flexible enough to integrate with SAM and its variants, at almost no computational cost. With elementary tools from convex optimization and learning theory, we also conduct a theoretical analysis of sharpness-aware training, revealing that compared to SGD, the effectiveness of SAM is only assured in a limited regime of learning rate. In contrast, we show how SSAM extends this regime of learning rate, so that it can consistently perform better than SAM with only this minor modification. Finally, we demonstrate the improved performance of SSAM on several representative data sets and tasks.
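A minimal sketch of the renormalization idea described above, on a toy quadratic: compute the gradient at the current point (ascent step), compute the gradient at the perturbed point (descent step), then rescale the descent gradient so its norm matches that of the ascent gradient. The function names, the hyperparameters `lr` and `rho`, and the toy objective are assumptions for illustration, not the authors' implementation.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def ssam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware step with a renormalized descent gradient.

    Returns (new_w, rescaled_descent_gradient, ascent_gradient_norm).
    """
    g_ascent = grad_fn(w)
    g_norm = norm(g_ascent)
    if g_norm == 0.0:
        return list(w), [0.0] * len(w), 0.0  # stationary point: nothing to do
    # Ascent: move to the perturbed point w + rho * g / ||g||.
    w_adv = [wi + rho * gi / g_norm for wi, gi in zip(w, g_ascent)]
    g_descent = grad_fn(w_adv)
    d_norm = norm(g_descent)
    # Renormalize so the descent gradient has the same norm as the ascent gradient.
    scale = g_norm / d_norm if d_norm > 0.0 else 0.0
    g_scaled = [scale * gi for gi in g_descent]
    new_w = [wi - lr * gi for wi, gi in zip(w, g_scaled)]
    return new_w, g_scaled, g_norm

# Toy objective f(w) = 0.5 * (4 * w0^2 + w1^2) and its gradient.
loss = lambda w: 0.5 * (4.0 * w[0] ** 2 + w[1] ** 2)
grad = lambda w: [4.0 * w[0], w[1]]

w0 = [1.0, 2.0]
w1, g_used, g_ref = ssam_step(w0, grad)
```

Plain SAM would use `g_descent` directly; the only change here is the rescaling factor, which matches the "minor modification" described in the abstract.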
LG · Apr 12, 2024
Seismic First Break Picking in a Higher Dimension Using Deep Graph Learning
Hongtao Wang, Li Long, Jiangshe Zhang et al.
Contemporary automatic first break (FB) picking methods typically analyze 1D signals, 2D source gathers, or 3D source-receiver gathers. Utilizing higher-dimensional data, such as 2D or 3D, incorporates global features, improving the stability of local picking. Despite the benefits, high-dimensional data requires structured input and increases computational demands. Addressing this, we propose a novel approach using deep graph learning called DGL-FB, constructing a large graph to efficiently extract information. In this graph, each seismic trace is represented as a node, connected by edges that reflect similarities. To manage the size of the graph, we develop a subgraph sampling technique to streamline model training and inference. Our proposed framework, DGL-FB, leverages deep graph learning for FB picking. It encodes subgraphs into global features using a deep graph encoder. Subsequently, the encoded global features are combined with local node signals and fed into a ResUNet-based 1D segmentation network for FB detection. Field survey evaluations of DGL-FB show superior accuracy and stability compared to a 2D U-Net-based benchmark method.
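The graph construction and subgraph sampling described above can be sketched in a simplified form. The sketch assumes each trace is summarized by a hypothetical coordinate tuple, that "similarity" means small Euclidean distance, and that subgraphs are sampled by a hop-limited breadth-first search; the abstract does not specify any of these choices.

```python
from collections import deque

def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph; nodes are indices into points."""
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def sample_subgraph(adj, seed, hops):
    """Nodes reachable from seed within the given number of hops."""
    visited = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nb in sorted(adj[node]):
            if nb not in visited:
                visited.add(nb)
                frontier.append((nb, depth + 1))
    return sorted(visited)

# Five hypothetical traces placed along a line.
traces = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
graph = knn_graph(traces, k=1)
neighbors_of_2 = sample_subgraph(graph, seed=2, hops=1)
```

Limiting the hop count bounds the subgraph size, which is one simple way to keep training and inference on a large graph manageable.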
CV · Feb 3, 2025
Simultaneous Automatic Picking and Manual Picking Refinement for First-Break
Haowen Bai, Zixiang Zhao, Jiangshe Zhang et al.
First-break picking is a pivotal procedure in processing microseismic data for geophysics and resource exploration. Recent advancements in deep learning have catalyzed the evolution of automated methods for identifying first-break. Nevertheless, the complexity of seismic data acquisition and the requirement for detailed, expert-driven labeling often result in outliers and potential mislabeling within manually labeled datasets. These issues can negatively affect the training of neural networks, necessitating algorithms that handle outliers or mislabeled data effectively. We introduce the Simultaneous Picking and Refinement (SPR) algorithm, designed to handle datasets plagued by outlier samples or even noisy labels. Unlike conventional approaches that regard manual picks as ground truth, our method treats the true first-break as a latent variable within a probabilistic model that includes a first-break labeling prior. SPR aims to uncover this variable, enabling dynamic adjustments and improved accuracy across the dataset. This strategy mitigates the impact of outliers or inaccuracies in manual labels. Intra-site picking experiments and cross-site generalization experiments on publicly available data confirm our method's performance in identifying first-break and its generalization across different sites. Additionally, our investigations into noisy signals and labels underscore SPR's resilience to both types of noise and its capability to refine misaligned manual annotations. Moreover, the flexibility of SPR, not being limited to any single network architecture, enhances its adaptability across various deep learning-based picking methods. Focusing on learning from data that may contain outliers or partial inaccuracies, SPR provides a robust solution to some of the principal obstacles in automatic first-break picking.
LG · May 29, 2025
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization
Chengli Tan, Yubo Zhou, Haishan Ye et al.
Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, we show that, unlike standard training such as stochastic gradient descent, the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.
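The abstract does not say which calibration metric is used. A common choice is the expected calibration error (ECE), sketched below under that assumption: predictions are grouped into equal-width confidence bins, and the gap between accuracy and mean confidence is averaged with bin-size weights. The function name and bin count are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# An overconfident toy model: 90% confidence but only 75% accuracy.
ece_overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

For the toy inputs the gap between 0.9 confidence and 0.75 accuracy gives an ECE of 0.15; a perfectly calibrated model would score 0.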
CV · Mar 8, 2025
A Label-Free High-Precision Residual Moveout Picking Method for Travel Time Tomography based on Deep Learning
Hongtao Wang, Jiandong Liang, Lei Wang et al.
Residual moveout (RMO) provides critical information for travel time tomography. The current industry-standard method for fitting RMO involves scanning high-order polynomial equations. However, this analytical approach does not accurately capture local saltation, leading to low iteration efficiency in tomographic inversion. Supervised learning-based image segmentation methods for picking can effectively capture local variations; however, they encounter challenges such as a scarcity of reliable training samples and the high complexity of post-processing. To address these issues, this study proposes a deep learning-based cascade picking method. It distinguishes accurate and robust RMOs using a segmentation network and a post-processing technique based on trend regression. Additionally, a data synthesis method is introduced, enabling the segmentation network to be trained on synthetic datasets for effective picking in field data. Furthermore, a set of metrics is proposed to quantify the quality of automatically picked RMOs. Experimental results based on both model and real data demonstrate that, compared to semblance-based methods, our approach achieves greater picking density and accuracy.
CV · May 23, 2023
UPNet: Uncertainty-based Picking Deep Learning Network for Robust First Break Picking
Hongtao Wang, Jiangshe Zhang, Xiaoli Wei et al.
In seismic exploration, first break (FB) picking is a crucial step in determining subsurface velocity models and significantly influences the placement of wells. Many deep neural network (DNN)-based automatic picking methods have been proposed to accelerate this processing. Notably, segmentation-based DNN methods produce a segmentation map and then estimate the FB from the map using a picking threshold. However, the uncertainty of the results picked by DNNs still needs to be analyzed. As a result, automatic picking methods applied to field datasets cannot ensure robustness, especially at a low signal-to-noise ratio (SNR). In this paper, we introduce uncertainty quantification into the FB picking task and propose a novel uncertainty-based picking deep learning network called UPNet. UPNet not only estimates the uncertainty of the network output but can also filter out picks with low confidence. Extensive experiments show that UPNet exhibits higher accuracy and robustness than the deterministic DNN-based model, achieving State-of-the-Art (SOTA) performance in field surveys. In addition, we verify that the measured uncertainty is meaningful and can provide a reference for human decision-making.
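The threshold-based picking step mentioned above can be sketched as follows. This is one plausible reading of "picking threshold" (the first time sample whose score reaches the threshold on each trace), not UPNet's actual procedure; the function name `pick_first_breaks`, the array layout, and the `-1` convention for unpicked traces are assumptions.

```python
import numpy as np

def pick_first_breaks(prob_map, threshold=0.5):
    """Per trace (column), return the first time sample whose score reaches
    the threshold; traces with no such sample get -1."""
    above = prob_map >= threshold      # shape: (n_samples, n_traces)
    picks = above.argmax(axis=0)       # index of the first True in each column
    picks[~above.any(axis=0)] = -1     # argmax returns 0 when nothing is True
    return picks

# Hypothetical 3-sample, 3-trace segmentation map.
prob_map = np.array([
    [0.1, 0.0, 0.2],
    [0.7, 0.1, 0.3],
    [0.9, 0.8, 0.4],
])
picks = pick_first_breaks(prob_map)
```

In this toy map the third trace never reaches the threshold, so it is left unpicked rather than assigned sample 0.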
LG · May 5, 2021
Understanding Short-Range Memory Effects in Deep Neural Networks
Chengli Tan, Jiangshe Zhang, Junmin Liu
Stochastic gradient descent (SGD) is of fundamental importance in deep learning. Despite its simplicity, elucidating its efficacy remains challenging. Conventionally, the success of SGD is ascribed to the stochastic gradient noise (SGN) incurred in the training process. Based on this consensus, SGD is frequently treated and analyzed as the Euler-Maruyama discretization of stochastic differential equations (SDEs) driven by either Brownian or Levy stable motion. In this study, we argue that SGN is neither Gaussian nor Levy stable. Instead, inspired by the short-range correlation emerging in the SGN series, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM). Accordingly, the different convergence behavior of SGD dynamics is well-grounded. Moreover, the first passage time of an SDE driven by FBM is approximately derived. The result suggests a lower escaping rate for a larger Hurst parameter, and thus SGD stays longer in flat minima. This happens to coincide with the well-known phenomenon that SGD favors flat minima that generalize well. Extensive experiments are conducted to validate our conjecture, and it is demonstrated that short-range memory effects persist across various model architectures, datasets, and training strategies. Our study opens up a new perspective and may contribute to a better understanding of SGD.
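As background on the driving noise, a fractional Brownian motion path can be sampled from its covariance function. The sketch below uses a Cholesky factorization, which is simple but cubic in the number of steps; the helper names and the value `hurst=0.7` are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

def fbm_covariance(times, hurst):
    """FBM covariance: 0.5 * (s^(2H) + t^(2H) - |t - s|^(2H))."""
    s, t = np.meshgrid(times, times, indexing="ij")
    h2 = 2.0 * hurst
    return 0.5 * (s**h2 + t**h2 - np.abs(t - s)**h2)

def sample_fbm(n_steps, hurst, rng):
    """Draw one FBM path on (0, 1] via a Cholesky factor of the covariance."""
    times = np.arange(1, n_steps + 1) / n_steps
    cov = fbm_covariance(times, hurst)
    chol = np.linalg.cholesky(cov + 1e-12 * np.eye(n_steps))  # tiny jitter for stability
    return times, chol @ rng.standard_normal(n_steps)

rng = np.random.default_rng(0)
times, path = sample_fbm(128, hurst=0.7, rng=rng)
```

With a Hurst parameter of 0.5 the covariance reduces to `min(s, t)`, that is, ordinary Brownian motion.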
CV · Dec 31, 2020
FGF-GAN: A Lightweight Generative Adversarial Network for Pansharpening via Fast Guided Filter
Zixiang Zhao, Jiangshe Zhang, Shuang Xu et al.
Pansharpening is a widely used image enhancement technique for remote sensing. Its principle is to fuse the input high-resolution single-channel panchromatic (PAN) image and low-resolution multi-spectral image to obtain a high-resolution multi-spectral (HRMS) image. Existing deep learning pansharpening methods have two shortcomings. First, the features of the two input images need to be concatenated along the channel dimension to reconstruct the HRMS image, which makes the importance of PAN images not prominent and also leads to high computational cost. Second, the implicit information of features is difficult to extract through a manually designed loss function. To this end, we propose a generative adversarial network via the fast guided filter (FGF) for pansharpening. In the generator, traditional channel concatenation is replaced by FGF to better retain the spatial information while reducing the number of parameters. Meanwhile, the fusion objects can be highlighted by the spatial attention module. In addition, the latent information of features can be preserved effectively through adversarial training. Numerous experiments illustrate that our network generates high-quality HRMS images that can surpass existing methods while using fewer parameters.
CV · Dec 16, 2020
Domain Adaptive Object Detection via Feature Separation and Alignment
Chengyang Liang, Zixiang Zhao, Junmin Liu et al.
Recently, adversarial-based domain adaptive object detection (DAOD) methods have developed rapidly. However, two issues need to be resolved urgently. Firstly, numerous methods reduce the distributional shifts only by aligning all the features between the source and target domains, while ignoring the private information of each domain. Secondly, DAOD should consider feature alignment on the regions of images where objects exist, but redundancy in the region proposals and background noise can reduce the domain transferability. Therefore, we establish a Feature Separation and Alignment Network (FSANet), which consists of a gray-scale feature separation (GSFS) module, a local-global feature alignment (LGFA) module and a region-instance-level alignment (RILA) module. The GSFS module uses a dual-stream framework to separate the distractive information, which is useless for detection, from the shared information, which is useful for it, so as to focus on the intrinsic features of objects and resolve the first issue. Then, the LGFA and RILA modules reduce the distributional shifts of the multi-level features. Notably, scale-space filtering is exploited to implement adaptive searching for regions to be aligned, and instance-level features in each region are refined to reduce the redundancy and noise mentioned in the second issue. Various experiments on multiple benchmark datasets show that our FSANet achieves better performance on target domain detection and surpasses the state-of-the-art methods.
CV · Sep 21, 2020
MFIF-GAN: A New Generative Adversarial Network for Multi-Focus Image Fusion
Yicheng Wang, Shuang Xu, Junmin Liu et al.
Multi-Focus Image Fusion (MFIF) is a promising image enhancement technique for obtaining all-in-focus images that meet visual needs, and it is a precondition of other computer vision tasks. One of the research trends of MFIF is to avoid the defocus spread effect (DSE) around the focus/defocus boundary (FDB). In this paper, we propose a network termed MFIF-GAN to attenuate the DSE by generating focus maps in which the foreground regions are correctly larger than the corresponding objects. The Squeeze and Excitation Residual module is employed in the network. By combining the prior knowledge of the training condition, this network is trained on a synthetic dataset based on an α-matte model. In addition, the reconstruction and gradient regularization terms are combined in the loss functions to enhance the boundary details and improve the quality of fused images. Extensive experiments demonstrate that MFIF-GAN outperforms several state-of-the-art (SOTA) methods in visual perception, quantitative analysis, and efficiency. Moreover, an edge diffusion and contraction module is proposed for the first time to verify that focus maps generated by our method are accurate at the pixel level.
IV · Sep 2, 2020
When Image Decomposition Meets Deep Learning: A Novel Infrared and Visible Image Fusion Method
Zixiang Zhao, Jiangshe Zhang, Shuang Xu et al.
Infrared and visible image fusion, as a hot topic in image processing and image enhancement, aims to produce fused images retaining the detail texture information in visible images and the thermal radiation information in infrared images. A critical step for this issue is to decompose features in different scales and to merge them separately. In this paper, we propose a novel dual-stream auto-encoder (AE) based fusion network. The core idea is that the encoder decomposes an image into base and detail feature maps with low- and high-frequency information, respectively, and that the decoder is responsible for the original image reconstruction. To this end, a well-designed loss function is established to make the base/detail feature maps similar/dissimilar. In the test phase, base and detail feature maps are respectively merged via an additional fusion layer, which contains a saliency weighted-based spatial attention module and a channel attention module to adaptively preserve more information from source images and to highlight the objects. Then the fused image is recovered by the decoder. Qualitative and quantitative results demonstrate that our method can generate fusion images containing highlighted targets and abundant detail texture information with strong reproducibility and meanwhile is superior to the state-of-the-art (SOTA) approaches.
IV · May 18, 2020
Deep Convolutional Sparse Coding Networks for Image Fusion
Shuang Xu, Zixiang Zhao, Yicheng Wang et al.
Image fusion is a significant problem in many fields including digital photography, computational imaging and remote sensing, to name but a few. Recently, deep learning has emerged as an important tool for image fusion. This paper presents three deep convolutional sparse coding (CSC) networks for three kinds of image fusion tasks (i.e., infrared and visible image fusion, multi-exposure image fusion, and multi-modal image fusion). The CSC model and the iterative shrinkage and thresholding algorithm are generalized into dictionary convolution units. As a result, all hyper-parameters are learned from data. Our extensive experiments and comprehensive comparisons reveal the superiority of the proposed networks with regard to quantitative evaluation and visual inspection.
CV · May 12, 2020
Efficient and Model-Based Infrared and Visible Image Fusion Via Algorithm Unrolling
Zixiang Zhao, Shuang Xu, Jiangshe Zhang et al.
Infrared and visible image fusion (IVIF) aims to obtain images that retain thermal radiation information from infrared images and texture details from visible images. In this paper, a model-based convolutional neural network (CNN) model, referred to as Algorithm Unrolling Image Fusion (AUIF), is proposed to overcome the shortcomings of traditional CNN-based IVIF models. The proposed AUIF model starts with the iterative formulas of two traditional optimization models, which are established to accomplish two-scale decomposition, i.e., separating low-frequency base information and high-frequency detail information from source images. Algorithm unrolling is then implemented, where each iteration is mapped to a CNN layer and each optimization model is transformed into a trainable neural network. Compared with general network architectures, the proposed framework incorporates model-based prior information and is designed more reasonably. After the unrolling operation, our model contains two decomposers (encoders) and an additional reconstructor (decoder). In the training phase, this network is trained to reconstruct the input image. In the test phase, the base (or detail) decomposed feature maps of infrared/visible images are merged respectively by an extra fusion layer, and then the decoder outputs the fusion image. Qualitative and quantitative comparisons demonstrate the superiority of our model, which can robustly generate fusion images containing highlighted targets and legible details, exceeding the state-of-the-art methods. Furthermore, our network has fewer weights and runs faster.
CV · May 12, 2020
Bayesian Fusion for Infrared and Visible Images
Zixiang Zhao, Shuang Xu, Chunxia Zhang et al.
Infrared and visible image fusion has been a hot issue in image fusion. The goal of this task is a fused image containing both the gradient and detailed texture information of visible images and the thermal radiation and highlighted targets of infrared images. In this paper, a novel Bayesian fusion model is established for infrared and visible images. In our model, the image fusion task is cast as a regression problem. To measure the variable uncertainty, we formulate the model in a hierarchical Bayesian manner. To make the fused image agree with the human visual system, the model incorporates a total-variation (TV) penalty. Subsequently, the model is efficiently inferred by the expectation-maximization (EM) algorithm. We test our algorithm on the TNO and NIR image fusion datasets against several state-of-the-art approaches. Compared with previous methods, the novel model can generate better fused images with highlighted targets and rich texture details, which can improve the reliability of automatic target detection and recognition systems.
IV · Mar 20, 2020
DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion
Zixiang Zhao, Shuang Xu, Chunxia Zhang et al.
Infrared and visible image fusion, a hot topic in the field of image processing, aims at obtaining fused images keeping the advantages of source images. This paper proposes a novel auto-encoder (AE) based fusion network. The core idea is that the encoder decomposes an image into background and detail feature maps with low- and high-frequency information, respectively, and that the decoder recovers the original image. To this end, the loss function makes the background/detail feature maps of source images similar/dissimilar. In the test phase, background and detail feature maps are respectively merged via a fusion module, and the fused image is recovered by the decoder. Qualitative and quantitative results illustrate that our method can generate fusion images containing highlighted targets and abundant detail texture information with strong robustness and meanwhile surpass state-of-the-art (SOTA) approaches.
CV · Feb 12, 2020
MFFW: A new dataset for multi-focus image fusion
Shuang Xu, Xiaoli Wei, Chunxia Zhang et al.
Multi-focus image fusion (MFF) is a fundamental task in the field of computational photography. Current methods have achieved significant performance improvements, but they are evaluated on simulated image sets or the Lytro dataset. Recently, a growing number of researchers have paid attention to the defocus spread effect, a phenomenon of real-world multi-focus images. Nonetheless, the defocus spread effect is not obvious in simulated or Lytro datasets, where popular methods perform very similarly. To compare their performance on images with the defocus spread effect, this paper constructs a new dataset called MFF in the wild (MFFW). It contains 19 pairs of multi-focus images collected from the Internet. We register all pairs of source images, and provide focus maps and reference images for some of the pairs. Compared with the Lytro dataset, images in MFFW suffer significantly from the defocus spread effect. In addition, the scenes of MFFW are more complex. The experiments demonstrate that most state-of-the-art methods cannot robustly generate satisfactory fusion images on the MFFW dataset. MFFW can be a new baseline dataset to test whether an MFF algorithm is able to deal with the defocus spread effect.
LG · Jan 22, 2020
Toward a Controllable Disentanglement Network
Zengjie Song, Oluwasanmi Koyejo, Jiangshe Zhang
This paper addresses two crucial problems of learning disentangled image representations, namely controlling the degree of disentanglement during image editing, and balancing the disentanglement strength and the reconstruction quality. To encourage disentanglement, we devise a distance covariance based decorrelation regularization. Further, for the reconstruction step, our model leverages a soft target representation combined with the latent image code. By exploring the real-valued space of the soft target representation, we are able to synthesize novel images with the designated properties. To improve the perceptual quality of images generated by autoencoder (AE)-based models, we extend the encoder-decoder architecture with the generative adversarial network (GAN) by collapsing the AE decoder and the GAN generator into one. We also design a classification based protocol to quantitatively evaluate the disentanglement strength of our model. Experimental results showcase the benefits of the proposed model.
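The statistic behind a distance covariance based regularizer can be sketched as below. This computes the standard squared sample distance covariance (V-statistic form); how the paper turns it into a training penalty is not shown here, and the helper names and toy data are assumptions.

```python
import numpy as np

def _centered_distances(x):
    """Pairwise Euclidean distances, double-centered (row, column, grand mean)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_covariance_sq(x, y):
    """Squared sample distance covariance between paired samples x and y."""
    a, b = _centered_distances(x), _centered_distances(y)
    return float((a * b).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=300)
dependent = distance_covariance_sq(x, x ** 2)                  # nonlinear dependence
independent = distance_covariance_sq(x, rng.normal(size=300))  # unrelated sample
```

Unlike ordinary covariance, which is zero in population for `x` and `x ** 2` when `x` is symmetric around zero, distance covariance responds to this nonlinear dependence, which is what makes it a candidate decorrelation penalty.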
LG · Dec 25, 2019
Learning Controllable Disentangled Representations with Decorrelation Regularization
Zengjie Song, Oluwasanmi Koyejo, Jiangshe Zhang
A crucial problem in learning disentangled image representations is controlling the degree of disentanglement during image editing, while preserving the identity of objects. In this work, we propose a simple yet effective model with the encoder-decoder architecture to address this challenge. To encourage disentanglement, we devise a distance covariance based decorrelation regularization. Further, for the reconstruction step, our model leverages a soft target representation combined with the latent image code. By exploiting the real-valued space of the soft target representations, we are able to synthesize novel images with the designated properties. We also design a classification based protocol to quantitatively evaluate the disentanglement strength of our model. Experimental results show that the proposed model competently disentangles factors of variation, and is able to manipulate face images to synthesize the desired attributes.
ML · Jan 1, 2019
Adaptive Quantile Low-Rank Matrix Factorization
Shuang Xu, Chun-Xia Zhang, Jiangshe Zhang
Low-rank matrix factorization (LRMF) has gained much popularity owing to its successful applications in both computer vision and data mining. By assuming that noise comes from a Gaussian, Laplace or mixture of Gaussian distributions, significant efforts have been made on optimizing the (weighted) $L_1$ or $L_2$-norm loss between an observed matrix and its bilinear factorization. However, the type of noise distribution is generally unknown in real applications, and inappropriate assumptions will inevitably deteriorate the behavior of LRMF. Moreover, real data are often corrupted by skewed rather than symmetric noise. To tackle this problem, this paper presents a novel LRMF model called AQ-LRMF, which models noise with a mixture of asymmetric Laplace distributions. An efficient algorithm based on the expectation-maximization (EM) algorithm is also offered to estimate the parameters involved in AQ-LRMF. The AQ-LRMF model has the advantage that it can approximate noise well regardless of whether the real noise is symmetric or skewed. The core idea of AQ-LRMF lies in solving a weighted $L_1$ problem with weights being learned from data. The experiments conducted on synthetic and real datasets show that AQ-LRMF outperforms several state-of-the-art techniques. Furthermore, AQ-LRMF is also superior to the other algorithms in capturing local structural information contained in real images.
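The link between asymmetric Laplace noise and a weighted $L_1$ problem can be illustrated with the quantile (check) loss, which is, up to scale and an additive constant, the negative log-density of an asymmetric Laplace distribution. The sketch below shows only this loss for a fixed skewness parameter; it is not the paper's mixture model or EM procedure, and the function name `pinball_loss` and the sample residuals are assumptions.

```python
import numpy as np

def pinball_loss(residual, tau):
    """Quantile (check) loss: tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    residual = np.asarray(residual, dtype=float)
    weights = np.where(residual >= 0, tau, 1.0 - tau)  # side-dependent weights
    return weights * np.abs(residual)                   # a weighted L1 penalty

r = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
symmetric = pinball_loss(r, tau=0.5)  # reduces to 0.5 * |r|
skewed = pinball_loss(r, tau=0.9)     # penalizes positive residuals more heavily
```

Setting `tau=0.5` recovers the ordinary (scaled) $L_1$ loss, while other values weight the two signs of the residual differently, which is how a skewed noise model shows up as per-entry weights.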
ML · Dec 11, 2018
Variational Bayesian Weighted Complex Network Reconstruction
Shuang Xu, Chun-Xia Zhang, Pei Wang et al.
Complex network reconstruction is a hot topic in many fields. Currently, the most popular data-driven reconstruction framework is based on lasso. However, it is found that, in the presence of noise, lasso loses efficiency for weighted networks. This paper builds a new framework to cope with this problem. The key idea is to employ a series of linear regression problems to model the relationship between network nodes, and then to use an efficient variational Bayesian algorithm to infer the unknown coefficients. The numerical experiments conducted on both synthetic and real data demonstrate that the new method outperforms lasso with regard to both reconstruction accuracy and running speed.