CVNov 26, 2022Code
CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image FusionZixiang Zhao, Haowen Bai, Jiangshe Zhang et al. · eth-zurich, harvard
Multi-modality (MM) image fusion aims to render fused images that maintain the merits of different modalities, e.g., functional highlight and detailed textures. To tackle the challenge in modeling cross-modality features and decomposing desirable modality-specific and modality-shared features, we propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network. Firstly, CDDFuse uses Restormer blocks to extract cross-modality shallow features. We then introduce a dual-branch Transformer-CNN feature extractor with Lite Transformer (LT) blocks leveraging long-range attention to handle low-frequency global features and Invertible Neural Networks (INN) blocks focusing on extracting high-frequency local information. A correlation-driven loss is further proposed to make the low-frequency features correlated while the high-frequency features uncorrelated based on the embedded information. Then, the LT-based global fusion and INN-based local fusion layers output the fused image. Extensive experiments demonstrate that our CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. We also show that CDDFuse can boost the performance in downstream infrared-visible semantic segmentation and object detection in a unified benchmark. The code is available at https://github.com/Zhaozixiang1228/MMIF-CDDFuse.
CVMar 13, 2023Code
DDFM: Denoising Diffusion Model for Multi-Modality Image FusionZixiang Zhao, Haowen Bai, Yuanzhi Zhu et al. · eth-zurich
Multi-modality image fusion aims to combine different modalities to produce fused images that retain the complementary features of each modality, such as functional highlights and texture details. To leverage strong generative priors and address challenges such as unstable training and lack of interpretability for GAN-based generative methods, we propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem under the DDPM sampling framework, which is further divided into an unconditional generation subproblem and a maximum likelihood subproblem. The latter is modeled in a hierarchical Bayesian manner with latent variables and inferred by the expectation-maximization (EM) algorithm. By integrating the inference solution into the diffusion sampling iteration, our method can generate high-quality fused images with natural image generative priors and cross-modality information from source images. Note that all we required is an unconditional pre-trained generative model, and no fine-tuning is needed. Our extensive experiments indicate that our approach yields promising fusion results in infrared-visible image fusion and medical image fusion. The code is available at \url{https://github.com/Zhaozixiang1228/MMIF-DDFM}.
CVMar 15, 2023Code
Spherical Space Feature Decomposition for Guided Depth Map Super-ResolutionZixiang Zhao, Jiangshe Zhang, Xiang Gu et al. · eth-zurich
Guided depth map super-resolution (GDSR), as a hot topic in multi-modal image processing, aims to upsample low-resolution (LR) depth maps with additional information involved in high-resolution (HR) RGB images from the same scene. The critical step of this task is to effectively extract domain-shared and domain-private RGB/depth features. In addition, three detailed issues, namely blurry edges, noisy surfaces, and over-transferred RGB texture, need to be addressed. In this paper, we propose the Spherical Space feature Decomposition Network (SSDNet) to solve the above issues. To better model cross-modality features, Restormer block-based RGB/depth encoders are employed for extracting local-global features. Then, the extracted features are mapped to the spherical space to complete the separation of private features and the alignment of shared features. Shared features of RGB are fused with the depth features to complete the GDSR task. Subsequently, a spherical contrast refinement (SCR) module is proposed to further address the detail issues. Patches that are classified according to imperfect categories are input into the SCR module, where the patch features are pulled closer to the ground truth and pushed away from the corresponding imperfect samples in the spherical feature space via contrastive learning. Extensive experiments demonstrate that our method can achieve state-of-the-art results on four test datasets, as well as successfully generalize to real-world scenes. The code is available at \url{https://github.com/Zhaozixiang1228/GDSR-SSDNet}.
IVJul 8, 2024Code
Pan-denoising: Guided Hyperspectral Image Denoising via Weighted Represent Coefficient Total VariationShuang Xu, Qiao Ke, Jiangjun Peng et al.
This paper introduces a novel paradigm for hyperspectral image (HSI) denoising, which is termed \textit{pan-denoising}. In a given scene, panchromatic (PAN) images capture similar structures and textures to HSIs but with less noise. This enables the utilization of PAN images to guide the HSI denoising process. Consequently, pan-denoising, which incorporates an additional prior, has the potential to uncover underlying structures and details beyond the internal information modeling of traditional HSI denoising methods. However, the proper modeling of this additional prior poses a significant challenge. To alleviate this issue, the paper proposes a novel regularization term, Panchromatic Weighted Representation Coefficient Total Variation (PWRCTV). It employs the gradient maps of PAN images to automatically assign different weights of TV regularization for each pixel, resulting in larger weights for smooth areas and smaller weights for edges. This regularization forms the basis of a pan-denoising model, which is solved using the Alternating Direction Method of Multipliers. Extensive experiments on synthetic and real-world datasets demonstrate that PWRCTV outperforms several state-of-the-art methods in terms of metrics and visual quality. Furthermore, an HSI classification experiment confirms that PWRCTV, as a preprocessing method, can enhance the performance of downstream classification tasks. The code and data are available at https://github.com/shuangxu96/PWRCTV.
CVAug 31, 2023
Neural Gradient RegularizerShuang Xu, Yifan Wang, Zixiang Zhao et al. · stanford
Owing to its significant success, the prior imposed on gradient maps has consistently been a subject of great interest in the field of image processing. Total variation (TV), one of the most representative regularizers, is known for its ability to capture the intrinsic sparsity prior underlying gradient maps. Nonetheless, TV and its variants often underestimate the gradient maps, leading to the weakening of edges and details whose gradients should not be zero in the original image (i.e., image structures is not describable by sparse priors of gradient maps). Recently, total deep variation (TDV) has been introduced, assuming the sparsity of feature maps, which provides a flexible regularization learned from large-scale datasets for a specific task. However, TDV requires to retrain the network with image/task variations, limiting its versatility. To alleviate this issue, in this paper, we propose a neural gradient regularizer (NGR) that expresses the gradient map as the output of a neural network. Unlike existing methods, NGR does not rely on any subjective sparsity or other prior assumptions on image gradient maps, thereby avoiding the underestimation of gradient maps. NGR is applicable to various image types and different image processing tasks, functioning in a zero-shot learning fashion, making it a versatile and plug-and-play regularizer. Extensive experimental results demonstrate the superior performance of NGR over state-of-the-art counterparts for a range of different tasks, further validating its effectiveness and versatility.
CVFeb 3, 2024Code
Image Fusion via Vision-Language ModelZixiang Zhao, Lilun Deng, Haowen Bai et al.
Image fusion integrates essential information from multiple images into a single composite, enhancing structures, textures, and refining imperfections. Existing methods predominantly focus on pixel-level and semantic visual features for recognition, but often overlook the deeper text-level semantic information beyond vision. Therefore, we introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM), for the first time, utilizing explicit textual information from source images to guide the fusion process. Specifically, FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions. These descriptions are fused within the textual domain and guide the visual information fusion, enhancing feature extraction and contextual understanding, directed by textual semantic information via cross-attention. FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for the eight image fusion datasets across four fusion tasks, facilitating future research in vision-language model-based image fusion. Code and dataset are available at https://github.com/Zhaozixiang1228/IF-FILM.
CVFeb 5
FastVMT: Eliminating Redundancy in Video Motion TransferYue Ma, Zhikai Wang, Tianhao Ren et al.
Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
LGApr 1, 2024Code
Make Continual Learning Stronger via C-FlatAng Bian, Wei Li, Hangjie Yuan et al.
Model generalization ability upon incrementally acquiring dynamically updating knowledge from sequentially arriving tasks is crucial to tackle the sensitivity-stability dilemma in Continual Learning (CL). Weight loss landscape sharpness minimization seeking for flat minima lying in neighborhoods with uniform low loss or smooth gradient is proven to be a strong training regime improving model generalization compared with loss minimization based optimizer like SGD. Yet only a few works have discussed this training regime for CL, proving that dedicated designed zeroth-order sharpness optimizer can improve CL performance. In this work, we propose a Continual Flatness (C-Flat) method featuring a flatter loss landscape tailored for CL. C-Flat could be easily called with only one line of code and is plug-and-play to any CL methods. A general framework of C-Flat applied to all CL categories and a thorough comparison with loss minima optimizer and flat minima based CL approaches is presented in this paper, showing that our method can boost CL performance in almost all cases. Code is available at https://github.com/WanNaa/C-Flat.
CVMay 19
Stitched Value Model for Diffusion AlignmentHyojun Go, Hyungjin Chung, Prune Truong et al.
For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.
CVMay 28, 2025Code
Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion TransformersWeilun Feng, Chuanguang Yang, Haotong Qin et al.
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
CVApr 8, 2024Code
BinaryDM: Accurate Weight Binarization for Efficient Diffusion ModelsXingyu Zheng, Xianglong Liu, Haotong Qin et al.
With the advancement of diffusion models (DMs) and the substantially increased computational requirements, quantization emerges as a practical solution to obtain compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. This paper proposes a novel weight binarization approach for DMs, namely BinaryDM, pushing binarized DMs to be accurate and efficient by improving the representation and optimization. From the representation perspective, we present an Evolvable-Basis Binarizer (EBB) to enable a smooth evolution of DMs from full-precision to accurately binarized. EBB enhances information representation in the initial stage through the flexible combination of multiple binary bases and applies regularization to evolve into efficient single-basis binarization. The evolution only occurs in the head and tail of the DM architecture to retain the stability of training. From the optimization perspective, a Low-rank Representation Mimicking (LRM) is applied to assist the optimization of binarized DMs. The LRM mimics the representations of full-precision DMs in low-rank space, alleviating the direction ambiguity of the optimization process caused by fine-grained alignment. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods of DMs under ultra-low bit-widths. With 1-bit weight and 4-bit activation (W1A4), BinaryDM achieves as low as 7.74 FID and saves the performance from collapse (baseline FID 10.87). As the first binarization method for diffusion models, W1A4 BinaryDM achieves impressive 15.2x OPs and 29.2x model size savings, showcasing its substantial potential for edge deployment. The code is available at https://github.com/Xingyu-Zheng/BinaryDM.
CVMar 10, 2025Code
Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image FusionHaowen Bai, Jiangshe Zhang, Zixiang Zhao et al.
Multi-exposure image fusion (MEF) synthesizes multiple, differently exposed images of the same scene into a single, well-exposed composite. Retinex theory, which separates image illumination from scene reflectance, provides a natural framework to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To address this limitation, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components with a shared reflectance component, and effectively models the glare induced by overexposure. The shared reflectance is learned via a bidirectional loss, which enables our approach to effectively mitigate the glare effect. Furthermore, we introduce a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of a fixed exposure level. Extensive experiments on diverse datasets, including underexposure-overexposure fusion, exposure controlled fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model. The code is available at https://github.com/HaowenBai/Retinex-MEF
CVFeb 5
MambaVF: State Space Model for Efficient Video FusionZixiang Zhao, Yukun Cui, Lilun Deng et al.
Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing up to 92.25% of parameters and 88.79% of computational FLOPs and a 2.1x speedup compared to existing methods. Project page: https://mambavf.github.io
LGApr 15
A Faster Path to Continual LearningWei Li, Hangjie Yuan, Zixiang Zhao et al.
Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments show that C-Flat Turbo is 1.0$\times$ to 1.25$\times$ faster than C-Flat across a wide range of CL methods, while achieving comparable or even improved accuracy.
LGAug 26, 2025Code
C-Flat++: Towards a More Efficient and Powerful Framework for Continual LearningWei Li, Hangjie Yuan, Zixiang Zhao et al.
Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in continual learning (CL) to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose \textbf{C}ontinual \textbf{Flat}ness (\textbf{C-Flat}), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. Besides, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.
CVJun 6, 2024Code
LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language ModelYixuan Yang, Junru Lu, Zixiang Zhao et al.
Designing 3D indoor layouts is a crucial task with significant applications in virtual reality, interior design, and automated space planning. Existing methods for 3D layout design either rely on diffusion models, which utilize spatial relationship priors, or heavily leverage the inferential capabilities of proprietary Large Language Models (LLMs), which require extensive prompt engineering and in-context exemplars via black-box trials. These methods often face limitations in generalization and dynamic scene editing. In this paper, we introduce LLplace, a novel 3D indoor scene layout designer based on lightweight fine-tuned open-source LLM Llama3. LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation based solely on user inputs specifying the room type and desired objects. We curated a new dialogue dataset based on the 3D-Front dataset, expanding the original data volume and incorporating dialogue data for adding and removing objects. This dataset can enhance the LLM's spatial understanding. Furthermore, through dialogue, LLplace activates the LLM's capability to understand 3D layouts and perform dynamic scene editing, enabling the addition and removal of objects. Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions. Code and dataset will be released.
CVMay 19, 2023Code
Equivariant Multi-Modality Image FusionZixiang Zhao, Haowen Bai, Jiangshe Zhang et al.
Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.
CVApr 14, 2021Code
Discrete Cosine Transform Network for Guided Depth Map Super-ResolutionZixiang Zhao, Jiangshe Zhang, Shuang Xu et al.
Guided depth super-resolution (GDSR) is an essential topic in multi-modal image processing, which reconstructs high-resolution (HR) depth maps from low-resolution ones collected with suboptimal conditions with the help of HR RGB images of the same scene. To solve the challenges in interpreting the working mechanism, extracting cross-modal features and RGB texture over-transferred, we propose a novel Discrete Cosine Transform Network (DCTNet) to alleviate the problems from three aspects. First, the Discrete Cosine Transform (DCT) module reconstructs the multi-channel HR depth features by using DCT to solve the channel-wise optimization problem derived from the image domain. Second, we introduce a semi-coupled feature extraction module that uses shared convolutional kernels to extract common information and private kernels to extract modality-specific information. Third, we employ an edge attention mechanism to highlight the contours informative for guided upsampling. Extensive quantitative and qualitative evaluations demonstrate the effectiveness of our DCTNet, which outperforms previous state-of-the-art methods with a relatively small number of parameters. The code is available at \url{https://github.com/Zhaozixiang1228/GDSR-DCTNet}.
CVMar 10, 2021Code
Deep Convolutional Sparse Coding Network for Pansharpening with Guidance of Side InformationShuang Xu, Jiangshe Zhang, Kai Sun et al.
Pansharpening is a fundamental issue in remote sensing field. This paper proposes a side information partially guided convolutional sparse coding (SCSC) model for pansharpening. The key idea is to split the low resolution multispectral image into a panchromatic image related feature map and a panchromatic image irrelated feature map, where the former one is regularized by the side information from panchromatic images. With the principle of algorithm unrolling techniques, the proposed model is generalized as a deep neural network, called as SCSC pansharpening neural network (SCSC-PNN). Compared with 13 classic and state-of-the-art methods on three satellites, the numerical experiments show that SCSC-PNN is superior to others. The codes are available at https://github.com/xsxjtu/SCSC-PNN.
CVMar 8, 2021Code
Deep Gradient Projection Networks for Pan-sharpeningShuang Xu, Jiangshe Zhang, Zixiang Zhao et al.
Pan-sharpening is an important technique for remote sensing imaging systems to obtain high resolution multispectral images. Recently, deep learning has become the most popular tool for pan-sharpening. This paper develops a model-based deep pan-sharpening approach. Specifically, two optimization problems regularized by the deep prior are formulated, and they are separately responsible for the generative models for panchromatic images and low resolution multispectral images. Then, the two problems are solved by a gradient projection algorithm, and the iterative steps are generalized into two network blocks. By alternatively stacking the two blocks, a novel network, called gradient projection based pan-sharpening neural network, is constructed. The experimental results on different kinds of satellite datasets demonstrate that the new network outperforms state-of-the-art methods both visually and quantitatively. The codes are available at https://github.com/xsxjtu/GPPNN.
LGJan 28
Continual GUI AgentsZiwei Liu, Borui Kang, Hangjie Yuan et al.
As digital environments (data distribution) are in flux, with new GUI data arriving over time-introducing new domains or resolutions-agents trained on static environments deteriorate in performance. In this work, we introduce Continual GUI Agents, a new task that requires GUI agents to perform continual learning under shifted domains and resolutions. We find existing methods fail to maintain stable grounding as GUI distributions shift over time, due to the diversity of UI interaction points and regions in fluxing scenarios. To address this, we introduce GUI-Anchoring in Flux (GUI-AiF), a new reinforcement fine-tuning framework that stabilizes continual learning through two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide the agents to align with shifting interaction points and regions, mitigating the tendency of existing reward strategies to over-adapt to static grounding cues (e.g., fixed coordinates or element scales). Extensive experiments show GUI-AiF surpasses state-of-the-art baselines. Our work establishes the first continual learning framework for GUI agents, revealing the untapped potential of reinforcement fine-tuning for continual GUI Agents.
CVDec 13, 2023
ReFusion: Learning Image Fusion from Reconstruction with Learnable Loss via Meta-LearningHaowen Bai, Zixiang Zhao, Jiangshe Zhang et al.
Image fusion aims to combine information from multiple source images into a single one with more comprehensive informational content. Deep learning-based image fusion algorithms face significant challenges, including the lack of a definitive ground truth and the corresponding distance measurement. Additionally, current manually defined loss functions limit the model's flexibility and generalizability for various fusion tasks. To address these limitations, we propose ReFusion, a unified meta-learning based image fusion framework that dynamically optimizes the fusion loss for various tasks through source image reconstruction. Compared to existing methods, ReFusion employs a parameterized loss function, that allows the training framework to be dynamically adapted according to the specific fusion scenario and task. ReFusion consists of three key components: a fusion module, a source reconstruction module, and a loss proposal module. We employ a meta-learning strategy to train the loss proposal module using the reconstruction loss. This strategy forces the fused image to be more conducive to reconstruct source images, allowing the loss proposal module to generate a adaptive fusion loss that preserves the optimal information from the source images. The update of the fusion module relies on the learnable fusion loss proposed by the loss proposal module. The three modules update alternately, enhancing each other to optimize the fusion loss for different tasks and consistently achieve satisfactory results. Extensive experiments demonstrate that ReFusion is capable of adapting to various tasks, including infrared-visible, medical, multi-focus, and multi-exposure image fusion.
CVDec 4, 2024
Task-driven Image Fusion with Learnable Fusion LossHaowen Bai, Jiangshe Zhang, Zixiang Zhao et al.
Multi-modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual features compared to single-source images, often improving downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the downstream task loss in a meta-learning manner. The learning objective is to minimize the task loss of fused images after optimizing the fusion module with the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion's training relies entirely on the downstream task loss, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion's performance through fusion experiments conducted on four different datasets, in addition to evaluations on semantic segmentation and object detection tasks.
CVFeb 3, 2025
Deep Unfolding Multi-modal Image Fusion Network via Attribution AnalysisHaowen Bai, Zixiang Zhao, Jiangshe Zhang et al.
Multi-modal image fusion synthesizes information from multiple sources into a single image, facilitating downstream tasks such as semantic segmentation. Current approaches primarily focus on acquiring informative fusion images at the visual display stratum through intricate mappings. Although some approaches attempt to jointly optimize image fusion and downstream tasks, these efforts often lack direct guidance or interaction, serving only to assist with a predefined fusion loss. To address this, we propose an ``Unfolding Attribution Analysis Fusion network'' (UAAFusion), using attribution analysis to tailor fused images more effectively for semantic segmentation, enhancing the interaction between the fusion and segmentation. Specifically, we utilize attribution analysis techniques to explore the contributions of semantic regions in the source images to task discrimination. At the same time, our fusion algorithm incorporates more beneficial features from the source images, thereby allowing the segmentation to guide the fusion process. Our method constructs a model-driven unfolding network that uses optimization objectives derived from attribution analysis, with an attribution fusion loss calculated from the current state of the segmentation network. We also develop a new pathway function for attribution analysis, specifically tailored to the fusion tasks in our unfolding network. An attribution attention mechanism is integrated at each network stage, allowing the fusion network to prioritize areas and pixels crucial for high-level recognition tasks. Additionally, to mitigate the information loss in traditional unfolding networks, a memory augmentation module is incorporated into our network to improve the information flow across various network layers. Extensive experiments demonstrate our method's superiority in image fusion and applicability to semantic segmentation.
CVFeb 3, 2025
Simultaneous Automatic Picking and Manual Picking Refinement for First-BreakHaowen Bai, Zixiang Zhao, Jiangshe Zhang et al.
First-break picking is a pivotal procedure in processing microseismic data for geophysics and resource exploration. Recent advancements in deep learning have catalyzed the evolution of automated methods for identifying first-break. Nevertheless, the complexity of seismic data acquisition and the requirement for detailed, expert-driven labeling often result in outliers and potential mislabeling within manually labeled datasets. These issues can negatively affect the training of neural networks, necessitating algorithms that handle outliers or mislabeled data effectively. We introduce the Simultaneous Picking and Refinement (SPR) algorithm, designed to handle datasets plagued by outlier samples or even noisy labels. Unlike conventional approaches that regard manual picks as ground truth, our method treats the true first-break as a latent variable within a probabilistic model that includes a first-break labeling prior. SPR aims to uncover this variable, enabling dynamic adjustments and improved accuracy across the dataset. This strategy mitigates the impact of outliers or inaccuracies in manual labels. Intra-site picking experiments and cross-site generalization experiments on publicly available data confirm our method's performance in identifying first-break and its generalization across different sites. Additionally, our investigations into noisy signals and labels underscore SPR's resilience to both types of noise and its capability to refine misaligned manual annotations. Moreover, the flexibility of SPR, not being limited to any single network architecture, enhances its adaptability across various deep learning-based picking methods. Focusing on learning from data that may contain outliers or partial inaccuracies, SPR provides a robust solution to some of the principal obstacles in automatic first-break picking.
CVDec 5, 2024
Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic ImageShuang Xu, Zixiang Zhao, Haowen Bai et al.
Hyperspectral images (HSIs) are frequently noisy and of low resolution due to the constraints of imaging devices. Recently launched satellites can concurrently acquire HSIs and panchromatic (PAN) images, enabling the restoration of HSIs to generate clean and high-resolution imagery through fusing PAN images for denoising and super-resolution. However, previous studies treated these two tasks as independent processes, resulting in accumulated errors. This paper introduces \textbf{H}yperspectral \textbf{I}mage Joint \textbf{Pand}enoising \textbf{a}nd Pan\textbf{s}harpening (Hipandas), a novel learning paradigm that reconstructs HRHS images from noisy low-resolution HSIs (LRHS) and high-resolution PAN images. The proposed zero-shot Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network, utilizing an HSI low-rank prior and a newly introduced detail-oriented low-rank prior. The interconnection of these networks complicates the training process, necessitating a two-stage training strategy to ensure effective training. Experimental results on both simulated and real-world datasets indicate that the proposed method surpasses state-of-the-art algorithms, yielding more accurate and visually pleasing HRHS images.
CVNov 17, 2025
ArtiWorld: LLM-Driven Articulation of 3D Objects in ScenesYixuan Yang, Luyang Xie, Zhen Luo et al.
Building interactive simulators and scalable robot-learning environments requires a large number of articulated assets. However, most existing 3D assets in simulation are rigid, and manually converting them into articulated objects is extremely labor- and cost-intensive. This raises a natural question: can we automatically identify articulable objects in a scene and convert them into articulated assets directly? In this paper, we present ArtiWorld, a scene-aware pipeline that localizes candidate articulable objects from textual scene descriptions and reconstructs executable URDF models that preserve the original geometry. At the core of this pipeline is Arti4URDF, which leverages 3D point cloud, prior knowledge of a large language model (LLM), and a URDF-oriented prompt design to rapidly convert rigid objects into interactive URDF-based articulated objects while maintaining their 3D shape. We evaluate ArtiWorld at three levels: 3D simulated objects, full 3D simulated scenes, and real-world scan scenes. Across all three settings, our method consistently outperforms existing approaches and achieves state-of-the-art performance, while preserving object geometry and correctly capturing object interactivity to produce usable URDF-based articulated models. This provides a practical path toward building interactive, robot-ready simulation environments directly from existing 3D assets. Code and data will be released.
LGMay 29, 2025
Towards Understanding The Calibration Benefits of Sharpness-Aware MinimizationChengli Tan, Yubo Zhou, Haishan Ye et al.
Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, unlike standard training such as stochastic gradient descent, we show that the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches
CVApr 18, 2025
RefComp: A Reference-guided Unified Framework for Unpaired Point Cloud CompletionYixuan Yang, Jinyu Yang, Zixiang Zhao et al.
The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clouds of generic 3D objects. In this paper, we propose a novel unpaired point cloud completion framework, namely the Reference-guided Completion (RefComp) framework, which attains strong performance in both the class-aware and class-agnostic training settings. The RefComp framework transforms the unpaired completion problem into a shape translation problem, which is solved in the latent feature space of the partial point clouds. To this end, we introduce the use of partial-complete point cloud pairs, which are retrieved by using the partial point cloud to be completed as a template. These point cloud pairs are used as reference data to guide the completion process. Our RefComp framework uses a reference branch and a target branch with shared parameters for shape fusion and shape translation via a Latent Shape Fusion Module (LSFM) to enhance the structural features along the completion pipeline. Extensive experiments demonstrate that the RefComp framework achieves not only state-of-the-art performance in the class-aware training setting but also competitive results in the class-agnostic training setting on both virtual scans and real-world datasets.
CVDec 31, 2020
FGF-GAN: A Lightweight Generative Adversarial Network for Pansharpening via Fast Guided FilterZixiang Zhao, Jiangshe Zhang, Shuang Xu et al.
Pansharpening is a widely used image enhancement technique for remote sensing. Its principle is to fuse the input high-resolution single-channel panchromatic (PAN) image and low-resolution multi-spectral image and to obtain a high-resolution multi-spectral (HRMS) image. The existing deep learning pansharpening method has two shortcomings. First, features of two input images need to be concatenated along the channel dimension to reconstruct the HRMS image, which makes the importance of PAN images not prominent, and also leads to high computational cost. Second, the implicit information of features is difficult to extract through the manually designed loss function. To this end, we propose a generative adversarial network via the fast guided filter (FGF) for pansharpening. In generator, traditional channel concatenation is replaced by FGF to better retain the spatial information while reducing the number of parameters. Meanwhile, the fusion objects can be highlighted by the spatial attention module. In addition, the latent information of features can be preserved effectively through adversarial training. Numerous experiments illustrate that our network generates high-quality HRMS images that can surpass existing methods, and with fewer parameters.
CVDec 16, 2020
Domain Adaptive Object Detection via Feature Separation and AlignmentChengyang Liang, Zixiang Zhao, Junmin Liu et al.
Recently, adversarial-based domain adaptive object detection (DAOD) methods have been developed rapidly. However, there are two issues that need to be resolved urgently. Firstly, numerous methods reduce the distributional shifts only by aligning all the feature between the source and target domain, while ignoring the private information of each domain. Secondly, DAOD should consider the feature alignment on object existing regions in images. But redundancy of the region proposals and background noise could reduce the domain transferability. Therefore, we establish a Feature Separation and Alignment Network (FSANet) which consists of a gray-scale feature separation (GSFS) module, a local-global feature alignment (LGFA) module and a region-instance-level alignment (RILA) module. The GSFS module decomposes the distractive/shared information which is useless/useful for detection by a dual-stream framework, to focus on intrinsic feature of objects and resolve the first issue. Then, LGFA and RILA modules reduce the distributional shifts of the multi-level features. Notably, scale-space filtering is exploited to implement adaptive searching for regions to be aligned, and instance-level features in each region are refined to reduce redundancy and noise mentioned in the second issue. Various experiments on multiple benchmark datasets prove that our FSANet achieves better performance on the target domain detection and surpasses the state-of-the-art methods.
CVSep 21, 2020
MFIF-GAN: A New Generative Adversarial Network for Multi-Focus Image FusionYicheng Wang, Shuang Xu, Junmin Liu et al.
Multi-Focus Image Fusion (MFIF) is a promising image enhancement technique to obtain all-in-focus images meeting visual needs and it is a precondition of other computer vision tasks. One of the research trends of MFIF is to avoid the defocus spread effect (DSE) around the focus/defocus boundary (FDB). In this paper,we propose a network termed MFIF-GAN to attenuate the DSE by generating focus maps in which the foreground region are correctly larger than the corresponding objects. The Squeeze and Excitation Residual module is employed in the network. By combining the prior knowledge of training condition, this network is trained on a synthetic dataset based on an α-matte model. In addition, the reconstruction and gradient regularization terms are combined in the loss functions to enhance the boundary details and improve the quality of fused images. Extensive experiments demonstrate that the MFIF-GAN outperforms several state-of-the-art (SOTA) methods in visual perception, quantitative analysis as well as efficiency. Moreover, the edge diffusion and contraction module is firstly proposed to verify that focus maps generated by our method are accurate at the pixel level.
IVSep 2, 2020
When Image Decomposition Meets Deep Learning: A Novel Infrared and Visible Image Fusion MethodZixiang Zhao, Jiangshe Zhang, Shuang Xu et al.
Infrared and visible image fusion, as a hot topic in image processing and image enhancement, aims to produce fused images retaining the detail texture information in visible images and the thermal radiation information in infrared images. A critical step for this issue is to decompose features in different scales and to merge them separately. In this paper, we propose a novel dual-stream auto-encoder (AE) based fusion network. The core idea is that the encoder decomposes an image into base and detail feature maps with low- and high-frequency information, respectively, and that the decoder is responsible for the original image reconstruction. To this end, a well-designed loss function is established to make the base/detail feature maps similar/dissimilar. In the test phase, base and detail feature maps are respectively merged via an additional fusion layer, which contains a saliency weighted-based spatial attention module and a channel attention module to adaptively preserve more information from source images and to highlight the objects. Then the fused image is recovered by the decoder. Qualitative and quantitative results demonstrate that our method can generate fusion images containing highlighted targets and abundant detail texture information with strong reproducibility and meanwhile is superior to the state-of-the-art (SOTA) approaches.
IVMay 18, 2020
Deep Convolutional Sparse Coding Networks for Image FusionShuang Xu, Zixiang Zhao, Yicheng Wang et al.
Image fusion is a significant problem in many fields including digital photography, computational imaging and remote sensing, to name but a few. Recently, deep learning has emerged as an important tool for image fusion. This paper presents three deep convolutional sparse coding (CSC) networks for three kinds of image fusion tasks (i.e., infrared and visible image fusion, multi-exposure image fusion, and multi-modal image fusion). The CSC model and the iterative shrinkage and thresholding algorithm are generalized into dictionary convolution units. As a result, all hyper-parameters are learned from data. Our extensive experiments and comprehensive comparisons reveal the superiority of the proposed networks with regard to quantitative evaluation and visual inspection.
CVMay 12, 2020
Efficient and Model-Based Infrared and Visible Image Fusion Via Algorithm UnrollingZixiang Zhao, Shuang Xu, Jiangshe Zhang et al.
Infrared and visible image fusion (IVIF) expects to obtain images that retain thermal radiation information from infrared images and texture details from visible images. In this paper, a model-based convolutional neural network (CNN) model, referred to as Algorithm Unrolling Image Fusion (AUIF), is proposed to overcome the shortcomings of traditional CNN-based IVIF models. The proposed AUIF model starts with the iterative formulas of two traditional optimization models, which are established to accomplish two-scale decomposition, i.e., separating low-frequency base information and high-frequency detail information from source images. Then the algorithm unrolling is implemented where each iteration is mapped to a CNN layer and each optimization model is transformed into a trainable neural network. Compared with the general network architectures, the proposed framework combines the model-based prior information and is designed more reasonably. After the unrolling operation, our model contains two decomposers (encoders) and an additional reconstructor (decoder). In the training phase, this network is trained to reconstruct the input image. While in the test phase, the base (or detail) decomposed feature maps of infrared/visible images are merged respectively by an extra fusion layer, and then the decoder outputs the fusion image. Qualitative and quantitative comparisons demonstrate the superiority of our model, which can robustly generate fusion images containing highlight targets and legible details, exceeding the state-of-the-art methods. Furthermore, our network has fewer weights and faster speed.
CVMay 12, 2020
Bayesian Fusion for Infrared and Visible ImagesZixiang Zhao, Shuang Xu, Chunxia Zhang et al.
Infrared and visible image fusion has been a hot issue in image fusion. In this task, a fused image containing both the gradient and detailed texture information of visible images as well as the thermal radiation and highlighting targets of infrared images is expected to be obtained. In this paper, a novel Bayesian fusion model is established for infrared and visible images. In our model, the image fusion task is cast into a regression problem. To measure the variable uncertainty, we formulate the model in a hierarchical Bayesian manner. Aiming at making the fused image satisfy human visual system, the model incorporates the total-variation(TV) penalty. Subsequently, the model is efficiently inferred by the expectation-maximization(EM) algorithm. We test our algorithm on TNO and NIR image fusion datasets with several state-of-the-art approaches. Compared with the previous methods, the novel model can generate better fused images with high-light targets and rich texture details, which can improve the reliability of the target automatic detection and recognition system.
IVMar 20, 2020
DIDFuse: Deep Image Decomposition for Infrared and Visible Image FusionZixiang Zhao, Shuang Xu, Chunxia Zhang et al.
Infrared and visible image fusion, a hot topic in the field of image processing, aims at obtaining fused images keeping the advantages of source images. This paper proposes a novel auto-encoder (AE) based fusion network. The core idea is that the encoder decomposes an image into background and detail feature maps with low- and high-frequency information, respectively, and that the decoder recovers the original image. To this end, the loss function makes the background/detail feature maps of source images similar/dissimilar. In the test phase, background and detail feature maps are respectively merged via a fusion module, and the fused image is recovered by the decoder. Qualitative and quantitative results illustrate that our method can generate fusion images containing highlighted targets and abundant detail texture information with strong robustness and meanwhile surpass state-of-the-art (SOTA) approaches.