Guiduo Duan

CV
h-index14
9papers
36citations
Novelty62%
AI Score59

9 Papers

LGNov 19, 2022Code
Unifying Label-inputted Graph Neural Networks with Deep Equilibrium Models

Yi Luo, Guiduo Duan, Guangchun Luo et al.

The success of Graph Neural Networks (GNN) in learning on non-Euclidean data arouses many subtopics, such as Label-inputted GNN (LGNN) and Implicit GNN (IGNN). LGNN, explicitly inputting supervising information (a.k.a. labels) in GNN, integrates label propagation to achieve superior performance, but with the dilemma between its propagating distance and adaptiveness. IGNN, outputting an equilibrium point by iterating its network infinite times, exploits information in the entire graph to capture long-range dependencies, but with its network constrained to guarantee the existence of the equilibrium. This work unifies the two subdomains by interpreting LGNN in the theory of IGNN and reducing prevailing LGNNs to the form of IGNN. The unification facilitates the exchange between the two subdomains and inspires more studies. Specifically, implicit differentiation of IGNN is introduced to LGNN to differentiate its infinite-range label propagation with constant memory, making the propagation both distant and adaptive. Besides, the masked label strategy of LGNN is proven able to guarantee the well-posedness of IGNN in a network-agnostic manner, granting its network more complex and thus more expressive. Combining the advantages of LGNN and IGNN, Label-inputted Implicit GNN (LI-GNN) is proposed. It can be widely applied to any specific GNN to boost its performance. Node classification experiments on two synthesized and six real-world datasets demonstrate its effectiveness. Code is available at https://github.com/cf020031308/LI-GNN

CVMar 20Code
Unbiased Dynamic Multimodal Fusion

Shicai Wei, Kaijie Zhang, Luyi Chen et al.

Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Thus, dynamical multimodal methods are proposed to assess modality quality and adjust their contribution accordingly. However, they typically rely on empirical metrics, failing to measure the modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the modality hard to learn would be doubly penalized, and the performance of dynamical fusion could be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measure across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at https://github.com/shicaiwei123/UDML.

LGDec 22, 2025Code
Outlier detection in mixed-attribute data: a semi-supervised approach with fuzzy approximations and relative entropy

Baiyang Chen, Zhong Yuan, Zheng Liu et al.

Outlier detection is a critical task in data mining, aimed at identifying objects that significantly deviate from the norm. Semi-supervised methods improve detection performance by leveraging partially labeled data but typically overlook the uncertainty and heterogeneity of real-world mixed-attribute data. This paper introduces a semi-supervised outlier detection method, namely fuzzy rough sets-based outlier detection (FROD), to effectively handle these challenges. Specifically, we first utilize a small subset of labeled data to construct fuzzy decision systems, through which we introduce the attribute classification accuracy based on fuzzy approximations to evaluate the contribution of attribute sets in outlier detection. Unlabeled data is then used to compute fuzzy relative entropy, which provides a characterization of outliers from the perspective of uncertainty. Finally, we develop the detection algorithm by combining attribute classification accuracy with fuzzy relative entropy. Experimental results on 16 public datasets show that FROD is comparable with or better than leading detection algorithms. All datasets and source codes are accessible at https://github.com/ChenBaiyang/FROD. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at https://doi.org/10.1016/j.ijar.2025.109373

LGMay 7, 2025Code
Reliable Disentanglement Multi-view Learning Against View Adversarial Attacks

Xuyang Wang, Siyuan Duan, Qizhi Li et al.

Trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. However, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, thereby deceiving or disrupting multi-view models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this tricky problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of corresponding evidences, which is extracted by a pretrained evidence extractor. Then, we employ the feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potential informative features from them. Finally, to further ignore the irreparable adversarial interferences, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that RDML outperforms the state-of-the-art methods by a relatively large margin. Our code is available at https://github.com/Willy1005/2025-IJCAI-RDML.

LGDec 26, 2025
LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Xudong Ling, Tianxi Huang, Qian Dong et al.

Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework(LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space.We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60 \% and 19\% gains in heavy-rainfall CSI at an 80-minute lead time.

CVJul 8, 2025
SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning

Xin Hu, Ke Qin, Guiduo Duan et al.

Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model's inversion process in preserving the spatial structure of input images, we propose SPADE (SPatial-Aware Denoising-nEtwork) framework -- a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.

CVNov 19, 2025
TiCAL:Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition

Wen Yin, Siyu Zhan, Cencen Liu et al.

Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g, CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, e.g., with about 2.6% improvements over the state-of-the-art DMD.

IVSep 19, 2025
Uncertainty-Gated Deformable Network for Breast Tumor Segmentation in MR Images

Yue Zhang, Jiahua Dong, Chengtao Peng et al.

Accurate segmentation of breast tumors in magnetic resonance images (MRI) is essential for breast cancer diagnosis, yet existing methods face challenges in capturing irregular tumor shapes and effectively integrating local and global features. To address these limitations, we propose an uncertainty-gated deformable network to leverage the complementary information from CNN and Transformers. Specifically, we incorporates deformable feature modeling into both convolution and attention modules, enabling adaptive receptive fields for irregular tumor contours. We also design an Uncertainty-Gated Enhancing Module (U-GEM) to selectively exchange complementary features between CNN and Transformer based on pixel-wise uncertainty, enhancing both local and global representations. Additionally, a Boundary-sensitive Deep Supervision Loss is introduced to further improve tumor boundary delineation. Comprehensive experiments on two clinical breast MRI datasets demonstrate that our method achieves superior segmentation performance compared with state-of-the-art methods, highlighting its clinical potential for accurate breast tumor delineation.

CVJan 26, 2024
Towards Lifelong Scene Graph Generation with Knowledge-ware In-context Prompt Learning

Tao He, Tongtong Wu, Dongyang Zhang et al.

Scene graph generation (SGG) endeavors to predict visual relationships between pairs of objects within an image. Prevailing SGG methods traditionally assume a one-off learning process for SGG. This conventional paradigm may necessitate repetitive training on all previously observed samples whenever new relationships emerge, mitigating the risk of forgetting previously acquired knowledge. This work seeks to address this pitfall inherent in a suite of prior relationship predictions. Motivated by the achievements of in-context learning in pretrained language models, our approach imbues the model with the capability to predict relationships and continuously acquire novel knowledge without succumbing to catastrophic forgetting. To achieve this goal, we introduce a novel and pragmatic framework for scene graph generation, namely Lifelong Scene Graph Generation (LSGG), where tasks, such as predicates, unfold in a streaming fashion. In this framework, the model is constrained to exclusive training on the present task, devoid of access to previously encountered training data, except for a limited number of exemplars, but the model is tasked with inferring all predicates it has encountered thus far. Rigorous experiments demonstrate the superiority of our proposed method over state-of-the-art SGG models in the context of LSGG across a diverse array of metrics. Besides, extensive experiments on the two mainstream benchmark datasets, VG and Open-Image(v6), show the superiority of our proposed model to a number of competitive SGG models in terms of continuous learning and conventional settings. Moreover, comprehensive ablation experiments demonstrate the effectiveness of each component in our model.