Fang Chen

LG
h-index 41
86 papers
1,266 citations
Novelty 47%
AI Score 57

86 Papers

IV · May 29
A physics-informed foundation model for quantitative diffusion MRI

Zihan Li, Jialan Zheng, Ziyu Li et al.

Understanding the human brain requires access to its microscopic tissue architecture. Diffusion magnetic resonance imaging (MRI) provides the only noninvasive window into whole-brain microstructure in vivo, yet reliable quantitative mapping remains confined to specialized research settings requiring dense sampling and optimized acquisition protocols. To address this gap, we present a physics-informed generative microstructure network (PIGMENT) that learns a universal generative prior of human brain microstructure and adapts it zero-shot to each participant's measured data to recover subject-specific maps. Trained on 11375 scans spanning multiple sites, vendors, and field strengths, PIGMENT enabled reliable quantitative mapping for tensor, kurtosis, and NODDI models across external datasets from five independent centers. It remains effective where conventional fitting becomes unreliable, recovering meaningful maps from extremely sparse acquisitions while supporting downstream tractography and structural connectivity mapping. PIGMENT estimates demonstrated strong biological validity, preserving submillimeter cortical microarchitectural patterns and early-childhood white matter developmental trajectories from 10-fold accelerated scans. Furthermore, PIGMENT enables reliable quantitative tensor mapping on cost-efficient low-field systems and the extraction of tumor-related biomarkers using ultra-fast clinical protocols. Together, these results establish PIGMENT as a physics-informed foundation model that extends quantitative diffusion MRI into regimes traditionally too sparse, heterogeneous, or clinically constrained for reliable analysis.

LG · Mar 13, 2023
Meta-learning approaches for few-shot learning: A survey of recent advances

Hassan Gharoun, Fereshteh Momenifar, Fang Chen et al.

Despite its astounding success in learning deeper multi-dimensional data, the performance of deep learning declines on new unseen tasks mainly due to its focus on same-distribution prediction. Moreover, deep learning is notorious for poor generalization from few samples. Meta-learning is a promising approach that addresses these issues by adapting to new tasks with few-shot datasets. This survey first briefly introduces meta-learning and then investigates state-of-the-art meta-learning methods and recent advances in (I) metric-based, (II) memory-based, and (III) learning-based methods. Finally, current challenges and insights for future research are discussed.

CV · Sep 16, 2022
Self-Attentive Pooling for Efficient Deep Learning

Fang Chen, Gourav Datta, Souvik Kundu et al. · tsinghua

Efficient custom pooling techniques that can aggressively trim the dimensions of a feature map and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement to the standard pooling layers, such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential soft-max. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over the state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With the aggressive down-sampling of the activation maps in the initial layers (providing up to 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy compared to SOTA techniques with iso-memory footprints. This enables the deployment of our models in memory-constrained devices, such as micro-controllers (without losing significant accuracy), because the initial activation maps consume a significant amount of on-chip memory for high-resolution images required for complex vision tasks. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.

CV · Dec 14, 2022
Domain Generalization by Learning and Removing Domain-specific Features

Yu Ding, Lei Wang, Bin Liang et al.

Deep Neural Networks (DNNs) suffer from domain shift when the test dataset follows a distribution different from the training dataset. Domain generalization aims to tackle this issue by learning a model that can generalize to unseen domains. In this paper, we propose a new approach that aims to explicitly remove domain-specific features for domain generalization. Following this approach, we propose a novel framework called Learning and Removing Domain-specific features for Generalization (LRDG) that learns a domain-invariant model by tactically removing domain-specific features from the input images. Specifically, we design a classifier to effectively learn the domain-specific features for each source domain, respectively. We then develop an encoder-decoder network to map each input image into a new image space where the learned domain-specific features are removed. With the images output by the encoder-decoder network, another classifier is designed to learn the domain-invariant features to conduct image classification. Extensive experiments demonstrate that our framework achieves superior performance compared with state-of-the-art methods.

CV · Dec 19, 2022
DGNet: Distribution Guided Efficient Learning for Oil Spill Image Segmentation

Fang Chen, Heiko Balzter, Feixiang Zhou et al.

Successful implementation of oil spill segmentation in Synthetic Aperture Radar (SAR) images is vital for marine environmental protection. In this paper, we develop an effective segmentation framework named DGNet, which performs oil spill segmentation by incorporating the intrinsic distribution of backscatter values in SAR images. Specifically, our proposed segmentation network is constructed with two deep neural modules running in an interactive manner, where one is the inference module to achieve latent feature variable inference from SAR images, and the other is the generative module to produce oil spill segmentation maps by taking the latent feature variables as inputs. Thus, to yield accurate segmentation, we take into account the intrinsic distribution of backscatter values in SAR images and embed it in our segmentation model. The intrinsic distribution originates from SAR imagery, describing the physical characteristics of oil spills. In the training process, the formulated intrinsic distribution guides efficient learning of optimal latent feature variable inference for oil spill segmentation. The efficient learning enables the training of our proposed DGNet with a small amount of image data. This is economically beneficial to oil spill segmentation where the availability of oil spill SAR image data is limited in practice. Additionally, benefiting from optimal latent feature variable inference, our proposed DGNet performs accurate oil spill segmentation. We evaluate the segmentation performance of our proposed DGNet with different metrics, and experimental evaluations demonstrate its effectiveness.

CV · May 29
Astra: a generalizable report generation foundation model for 3D computed tomography

Zhuhao Wang, Fang Chen, Chaohui Yu et al.

CT interpretation requires radiologists to review hundreds of volumetric slices per examination, making reporting time-consuming and highly expertise-dependent. Automated CT report generation offers a promising route to improving clinical efficiency, yet the field still lacks a generalizable CT report generation foundation model that supports multi-region reporting and remains robust across external real-world cohorts. Intrinsic inconsistencies in reporting style and diagnostic terminology across cohorts make naive joint training prone to noisy textual supervision, thereby limiting model generalizability. Here we present Astra, a generalizable CT report generation foundation model trained on 90,678 thoracoabdominal CT-report pairs (CTRgDB) with 353,671 abnormalities spanning eight organ systems. By harmonizing report style and further refining diagnostic consistency via reinforcement learning, Astra achieves style-consistent and diagnostically accurate report generation across diverse anatomical regions and institutions. Evaluated on CTRgDB and six external cohorts, Astra achieves state-of-the-art performance with a 44.1% average improvement in fine-grained diagnostic metrics (P<0.001). In real-world clinical workflows, Astra assistance accelerates chest report drafting by 29.6% and improves abdominal report completeness by 11.3% (P<0.001). Furthermore, Astra demonstrates broad utility as a foundation for CT AI development, improving downstream diagnostic performance and scaling vision-language pretraining through high-quality report synthesis. Overall, Astra serves as a broadly accessible clinical assistant and a pivotal infrastructure for the next generation of AI-powered healthcare.

LG · Jul 26, 2022
A Survey of Explainable Graph Neural Networks: Taxonomy and Evaluation Metrics

Yiqiao Li, Jianlong Zhou, Sunny Verma et al.

Graph neural networks (GNNs) have demonstrated a significant boost in prediction performance on graph data. At the same time, the predictions made by these models are often hard to interpret. In that regard, many efforts have been made to explain the prediction mechanisms of these models, with methods such as GNNExplainer, XGNN and PGExplainer. Although such works present systematic frameworks to interpret GNNs, a holistic review of explainable GNNs is unavailable. In this survey, we present a comprehensive review of explainability techniques developed for GNNs. We focus on explainable graph neural networks and categorize them based on the use of explainable methods. We further provide the common performance metrics for GNN explanations and point out several future research directions.

CV · Aug 7, 2022
Cross-Skeleton Interaction Graph Aggregation Network for Representation Learning of Mouse Social Behaviour

Feixiang Zhou, Xinyu Yang, Fang Chen et al.

Automated social behaviour analysis of mice has become an increasingly popular research area in behavioural neuroscience. Recently, pose information (i.e., locations of keypoints or skeleton) has been used to interpret social behaviours of mice. Nevertheless, effective encoding and decoding of social interaction information underlying the keypoints of mice has been rarely investigated in the existing methods. In particular, it is challenging to model complex social interactions between mice due to highly deformable body shapes and ambiguous movement patterns. To deal with the interaction modelling problem, we here propose a Cross-Skeleton Interaction Graph Aggregation Network (CS-IGANet) to learn abundant dynamics of freely interacting mice, where a Cross-Skeleton Node-level Interaction module (CS-NLI) is used to model multi-level interactions (i.e., intra-, inter- and cross-skeleton interactions). Furthermore, we design a novel Interaction-Aware Transformer (IAT) to dynamically learn the graph-level representation of social behaviours and update the node-level representation, guided by our proposed interaction-aware self-attention mechanism. Finally, to enhance the representation ability of our model, an auxiliary self-supervised learning task is proposed for measuring the similarity between cross-skeleton nodes. Experimental results on the standard CRMI13-Skeleton and our PDMB-Skeleton datasets show that our proposed model outperforms several other state-of-the-art approaches.

IV · Jul 13, 2023
Full-resolution Lung Nodule Segmentation from Chest X-ray Images using Residual Encoder-Decoder Networks

Michael James Horry, Subrata Chakraborty, Biswajeet Pradhan et al.

Lung cancer is the leading cause of cancer death and early diagnosis is associated with a positive prognosis. Chest X-ray (CXR) provides an inexpensive imaging mode for lung cancer diagnosis. Suspicious nodules are difficult to distinguish from vascular and bone structures using CXR. Computer vision has previously been proposed to assist human radiologists in this task; however, leading studies use down-sampled images and computationally expensive methods with unproven generalization. Instead, this study localizes lung nodules using efficient encoder-decoder neural networks that process full resolution images to avoid any signal loss resulting from down-sampling. Encoder-decoder networks are trained and tested using the JSRT lung nodule dataset. The networks are used to localize lung nodules from an independent external CXR dataset. Sensitivity and false positive rates are measured using an automated framework to eliminate any observer subjectivity. These experiments allow for the determination of the optimal network depth, image resolution and pre-processing pipeline for generalized lung nodule localization. We find that nodule localization is influenced by subtlety, with more subtle nodules being detected in earlier training epochs. Therefore, we propose a novel self-ensemble model from three consecutive epochs centered on the validation optimum. This ensemble achieved a sensitivity of 85% in 10-fold internal testing with false positives of 8 per image. A sensitivity of 81% is achieved at a false positive rate of 6 following morphological false positive reduction. This result is comparable to more computationally complex systems based on linear and spatial filtering, but with a sub-second inference time that is faster than other methods. The proposed algorithm achieved excellent generalization results against an external dataset with sensitivity of 77% at a false positive rate of 7.6.
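
The self-ensemble idea in this abstract (combining three consecutive epochs centered on the validation optimum) can be illustrated with a minimal sketch. The abstract does not say how the three checkpoints are combined, so simple averaging of per-pixel predictions is an assumption here, and the names `best_epoch`, `self_ensemble`, `epoch_preds`, and `val_scores` are hypothetical.

```python
from typing import List, Sequence


def best_epoch(val_scores: Sequence[float]) -> int:
    """Return the index of the epoch with the highest validation score."""
    return max(range(len(val_scores)), key=lambda i: val_scores[i])


def self_ensemble(epoch_preds: Sequence[Sequence[float]],
                  val_scores: Sequence[float]) -> List[float]:
    """Average predictions from the epochs before, at, and after the optimum.

    Averaging is an assumed combination rule, not taken from the paper.
    """
    k = best_epoch(val_scores)
    # Clamp the three-epoch window so it stays inside the recorded epochs.
    lo = max(0, min(k - 1, len(epoch_preds) - 3))
    window = epoch_preds[lo:lo + 3]
    n = len(window)
    return [sum(p[i] for p in window) / n for i in range(len(window[0]))]


# Toy example: four epochs, two "pixels" per prediction.
preds = [[0.0, 0.0], [0.3, 0.6], [0.6, 0.9], [0.9, 0.3]]
scores = [0.5, 0.7, 0.9, 0.8]  # validation optimum at epoch index 2
ensemble = self_ensemble(preds, scores)
```

In this toy run the window covers epoch indices 1 to 3, so each output value is the mean of three checkpoint predictions.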

LG · May 10, 2022
Incident duration prediction using a bi-level machine learning framework with outlier removal and intra-extra joint optimisation

Artur Grigorev, Adriana-Simona Mihaita, Seunghyeon Lee et al.

Predicting the duration of traffic incidents is a challenging task due to the stochastic nature of events. The ability to accurately predict how long accidents will last can provide significant benefits to both end-users in their route choice and traffic operation managers in handling of non-recurrent traffic congestion. This paper presents a novel bi-level machine learning framework enhanced with outlier removal and intra-extra joint optimisation for predicting the incident duration on three heterogeneous data sets collected for both arterial roads and motorways from Sydney, Australia and San Francisco, U.S.A. Firstly, we use incident data logs to develop a binary classification prediction approach, which allows us to classify traffic incidents as short-term or long-term. We find the optimal threshold between short-term versus long-term traffic incident duration, targeting both class balance and prediction performance while also comparing the binary versus multi-class classification approaches. Secondly, for more granularity of the incident duration prediction to the minute level, we propose a new Intra-Extra Joint Optimisation algorithm (IEO-ML) which extends multiple baseline ML models tested against several regression scenarios across the data sets. Final results indicate that: a) 40-45 min is the best split threshold for identifying short versus long-term incidents and that these incidents should be modelled separately, b) our proposed IEO-ML approach significantly outperforms baseline ML models in 66% of all cases, showcasing its great potential for accurate incident duration prediction. Lastly, we evaluate the feature importance and show that time, location, incident type, incident reporting source and weather are among the top 10 critical factors which influence how long incidents will last.
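
The bi-level structure described above (first classify an incident as short or long, then apply a separate duration model per class) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's IEO-ML algorithm: the threshold search below considers class balance only, whereas the abstract also targets prediction performance, and every name, feature, and number in the example is hypothetical.

```python
from typing import Callable, Dict, Sequence


def class_imbalance(durations: Sequence[float], threshold: float) -> float:
    """Return 0.0 for a perfect 50/50 short/long split and 1.0 for a one-sided split."""
    short = sum(1 for d in durations if d <= threshold)
    return abs(2 * short - len(durations)) / len(durations)


def pick_threshold(durations: Sequence[float],
                   candidates: Sequence[float]) -> float:
    """Pick the candidate threshold (minutes) that best balances the two classes."""
    return min(candidates, key=lambda t: class_imbalance(durations, t))


def bilevel_predict(features: dict,
                    classify: Callable[[dict], str],
                    regressors: Dict[str, Callable[[dict], float]]) -> float:
    """Level 1 chooses the class; level 2 applies that class's duration model."""
    label = classify(features)
    return regressors[label](features)


# Toy data: incident durations in minutes (made up for illustration).
durations = [10, 20, 30, 40, 50, 60, 90, 120]
threshold = pick_threshold(durations, candidates=[20, 45, 60])

# Stand-in models; a real pipeline would use trained classifiers and regressors.
classify = lambda f: "long" if f["lanes_blocked"] >= 2 else "short"
regressors = {"short": lambda f: 25.0, "long": lambda f: 80.0}
estimate = bilevel_predict({"lanes_blocked": 3}, classify, regressors)
```

Keeping the two regressors separate mirrors the abstract's finding that short and long incidents should be modelled separately.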

FLU-DYN · Jun 15, 2022
Detection of magnetohydrodynamic waves by using machine learning

Fang Chen, Ravi Samtaney

Nonlinear wave interactions, such as shock refraction at an inclined density interface, in magnetohydrodynamics (MHD) lead to a plethora of wave patterns with myriad wave types. Identification of different types of MHD waves is an important and challenging task in such complex wave patterns. Moreover, owing to the multiplicity of solutions and their admissibility for different systems, especially for intermediate-type MHD shock waves, the identification of MHD wave types is complicated if one solely relies on the Rankine-Hugoniot jump conditions. MHD wave detection is further complicated by the unphysical smearing of discontinuous shock waves in numerical simulations. We present two MHD wave detection methods based on a convolutional neural network (CNN) which enables the classification of waves and identification of their locations. The first method separates the output into a regression (location prediction) and a classification problem assuming the number of waves for each training data is fixed. In the second method, the number of waves is not specified a priori and the algorithm, using only regression, predicts the waves' locations and classifies their types. The first, fixed-output model efficiently provides high precision and recall: the accuracy of the entire neural network reaches up to 0.99, and the classification accuracy of some waves approaches unity. The second detection model has relatively lower performance, with more sensitivity to the setting of parameters, such as the number of grid cells N_{grid} and the thresholds of confidence score and class probability. The proposed two methods demonstrate very strong potential to be applied for MHD wave detection in some complex wave structures and interactions.

CV · Apr 17, 2023
SRCNet: Seminal Representation Collaborative Network for Marine Oil Spill Segmentation

Fang Chen, Heiko Balzter, Peng Ren et al.

Effective oil spill segmentation in Synthetic Aperture Radar (SAR) images is critical for marine oil pollution cleanup, and proper image representation is helpful for accurate image segmentation. In this paper, we propose an effective oil spill image segmentation network named SRCNet by leveraging SAR image representation and the training for oil spill segmentation simultaneously. Specifically, our proposed segmentation network is constructed with a pair of deep neural nets with the collaboration of the seminal representation that describes SAR images, where one deep neural net is the generative net which strives to produce oil spill segmentation maps, and the other is the discriminative net which tries its best to distinguish between the produced and the true segmentations; together they form a two-player game. Particularly, the seminal representation exploited in our proposed SRCNet originates from SAR imagery, modelling the internal characteristics of SAR images. Thus, in the training process, the collaborated seminal representation empowers the mapped generative net to produce accurate oil spill segmentation maps efficiently with a small amount of training data, helping the discriminative net reach its optimal solution quickly. Therefore, our proposed SRCNet performs effective oil spill segmentation in an economical and efficient manner. Additionally, to increase the segmentation capability of the proposed segmentation network in terms of accurately delineating oil spill details in SAR images, a regularisation term that penalises the segmentation loss is devised. This encourages our proposed SRCNet to accurately segment oil spill areas from SAR images. Empirical experimental evaluations from different metrics validate the effectiveness of our proposed SRCNet for oil spill image segmentation.

LG · Sep 29, 2023
ACGAN-GNNExplainer: Auxiliary Conditional Generative Explainer for Graph Neural Networks

Yiqiao Li, Jianlong Zhou, Yifei Dong et al.

Graph neural networks (GNNs) have proven their efficacy in a variety of real-world applications, but their underlying mechanisms remain a mystery. To address this challenge and enable reliable decision-making, many GNN explainers have been proposed in recent years. However, these methods often encounter limitations, including their dependence on specific instances, lack of generalizability to unseen graphs, producing potentially invalid explanations, and yielding inadequate fidelity. To overcome these limitations, we, in this paper, introduce the Auxiliary Classifier Generative Adversarial Network (ACGAN) into the field of GNN explanation and propose a new GNN explainer dubbed ACGAN-GNNExplainer. Our approach leverages a generator to produce explanations for the original input graphs while incorporating a discriminator to oversee the generation process, ensuring explanation fidelity and improving accuracy. Experimental evaluations conducted on both synthetic and real-world graph datasets demonstrate the superiority of our proposed method compared to other existing GNN explainers.

LG · Dec 30, 2022
GANExplainer: GAN-based Graph Neural Networks Explainer

Yiqiao Li, Jianlong Zhou, Boyuan Zheng et al.

With the rapid deployment of graph neural network (GNN) based techniques into a wide range of applications such as link prediction, node classification, and graph classification, the explainability of GNNs has become an indispensable component for predictive and trustworthy decision-making. Thus, it is critical to explain why a GNN makes particular predictions for them to be believed in many applications. Some GNN explainers have been proposed recently. However, they fall short of generating accurate and real explanations. To mitigate these limitations, we propose GANExplainer, based on the Generative Adversarial Network (GAN) architecture. GANExplainer is composed of a generator to create explanations and a discriminator to assist with the generator's development. We investigate the explanation accuracy of our models by comparing the performance of GANExplainer with other state-of-the-art methods. Our empirical results on synthetic datasets indicate that GANExplainer improves explanation accuracy by up to 35% compared to its alternatives.

AI · Oct 27, 2023
The Innovation-to-Occupations Ontology: Linking Business Transformation Initiatives to Occupations and Skills

Daniela Elia, Fang Chen, Didar Zowghi et al.

The fast adoption of new technologies forces companies to continuously adapt their operations, making it harder to predict workforce requirements. Several recent studies have attempted to predict the emergence of new roles and skills in the labour market from online job ads. This paper presents a novel ontology linking business transformation initiatives to occupations and an approach to automatically populating it by leveraging embeddings extracted from job ads and Wikipedia pages on business transformation and emerging technologies topics. To our knowledge, no previous research explicitly links business transformation initiatives, like the adoption of new technologies or the entry into new markets, to the roles needed. Our approach successfully matches occupations to transformation initiatives under ten different scenarios, five linked to technology adoption and five related to business. This framework presents an innovative approach to guide enterprises and educational institutions on the workforce requirements for specific business transformation initiatives.

IV · Mar 23, 2023
Confidence-Driven Deep Learning Framework for Early Detection of Knee Osteoarthritis

Zhe Wang, Aladine Chetouani, Yung Hsin Chen et al.

Knee Osteoarthritis (KOA) is a prevalent musculoskeletal disorder that severely impacts mobility and quality of life, particularly among older adults. Its diagnosis often relies on subjective assessments using the Kellgren-Lawrence (KL) grading system, leading to variability in clinical evaluations. To address these challenges, we propose a confidence-driven deep learning framework for early KOA detection, focusing on distinguishing KL-0 and KL-2 stages. The Siamese-based framework integrates a novel multi-level feature extraction architecture with a hybrid loss strategy. Specifically, multi-level Global Average Pooling (GAP) layers are employed to extract features from varying network depths, ensuring comprehensive feature representation, while the hybrid loss strategy partitions training samples into high-, medium-, and low-confidence subsets. Tailored loss functions are applied to improve model robustness and effectively handle uncertainty in annotations. Experimental results on the Osteoarthritis Initiative (OAI) dataset demonstrate that the proposed framework achieves competitive accuracy, sensitivity, and specificity, comparable to those of expert radiologists. Cohen's kappa values (κ > 0.85) confirm substantial agreement, while McNemar's test (p > 0.05) indicates no statistically significant differences between the model and radiologists. Additionally, confidence distribution analysis reveals that the model emulates radiologists' decision-making patterns. These findings highlight the potential of the proposed approach to serve as an auxiliary diagnostic tool, enhancing early KOA detection and reducing clinical workload.
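
For readers unfamiliar with the agreement statistic quoted above, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two label lists; the function name and the toy labels are illustrative and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[lab] * count_b[lab] for lab in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy KL-0 vs KL-2 labels for ten knees (made-up values).
model_labels = [0, 0, 2, 2, 0, 2, 0, 2, 2, 0]
radiologist_labels = [0, 0, 2, 2, 0, 2, 0, 2, 0, 0]
kappa = cohens_kappa(model_labels, radiologist_labels)
```

Here the raters agree on 9 of 10 items (p_o = 0.9) and chance agreement is p_e = 0.5, so kappa is 0.8.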

CV · Aug 12, 2022
Instance Image Retrieval by Learning Purely From Within the Dataset

Zhongyan Zhang, Lei Wang, Yang Wang et al.

Quality feature representation is key to instance image retrieval. To attain it, existing methods usually resort to a deep model pre-trained on benchmark datasets or even fine-tune the model with a task-dependent labelled auxiliary dataset. Although achieving promising results, this approach is restricted by two issues: 1) the domain gap between benchmark datasets and the dataset of a given retrieval task; 2) the required auxiliary dataset cannot be readily obtained. In light of this situation, this work looks into a different approach which has not been well investigated for instance image retrieval previously: can we learn feature representation specific to a given retrieval task in order to achieve excellent retrieval? Our finding is encouraging. By adding an object proposal generator to generate image regions for self-supervised learning, the investigated approach can successfully learn feature representation specific to a given dataset for retrieval. This representation can be made even more effective by boosting it with image similarity information mined from the dataset. As experimentally validated, such a simple "self-supervised learning + self-boosting" approach can well compete with the relevant state-of-the-art retrieval methods. Ablation study is conducted to show the appealing properties of this approach and its limitation on generalisation across datasets.

LG · Jul 23, 2024
TransFeat-TPP: An Interpretable Deep Covariate Temporal Point Processes

Zizhuo Meng, Boyu Li, Xuhui Fan et al.

The classical temporal point process (TPP) constructs an intensity function by taking the occurrence times into account. Nevertheless, occurrence time may not be the only relevant factor; other contextual data, termed covariates, may also impact the event evolution. Incorporating such covariates into the model is beneficial, while distinguishing their relevance to the event dynamics is of great practical significance. In this work, we propose a Transformer-based covariate temporal point process (TransFeat-TPP) model to improve the interpretability of deep covariate-TPPs while maintaining powerful expressiveness. TransFeat-TPP can effectively model complex relationships between events and covariates, and provide enhanced interpretability by discerning the importance of various covariates. Experimental results on synthetic and real datasets demonstrate improved prediction accuracy and consistently interpretable feature importance when compared to existing deep covariate-TPPs.

NE · Jan 3, 2023
Genetic Imitation Learning by Reward Extrapolation

Boyuan Zheng, Jianlong Zhou, Fang Chen

Imitation learning demonstrates remarkable performance in various domains. However, imitation learning is also constrained by many prerequisites. The research community has done intensive research to alleviate these constraints, such as adding the stochastic policy to avoid unseen states, eliminating the need for action labels, and learning from suboptimal demonstrations. Inspired by the natural reproduction process, we proposed a method called GenIL that integrates the Genetic Algorithm with imitation learning. The involvement of the Genetic Algorithm improves the data efficiency by reproducing trajectories with various returns and assists the model in estimating more accurate and compact reward function parameters. We tested GenIL in both Atari and Mujoco domains, and the results show that it outperforms the previous extrapolation methods in extrapolation accuracy, robustness, and overall policy performance when input data is limited.

LG · Jan 3, 2023
Explaining Imitation Learning through Frames

Boyuan Zheng, Jianlong Zhou, Chunjie Liu et al.

As one of the prevalent methods to achieve automation systems, Imitation Learning (IL) presents promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by the recent approaches in explainable artificial intelligence methods, we proposed a model-agnostic explaining framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model from the randomized masked demonstrations and uses the conventional evaluation outcome, environment returns, as the coefficient to build an importance map. We also conducted experiments to investigate three major questions concerning frames' importance equality, the effectiveness of the importance map, and connections between importance maps from different IL models. The result shows that R2RISE successfully distinguishes important frames from the demonstrations.
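
The masking-and-scoring loop described above can be sketched in a few lines. This is a RISE-style illustration under stated assumptions, not the authors' R2RISE implementation: `train_and_evaluate` is a hypothetical stand-in for retraining the IL model on the masked demonstration and measuring its environment return, and normalising each frame's score by the number of masks that kept it is an assumed choice.

```python
import random
from typing import Callable, List, Sequence


def frame_importance(num_frames: int,
                     num_masks: int,
                     keep_prob: float,
                     train_and_evaluate: Callable[[Sequence[int]], float],
                     seed: int = 0) -> List[float]:
    """Score each frame by the mean return of the runs in which it was kept."""
    rng = random.Random(seed)
    weighted = [0.0] * num_frames  # sum of returns over masks keeping frame i
    kept = [0] * num_frames        # number of masks keeping frame i
    for _ in range(num_masks):
        # Random binary mask: 1 keeps the frame, 0 drops it.
        mask = [1 if rng.random() < keep_prob else 0 for _ in range(num_frames)]
        ret = train_and_evaluate(mask)  # hypothetical retrain + rollout step
        for i, m in enumerate(mask):
            if m:
                weighted[i] += ret
                kept[i] += 1
    return [w / k if k else 0.0 for w, k in zip(weighted, kept)]


def toy_return(mask: Sequence[int]) -> float:
    """Toy stand-in: the policy only performs well if frame 2 is available."""
    return 10.0 if mask[2] else 1.0


importance = frame_importance(num_frames=4, num_masks=200, keep_prob=0.5,
                              train_and_evaluate=toy_return)
```

In the toy setup frame 2 receives the highest score because every run that kept it earned the high return, while the other frames average over both outcomes.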

LG · Aug 14, 2024
Enhancing Adversarial Attacks via Parameter Adaptive Adversarial Attack

Zhibo Jin, Jiayu Zhang, Zhiyu Zhu et al.

In recent times, the swift evolution of adversarial attacks has captured widespread attention, particularly concerning their transferability and other performance attributes. These techniques are primarily executed at the sample level, frequently overlooking the intrinsic parameters of models. Such neglect suggests that the perturbations introduced in adversarial samples might have the potential for further reduction. Given that the essence of adversarial attacks is to impair model integrity with minimal noise on original samples, exploring avenues to maximize the utility of such perturbations is imperative. Against this backdrop, we have delved into the complexities of adversarial attack algorithms, dissecting the adversarial process into two critical phases: the Directional Supervision Process (DSP) and the Directional Optimization Process (DOP). While DSP determines the direction of updates based on the current samples and model parameters, it has been observed that existing model parameters may not always be conducive to adversarial attacks. The impact of models on adversarial efficacy is often overlooked in current research, leading to the neglect of DSP. For the first time, we propose that under certain conditions, fine-tuning model parameters can significantly improve the quality of the DSP. We provide, for the first time, rigorous mathematical definitions and proofs for these conditions, and introduce multiple methods for fine-tuning model parameters within DSP. Our extensive experiments substantiate the effectiveness of the proposed P3A method. Our code is accessible at: https://anonymous.4open.science/r/P3A-A12C/

CVDec 16, 2025
Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency

Jia Guo, Jiawei Du, Shengzhu Yang et al.

Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.

AIAug 22, 2024
Enhancing Transferability of Adversarial Attacks with GE-AdvGAN+: A Comprehensive Framework for Gradient Editing

Zhibo Jin, Jiayu Zhang, Zhiyu Zhu et al.

Transferable adversarial attacks pose significant threats to deep neural networks, particularly in black-box scenarios where internal model information is inaccessible. Studying adversarial attack methods helps advance the performance of defense mechanisms and explore model vulnerabilities. These methods can uncover and exploit weaknesses in models, promoting the development of more robust architectures. However, current methods for transferable attacks often come with substantial computational costs, limiting their deployment and application, especially in edge computing scenarios. Adversarial generative models, such as Generative Adversarial Networks (GANs), are characterized by their ability to generate samples without the need for retraining after an initial training phase. GE-AdvGAN, a recent method for transferable adversarial attacks, is based on this principle. In this paper, we propose a novel general framework for gradient editing-based transferable attacks, named GE-AdvGAN+, which integrates nearly all mainstream attack methods to enhance transferability while significantly reducing computational resource consumption. Our experiments demonstrate the compatibility and effectiveness of our framework. Compared to the baseline AdvGAN, our best-performing method, GE-AdvGAN++, achieves an average ASR improvement of 47.8. Additionally, it surpasses the latest competing algorithm, GE-AdvGAN, with an average ASR increase of 5.9. The framework also exhibits enhanced computational efficiency, achieving 2217.7 FPS, outperforming traditional methods such as BIM and MI-FGSM. The implementation code for our GE-AdvGAN+ framework is available at https://github.com/GEAdvGANP

LGAug 1, 2022
De-biased Representation Learning for Fairness with Unreliable Labels

Yixuan Zhang, Feng Zhou, Zhidong Li et al.

Removing bias while keeping all task-relevant information is challenging for fair representation learning methods since they would yield random or degenerate representations w.r.t. labels when the sensitive attributes correlate with labels. Existing works proposed to inject the label information into the learning procedure to overcome such issues. However, the assumption that the observed labels are clean is not always met. In fact, label bias is acknowledged as the primary source inducing discrimination. In other words, the fair pre-processing methods ignore the discrimination encoded in the labels either during the learning procedure or the evaluation stage. This contradiction puts a question mark on the fairness of the learned representations. To circumvent this issue, we explore the following question: Can we learn fair representations predictable to latent ideal fair labels given only access to unreliable labels? In this work, we propose a De-Biased Representation Learning for Fairness (DBRF) framework which disentangles the sensitive information from non-sensitive attributes whilst keeping the learned representations predictable to ideal fair labels rather than observed biased ones. We formulate the de-biased learning framework through information-theoretic concepts such as mutual information and information bottleneck. The core concept is that DBRF advocates not to use unreliable labels for supervision when sensitive information benefits the prediction of unreliable labels. Experiment results over both synthetic and real-world data demonstrate that DBRF effectively learns de-biased representations towards ideal labels.

IVSep 19, 2024
Beyond Uncertainty Quantification: Learning Uncertainty for Trust-Informed Neural Network Decisions - A Case Study in COVID-19 Classification

Hassan Gharoun, Mohammad Sadegh Khorshidi, Fang Chen et al.

Reliable uncertainty quantification is critical in high-stakes applications, such as medical diagnosis, where confidently incorrect predictions can erode trust in automated decision-making systems. Traditional uncertainty quantification methods rely on a predefined confidence threshold to classify predictions as confident or uncertain. However, this approach assumes that predictions exceeding the threshold are trustworthy, while those below it are uncertain, without explicitly assessing the correctness of high-confidence predictions. As a result, confidently incorrect predictions may still occur, leading to misleading uncertainty assessments. To address this limitation, this study proposed an uncertainty-aware stacked neural network, which extends conventional uncertainty quantification by learning when predictions should be trusted. The framework consists of a two-tier model: the base model generates predictions with uncertainty estimates, while the meta-model learns to assign a trust flag, distinguishing confidently correct cases from those requiring expert review. The proposed approach is evaluated against the traditional threshold-based method across multiple confidence thresholds and pre-trained architectures using the COVIDx CXR-4 dataset. Results demonstrate that the proposed framework significantly reduces confidently incorrect predictions, offering a more trustworthy and efficient decision-support system for high-stakes domains.
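The limitation of the threshold-based baseline described above can be made concrete with a small sketch that counts "confidently incorrect" predictions, i.e. predictions whose confidence clears the threshold but whose label is wrong. The function name and the toy numbers are hypothetical and are not taken from the paper or from COVIDx CXR-4; the sketch illustrates only the baseline, not the proposed stacked meta-model.

```python
def confidently_incorrect(confidence, predicted, actual, threshold):
    """Count predictions at or above the confidence threshold that are wrong."""
    return sum(
        1
        for c, p, a in zip(confidence, predicted, actual)
        if c >= threshold and p != a
    )

# Hypothetical toy data: four predictions with their confidences.
confidence = [0.95, 0.60, 0.99, 0.80]
predicted = [1, 0, 1, 0]
actual = [1, 0, 0, 1]

strict = confidently_incorrect(confidence, predicted, actual, threshold=0.9)
loose = confidently_incorrect(confidence, predicted, actual, threshold=0.7)
```

Even at the strict threshold one wrong prediction is still flagged as confident, which is the failure mode the abstract says a learned trust flag is meant to reduce.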

CVSep 2, 2024
IVGF: The Fusion-Guided Infrared and Visible General Framework

Fangcen Liu, Chenqiang Gao, Fang Chen et al.

Infrared and visible dual-modality tasks such as semantic segmentation and object detection can achieve robust performance even in extreme scenes by fusing complementary information. Most current methods design task-specific frameworks, which are limited in generalization across multiple tasks. In this paper, we propose a fusion-guided infrared and visible general framework, IVGF, which can be easily extended to many high-level vision tasks. First, we adopt SOTA infrared and visible foundation models to extract general representations. Then, to enrich the semantic information of these general representations for high-level vision tasks, we design a feature enhancement module and a token enhancement module for feature maps and tokens, respectively. In addition, an attention-guided fusion module is proposed to fuse the two modalities effectively by exploring their complementary information. Moreover, we adopt the cutout&mix augmentation strategy for data augmentation, which further improves the ability of the model to mine the regional complementarity between the two modalities. Extensive experiments show that IVGF outperforms state-of-the-art dual-modality methods in the semantic segmentation and object detection tasks. Detailed ablation studies demonstrate the effectiveness of each module, and a further experiment explores the robustness of the proposed method to a missing modality in the dual-modality semantic segmentation task.

CLDec 16, 2024Code
LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts

Zhuhao Wang, Yihua Sun, Zihan Li et al.

Drafting radiology reports is a complex task requiring flexibility, where radiologists tailor content to available information and particular clinical demands. However, most current radiology report generation (RRG) models are constrained to a fixed task paradigm, such as predicting the full "finding" section from a single image, inherently involving a mismatch between inputs and outputs. The trained models lack the flexibility for diverse inputs and could generate harmful, input-agnostic hallucinations. To bridge the gap between current RRG models and the clinical demands in practice, we first develop a data generation pipeline to create a new MIMIC-RG4 dataset, which considers four common radiology report drafting scenarios and has perfectly corresponding inputs and outputs. Secondly, we propose a novel large language model (LLM) based RRG framework, namely LLM-RG4, which utilizes the LLM's flexible instruction-following capabilities and extensive general knowledge. We further develop an adaptive token fusion module that offers flexibility to handle diverse scenarios with different input combinations, while minimizing the additional computational burden associated with increased input volumes. Besides, we propose a token-level loss weighting strategy to direct the model's attention towards positive and uncertain descriptions. Experimental results demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical efficiency and natural language generation on the MIMIC-RG4 and MIMIC-CXR datasets. We quantitatively demonstrate that our model has minimal input-agnostic hallucinations, whereas current open-source models commonly suffer from this problem.

LGNov 8, 2024Code
RED: Residual Estimation Diffusion for Low-Dose PET Sinogram Reconstruction

Xingyu Ai, Bin Huang, Fang Chen et al.

Recent advances in diffusion models have demonstrated exceptional performance in generative tasks across various fields. In positron emission tomography (PET), the reduction in tracer dose leads to information loss in sinograms. Using diffusion models to reconstruct missing information can improve imaging quality. Traditional diffusion models effectively use Gaussian noise for image reconstructions. However, in low-dose PET reconstruction, Gaussian noise can worsen the already sparse data by introducing artifacts and inconsistencies. To address this issue, we propose a diffusion model named residual estimation diffusion (RED). From the perspective of diffusion mechanism, RED uses the residual between sinograms to replace Gaussian noise in diffusion process, respectively sets the low-dose and full-dose sinograms as the starting point and endpoint of reconstruction. This mechanism helps preserve the original information in the low-dose sinogram, thereby enhancing reconstruction reliability. From the perspective of data consistency, RED introduces a drift correction strategy to reduce accumulated prediction errors during the reverse process. Calibrating the intermediate results of reverse iterations helps maintain the data consistency and enhances the stability of reconstruction process. Experimental results show that RED effectively improves the quality of low-dose sinograms as well as the reconstruction results. The code is available at: https://github.com/yqx7150/RED.

CVFeb 16, 2025Code
Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability

Zhiyu Zhu, Zhibo Jin, Jiayu Zhang et al.

The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the Narrowing Information Bottleneck Theory, a novel framework that fundamentally redefines the traditional bottleneck approach. This theory is specifically designed to satisfy contemporary attribution axioms, providing a more robust and reliable solution for improving the interpretability of multimodal models. In our experiments, compared to state-of-the-art methods, our approach enhances image interpretability by an average of 9%, text interpretability by an average of 58.83%, and accelerates processing speed by 63.95%. Our code is publicly accessible at https://github.com/LMBTough/NIB.

IVFeb 26, 2023
Key-Exchange Convolutional Auto-Encoder for Data Augmentation in Early Knee Osteoarthritis Detection

Zhe Wang, Aladine Chetouani, Mohamed Jarraya et al.

Knee Osteoarthritis (KOA) is a common musculoskeletal condition that significantly affects mobility and quality of life, particularly in elderly populations. However, training deep learning models for early KOA classification is often hampered by the limited availability of annotated medical datasets, owing to the high costs and labour-intensive nature of data labelling. Traditional data augmentation techniques, while useful, rely on simple transformations and fail to introduce sufficient diversity into the dataset. To address these challenges, we propose the Key-Exchange Convolutional Auto-Encoder (KECAE) as an innovative Artificial Intelligence (AI)-based data augmentation strategy for early KOA classification. Our model employs a convolutional autoencoder with a novel key-exchange mechanism that generates synthetic images by selectively exchanging key pathological features between X-ray images, which not only diversifies the dataset but also ensures the clinical validity of the augmented data. A hybrid loss function is introduced to supervise feature learning and reconstruction, integrating multiple components, including reconstruction, supervision, and feature separation losses. Experimental results demonstrate that the KECAE-generated data significantly improve the performance of KOA classification models, with accuracy gains of up to 1.98% across various standard and state-of-the-art architectures. Furthermore, a clinical validation study involving expert radiologists confirms the anatomical plausibility and diagnostic realism of the synthetic outputs. These findings highlight the potential of KECAE as a robust tool for augmenting medical datasets in early KOA detection.

ROMay 7
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

Yushan Liu, Peibo Sun, Shoujie Li et al.

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.

IVJun 20, 2025Code
A Prior-Guided Joint Diffusion Model in Projection Domain for PET Tracer Conversion

Fang Chen, Weifeng Zhang, Xingyu Ai et al.

Positron emission tomography (PET) is widely used to assess metabolic activity, but its application is limited by the availability of radiotracers. 18F-labeled fluorodeoxyglucose (18F-FDG) is the most commonly used tracer but shows limited effectiveness for certain tumors. In contrast, 6-18F-fluoro-3,4-dihydroxy-L-phenylalanine (18F-DOPA) offers higher specificity for neuroendocrine tumors and neurological disorders. However, the complexity of its synthesis process and constraints on transportation time have limited its clinical application. Among different forms of raw data acquired by the scanner, sinogram is a commonly used representation in PET imaging. Therefore, modeling in projection domain enables more direct utilization of the original information, potentially reducing the accumulation errors during the image reconstruction process. Inspired by these factors, this study proposes a prior-guided joint diffusion model (PJDM) for transforming 18F-FDG PET sinograms into 18F-DOPA PET sinograms. During inference, an initial synthetic 18F-DOPA PET sinogram is first generated using a higher-order hybrid sampler. This sinogram is then degraded and serves as an additional condition to guide the iterative refinement process. Experimental results demonstrated that PJDM effectively improved both sinogram quality and the final synthetic outcomes. The code is available at: https://github.com/yqx7150/PJDM.

IVJun 18, 2025Code
Diffusion-based Counterfactual Augmentation: Towards Robust and Interpretable Knee Osteoarthritis Grading

Zhe Wang, Yuhua Ru, Aladine Chetouani et al.

Automated grading of Knee Osteoarthritis (KOA) from radiographs is challenged by significant inter-observer variability and the limited robustness of deep learning models, particularly near critical decision boundaries. To address these limitations, this paper proposes a novel framework, Diffusion-based Counterfactual Augmentation (DCA), which enhances model robustness and interpretability by generating targeted counterfactual examples. The method navigates the latent space of a diffusion model using a Stochastic Differential Equation (SDE), governed by balancing a classifier-informed boundary drive with a manifold constraint. The resulting counterfactuals are then used within a self-corrective learning strategy to improve the classifier by focusing on its specific areas of uncertainty. Extensive experiments on the public Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) datasets demonstrate that this approach significantly improves classification accuracy across multiple model architectures. Furthermore, the method provides interpretability by visualizing minimal pathological changes and revealing that the learned latent space topology aligns with clinical knowledge of KOA progression. The DCA framework effectively converts model uncertainty into a robust training signal, offering a promising pathway to developing more accurate and trustworthy automated diagnostic systems. Our code is available at https://github.com/ZWang78/DCA.

LGMay 3, 2025Code
ABE: A Unified Framework for Robust and Faithful Attribution-Based Explainability

Zhiyu Zhu, Jiayu Zhang, Zhibo Jin et al.

Attribution algorithms are essential for enhancing the interpretability and trustworthiness of deep learning models by identifying key features driving model decisions. Existing frameworks, such as InterpretDL and OmniXAI, integrate multiple attribution methods but suffer from scalability limitations, high coupling, theoretical constraints, and lack of user-friendly implementations, hindering neural network transparency and interoperability. To address these challenges, we propose Attribution-Based Explainability (ABE), a unified framework that formalizes Fundamental Attribution Methods and integrates state-of-the-art attribution algorithms while ensuring compliance with attribution axioms. ABE enables researchers to develop novel attribution techniques and enhances interpretability through four customizable modules: Robustness, Interpretability, Validation, and Data & Model. This framework provides a scalable, extensible foundation for advancing attribution-based explainability and fostering transparent AI systems. Our code is available at: https://github.com/LMBTough/ABE-XAI.

IVApr 9, 2025Code
MoEDiff-SR: Mixture of Experts-Guided Diffusion Model for Region-Adaptive MRI Super-Resolution

Zhe Wang, Yuhua Ru, Aladine Chetouani et al.

Magnetic Resonance Imaging (MRI) at lower field strengths (e.g., 3T) suffers from limited spatial resolution, making it challenging to capture fine anatomical details essential for clinical diagnosis and neuroimaging research. To overcome this limitation, we propose MoEDiff-SR, a Mixture of Experts (MoE)-guided diffusion model for region-adaptive MRI Super-Resolution (SR). Unlike conventional diffusion-based SR models that apply a uniform denoising process across the entire image, MoEDiff-SR dynamically selects specialized denoising experts at a fine-grained token level, ensuring region-specific adaptation and enhanced SR performance. Specifically, our approach first employs a Transformer-based feature extractor to compute multi-scale patch embeddings, capturing both global structural information and local texture details. The extracted feature embeddings are then fed into an MoE gating network, which assigns adaptive weights to multiple diffusion-based denoisers, each specializing in different brain MRI characteristics, such as centrum semiovale, sulcal and gyral cortex, and grey-white matter junction. The final output is produced by aggregating the denoised results from these specialized experts according to dynamically assigned gating probabilities. Experimental results demonstrate that MoEDiff-SR outperforms existing state-of-the-art methods in terms of quantitative image quality metrics, perceptual fidelity, and computational efficiency. Difference maps from each expert further highlight their distinct specializations, confirming the effective region-specific denoising capability and the interpretability of expert contributions. Additionally, clinical evaluation validates its superior diagnostic capability in identifying subtle pathological features, emphasizing its practical relevance in clinical neuroimaging. Our code is available at https://github.com/ZWang78/MoEDiff-SR.
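The final aggregation step described above (expert outputs combined according to gating probabilities) can be sketched as a per-token softmax over gate logits followed by a weighted sum. This is a minimal sketch under stated assumptions, not the MoEDiff-SR code: the function names, the list-based tensor layout, and the toy values are hypothetical, and the abstract does not say the gate uses a softmax, so that choice is an assumption too.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_experts(gate_logits, expert_outputs):
    """gate_logits[t][e] scores expert e for token t;
    expert_outputs[e][t] is expert e's feature vector for token t."""
    fused = []
    for t, logits in enumerate(gate_logits):
        probs = softmax(logits)  # assumed gating probabilities for this token
        dim = len(expert_outputs[0][t])
        fused.append([
            sum(p * expert_outputs[e][t][d] for e, p in enumerate(probs))
            for d in range(dim)
        ])
    return fused

# Hypothetical toy case: 2 tokens, 3 experts, 2 feature dimensions.
gate_logits = [[0.0, 0.0, 0.0], [10.0, -10.0, -10.0]]
expert_outputs = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 3.0], [3.0, 3.0]],
    [[6.0, 6.0], [6.0, 6.0]],
]
fused = aggregate_experts(gate_logits, expert_outputs)
```

With equal logits the first token is the plain average of the experts, while the second token is dominated by the expert with the much larger logit.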

CVJun 6, 2024Code
ReDistill: Residual Encoded Distillation for Peak Memory Reduction of CNNs

Fang Chen, Gourav Datta, Mujahid Al Rafi et al.

The expansion of neural network sizes and the enhanced resolution of modern image sensors result in heightened memory and power demands to process modern computer vision models. In order to deploy these models in extremely resource-constrained edge devices, it is crucial to reduce their peak memory, which is the maximum memory consumed during the execution of a model. A naive approach to reducing peak memory is aggressive down-sampling of feature maps via pooling with large stride, which often results in unacceptable degradation in network performance. To mitigate this problem, we propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling. We apply our distillation method to multiple problems in computer vision, including image classification and diffusion-based image generation. For image classification, our method yields 4x-5x theoretical peak memory reduction with less degradation in accuracy for most CNN-based architectures. For diffusion-based image generation, our proposed distillation method yields a denoising network with 4x lower theoretical peak memory while maintaining decent diversity and fidelity for image generation. Experiments demonstrate our method's superior performance compared to other feature-based and response-based distillation methods when applied to the same student network. The code is available at https://github.com/mengtang-lab/ReDistill.
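To illustrate why aggressive early pooling lowers peak memory, the sketch below estimates a theoretical activation peak for a purely sequential network. The accounting rule is a simplifying assumption, not the paper's definition: it counts only the input and output feature maps that must be live while one layer runs, and it ignores weights and skip connections. The function name and the layer shapes are hypothetical.

```python
import math

def peak_activation_elements(shapes):
    """shapes[i] is the (channels, height, width) of the i-th feature map.
    Peak = largest sum of a layer's input and output element counts."""
    sizes = [math.prod(shape) for shape in shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))

# Hypothetical teacher: full resolution is kept after the first layer.
teacher = [(3, 32, 32), (16, 32, 32), (32, 16, 16)]
# Hypothetical student: large-stride pooling right after the first layer.
student = [(3, 32, 32), (16, 8, 8), (32, 4, 4)]

teacher_peak = peak_activation_elements(teacher)
student_peak = peak_activation_elements(student)
reduction = teacher_peak / student_peak
```

In this toy configuration the student's peak is several times smaller than the teacher's; the accuracy cost of such pooling is what the abstract says the distillation is meant to recover.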

IVJan 30, 2025Code
Distillation-Driven Diffusion Model for Multi-Scale MRI Super-Resolution: Make 1.5T MRI Great Again

Zhe Wang, Yuhua Ru, Fabian Bauer et al.

Magnetic Resonance Imaging (MRI) offers critical insights into microstructural details; however, the spatial resolution of standard 1.5T imaging systems is often limited. In contrast, 7T MRI provides significantly enhanced spatial resolution, enabling finer visualization of anatomical structures. Despite this, the high cost and limited availability of 7T MRI hinder its widespread use in clinical settings. To address this challenge, a novel Super-Resolution (SR) model is proposed to generate 7T-like MRI from standard 1.5T MRI scans. Our approach leverages a diffusion-based architecture, incorporating gradient nonlinearity correction and bias field correction data from 7T imaging as guidance. Moreover, to improve deployability, a progressive distillation strategy is introduced. Specifically, the student model refines the 7T SR task in steps, leveraging feature maps from the inference phase of the teacher model as guidance, so that the student model progressively approaches 7T SR performance with a smaller, deployable model size. Experimental results demonstrate that our baseline teacher model achieves state-of-the-art SR performance. The student model, while lightweight, sacrifices minimal performance. Furthermore, the student model is capable of accepting MRI inputs at varying resolutions without the need for retraining, further enhancing deployment flexibility. The clinical relevance of our proposed method is validated using clinical data from Massachusetts General Hospital. Our code is available at https://github.com/ZWang78/SR.

AIDec 27, 2024Code
Attribution for Enhanced Explanation with Transferable Adversarial eXploration

Zhiyu Zhu, Jiayu Zhang, Zhibo Jin et al.

The interpretability of deep neural networks is crucial for understanding model decisions in various applications, including computer vision. AttEXplore++, an advanced framework built upon AttEXplore, enhances attribution by incorporating transferable adversarial attack methods such as MIG and GRA, significantly improving the accuracy and robustness of model explanations. We conduct extensive experiments on five models, including CNNs (Inception-v3, ResNet-50, VGG16) and vision transformers (MaxViT-T, ViT-B/16), using the ImageNet dataset. Our method achieves an average performance improvement of 7.57% over AttEXplore and 32.62% compared to other state-of-the-art interpretability algorithms. Using insertion and deletion scores as evaluation metrics, we show that adversarial transferability plays a vital role in enhancing attribution results. Furthermore, we explore the impact of randomness, perturbation rate, noise amplitude, and diversity probability on attribution performance, demonstrating that AttEXplore++ provides more stable and reliable explanations across various models. We release our code at: https://anonymous.4open.science/r/ATTEXPLOREP-8435/

AIOct 16, 2020Code
Deep-HOSeq: Deep Higher Order Sequence Fusion for Multimodal Sentiment Analysis

Sunny Verma, Jiwei Wang, Zhefeng Ge et al.

Multimodal sentiment analysis utilizes multiple heterogeneous modalities for sentiment classification. Recent multimodal fusion schemes customize LSTMs to discover intra-modal dynamics and design sophisticated attention mechanisms to discover the inter-modal dynamics from multimodal sequences. Although powerful, these schemes rely completely on attention mechanisms, which is problematic due to two major drawbacks: 1) deceptive attention masks, and 2) training dynamics. Moreover, strenuous efforts are required to optimize the hyperparameters of these consolidated architectures, in particular their custom-designed LSTMs constrained by attention schemes. In this research, we first propose a common network to discover both intra-modal and inter-modal dynamics by utilizing basic LSTMs and tensor-based convolution networks. We then propose unique networks to encapsulate temporal granularity among the modalities, which is essential when extracting information within asynchronous sequences. We then integrate these two kinds of information via a fusion layer and call our novel multimodal fusion scheme Deep-HOSeq (Deep network with higher order Common and Unique Sequence information). The proposed Deep-HOSeq efficiently discovers all-important information from multimodal sequences, and the effectiveness of utilizing both types of information is empirically demonstrated on the CMU-MOSEI and CMU-MOSI benchmark datasets. The source code of our proposed Deep-HOSeq is available at https://github.com/sverma88/Deep-HOSeq--ICDM-2020.

CVMay 23, 2024
Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

Jia Guo, Shuai Lu, Weihang Zhang et al.

Recent studies highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images. Despite various advancements addressing this challenging task, the detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly, which leverages pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisting of only Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Foundation Transformers that extract universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across popular anomaly detection benchmarks including MVTec-AD, VisA, and Real-IAD. Our proposed Dinomaly achieves impressive image-level AUROC of 99.6%, 98.7%, and 89.3% on the three datasets respectively, which is not only superior to state-of-the-art multi-class UAD methods, but also achieves the most advanced class-separated UAD records.

CVDec 12, 2025
RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection

Rongcheng Wu, Hao Zhu, Shiying Zhang et al.

Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage these reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods and achieves performance on par with recent diffusion models while using only 10% of their parameters and offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.

CLMar 12, 2025
A Survey on Enhancing Causal Reasoning Ability of Large Language Models

Xin Li, Zhuo Cai, Shoujin Wang et al.

Large language models (LLMs) have recently shown remarkable performance in language tasks and beyond. However, due to their limited inherent causal reasoning ability, LLMs still face challenges in handling tasks that require robust causal reasoning ability, such as health-care and economic analysis. As a result, a growing body of research has focused on enhancing the causal reasoning ability of LLMs. Despite the booming research, a survey that thoroughly reviews the challenges, progress and future directions in this area is still lacking. To bridge this significant gap, we systematically review literature on how to strengthen LLMs' causal reasoning ability in this paper. We start with an introduction to the background and motivations of this topic, followed by a summary of the key challenges in this area. Thereafter, we propose a novel taxonomy to systematically categorise existing methods, together with detailed comparisons within and between classes of methods. Furthermore, we summarise existing benchmarks and evaluation metrics for assessing LLMs' causal reasoning ability. Finally, we outline future research directions for this emerging field, offering insights and inspiration to researchers and practitioners in the area.

SINov 22, 2024
Can GNNs Learn Link Heuristics? A Concise Review and Evaluation of Link Prediction Methods

Shuming Liang, Yu Ding, Zhidong Li et al.

This paper explores the ability of Graph Neural Networks (GNNs) to learn various forms of information for link prediction, alongside a brief review of existing link prediction methods. Our analysis reveals that GNNs cannot effectively learn structural information related to the number of common neighbors between two nodes, primarily due to the nature of set-based pooling of the neighborhood aggregation scheme. Also, our extensive experiments indicate that trainable node embeddings can improve the performance of GNN-based link prediction models. Importantly, we observe that the denser the graph, the greater the improvement. We attribute this to the characteristics of node embeddings, where the link state of each link sample could be encoded into the embeddings of nodes that are involved in the neighborhood aggregation of the two nodes in that link sample. In denser graphs, every node could have more opportunities to participate in the neighborhood aggregation of other nodes and encode the states of more link samples into its embedding, thus learning better node embeddings for link prediction. Lastly, we demonstrate that the insights gained from our research carry important implications in identifying the limitations of existing link prediction methods, which could guide the future development of more robust algorithms.
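
The common-neighbor limitation described in this abstract can be illustrated with a toy sketch. The graph, the constant node features, and the helper names `common_neighbors` and `mean_aggregate` are assumptions made for illustration only; they are not the paper's models or experiments, and mean pooling is used here as one simple instance of set-based neighborhood aggregation.

```python
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]

def common_neighbors(graph: Graph, u: int, v: int) -> int:
    """Classic link heuristic: number of neighbors shared by u and v."""
    return len(graph[u] & graph[v])

def mean_aggregate(graph: Graph, features: Dict[int, List[float]], node: int) -> List[float]:
    """Mean pooling over a node's neighbors (one round of aggregation)."""
    neighbors = graph[node]
    dim = len(next(iter(features.values())))
    if not neighbors:
        return [0.0] * dim
    return [sum(features[n][d] for n in neighbors) / len(neighbors) for d in range(dim)]

# Hypothetical toy graph: nodes 0 and 1 share two neighbors, nodes 0 and 4 share none.
GRAPH: Graph = {
    0: {2, 3},
    1: {2, 3},
    2: {0, 1},
    3: {0, 1},
    4: {5, 6},
    5: {4},
    6: {4},
}
# Every node carries the same feature vector, so only structure could tell nodes apart.
FEATURES = {node: [1.0] for node in GRAPH}

print(common_neighbors(GRAPH, 0, 1), common_neighbors(GRAPH, 0, 4))
print(mean_aggregate(GRAPH, FEATURES, 1), mean_aggregate(GRAPH, FEATURES, 4))
```

In this toy case nodes 1 and 4 receive identical pooled representations even though their common-neighbor counts with node 0 differ, which is the kind of information loss the abstract attributes to set-based pooling.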

LGJan 11, 2024
Semantic-Preserving Feature Partitioning for Multi-View Ensemble Learning

Mohammad Sadegh Khorshidi, Navid Yazdanjue, Hassan Gharoun et al.

In machine learning, the exponential growth of data and the associated "curse of dimensionality" pose significant challenges, particularly with expansive yet sparse datasets. Addressing these challenges, multi-view ensemble learning (MEL) has emerged as a transformative approach, with feature partitioning (FP) playing a pivotal role in constructing artificial views for MEL. Our study introduces the Semantic-Preserving Feature Partitioning (SPFP) algorithm, a novel method grounded in information theory. The SPFP algorithm effectively partitions datasets into multiple semantically consistent views, enhancing the MEL process. Through extensive experiments on eight real-world datasets, ranging from high-dimensional with limited instances to low-dimensional with high instances, our method demonstrates notable efficacy. It maintains model accuracy while significantly improving uncertainty measures in scenarios where high generalization performance is achievable. Conversely, it retains uncertainty metrics while enhancing accuracy where high generalization accuracy is less attainable. An effect size analysis further reveals that the SPFP algorithm outperforms benchmark models by a large effect size and reduces computational demands through effective dimensionality reduction. The substantial effect sizes observed in most experiments underscore the algorithm's significant improvements in model performance.

LGApr 1
Neural Federated Learning for Livestock Growth Prediction

Shoujin Wang, Mingze Ni, Wei Liu et al.

Livestock growth prediction is essential for optimising farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored due to limited large-scale datasets and privacy concerns surrounding farm-level data. Existing biophysical models rely on fixed formulations, while most machine learning approaches are trained on small, isolated datasets, limiting their robustness and generalisability. To address these challenges, we propose LivestockFL, the first federated learning framework specifically designed for livestock growth prediction. LivestockFL enables collaborative model training across distributed farms without sharing raw data, thereby preserving data privacy while alleviating data sparsity, particularly for farms with limited historical records. The framework employs a neural architecture based on a Gated Recurrent Unit combined with a multilayer perceptron to model temporal growth patterns from historical weight records and auxiliary features. We further introduce LivestockPFL, a novel personalised federated learning framework that extends LivestockFL with a personalised prediction head trained on each farm's local data, producing farm-specific predictors. Experiments on a real-world dataset demonstrate the effectiveness and practicality of the proposed approaches.
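
As a rough sketch of the idea of sharing model parameters rather than raw data while keeping a farm-specific head, the snippet below assumes standard size-weighted federated averaging (FedAvg) over the shared parameters. The abstract does not state which aggregation rule LivestockFL uses, so the rule, the parameter names `body` and `head`, and the helpers `federated_average` and `personalised_update` are all hypothetical.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]

def federated_average(
    client_params: Sequence[Params],
    client_sizes: Sequence[int],
    shared_keys: Sequence[str],
) -> Params:
    """Size-weighted average of the shared parameters (FedAvg-style, assumed)."""
    total = sum(client_sizes)
    averaged: Params = {}
    for key in shared_keys:
        length = len(client_params[0][key])
        averaged[key] = [
            sum(params[key][i] * size for params, size in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged

def personalised_update(local: Params, global_shared: Params) -> Params:
    """Overwrite shared parameters with the global ones; keep the local head untouched."""
    merged = dict(local)
    merged.update(global_shared)
    return merged

# Two hypothetical farms with 1 and 3 local records respectively.
farm_a: Params = {"body": [0.0, 4.0], "head": [1.0]}
farm_b: Params = {"body": [4.0, 0.0], "head": [-1.0]}

shared = federated_average([farm_a, farm_b], [1, 3], shared_keys=["body"])
print(shared)
print(personalised_update(farm_a, shared))
```

Only parameter values cross the farm boundary here; each farm's records and its `head` stay local, which mirrors the privacy and personalisation goals the abstract describes.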

NESep 16, 2025
From Embeddings to Equations: Genetic-Programming Surrogates for Interpretable Transformer Classification

Mohammad Sadegh Khorshidi, Navid Yazdanjue, Hassan Gharoun et al.

We study symbolic surrogate modeling of frozen Transformer embeddings to obtain compact, auditable classifiers with calibrated probabilities. For five benchmarks (SST2G, 20NG, MNIST, CIFAR10, MSC17), embeddings from ModernBERT, DINOv2, and SigLIP are partitioned on the training set into disjoint, information-preserving views via semantic-preserving feature partitioning (SPFP). A cooperative multi-population genetic program (MEGP) then learns additive, closed-form logit programs over these views. Across 30 runs per dataset we report F1, AUC, log-loss, Brier, expected calibration error (ECE), and symbolic complexity; a canonical model is chosen by a one-standard-error rule on validation F1 with a parsimony tie-break. Temperature scaling fitted on validation yields substantial ECE reductions on test. The resulting surrogates achieve strong discrimination (up to F1 around 0.99 on MNIST, CIFAR10, MSC17; around 0.95 on SST2G), while 20NG remains most challenging. We provide reliability diagrams, dimension usage and overlap statistics, contribution-based importances, and global effect profiles (PDP and ALE), demonstrating faithful, cross-modal explanations grounded in explicit programs.
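
The calibration step mentioned in this abstract (temperature scaling followed by an ECE measurement) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the grid search over temperatures, the ten equal-width confidence bins, the toy logits, and the function names are not taken from the paper, and for brevity the toy example measures ECE on the same samples used for fitting, whereas the abstract fits on validation data and reports on test data.

```python
import math
from typing import List, Optional, Sequence

def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature: float) -> float:
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid: Optional[Sequence[float]] = None) -> float:
    """Pick the temperature with the lowest NLL from a grid (assumed search strategy)."""
    if grid is None:
        grid = [0.25 + 0.05 * i for i in range(96)]  # 0.25 up to 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def expected_calibration_error(logit_rows, labels, temperature: float = 1.0, n_bins: int = 10) -> float:
    """ECE with equal-width bins over the top-class confidence."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, summed confidence, summed correctness
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        confidence = max(probs)
        predicted = probs.index(confidence)
        b = min(int(confidence * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += confidence
        bins[b][2] += float(predicted == y)
    n = len(labels)
    return sum(abs(conf_sum - correct_sum) / n for count, conf_sum, correct_sum in bins if count)

# Toy overconfident classifier: about 98% confidence but only 75% accuracy.
LOGITS = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
LABELS = [0, 0, 0, 1]

temperature = fit_temperature(LOGITS, LABELS)
ece_before = expected_calibration_error(LOGITS, LABELS)
ece_after = expected_calibration_error(LOGITS, LABELS, temperature)
print(temperature, ece_before, ece_after)
```

Because dividing logits by a positive temperature never changes the predicted class, this step leaves accuracy and F1 untouched and only reshapes the confidence values.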

NESep 16, 2025
Multi-population Ensemble Genetic Programming via Cooperative Coevolution and Multi-view Learning for Classification

Mohammad Sadegh Khorshidi, Navid Yazdanjue, Hassan Gharoun et al.

This paper introduces Multi-population Ensemble Genetic Programming (MEGP), a computational intelligence framework that integrates cooperative coevolution and the multiview learning paradigm to address classification challenges in high-dimensional and heterogeneous feature spaces. MEGP decomposes the input space into conditionally independent feature subsets, enabling multiple subpopulations to evolve in parallel while interacting through a dynamic ensemble-based fitness mechanism. Each individual encodes multiple genes whose outputs are aggregated via a differentiable softmax-based weighting layer, enhancing both model interpretability and adaptive decision fusion. A hybrid selection mechanism incorporating both isolated and ensemble-level fitness promotes inter-population cooperation while preserving intra-population diversity. This dual-level evolutionary dynamic facilitates structured search exploration and reduces premature convergence. Experimental evaluations across eight benchmark datasets demonstrate that MEGP consistently outperforms a baseline GP model in terms of convergence behavior and generalization performance. Comprehensive statistical analyses validate significant improvements in Log-Loss, Precision, Recall, F1 score, and AUC. MEGP also exhibits robust diversity retention and accelerated fitness gains throughout evolution, highlighting its effectiveness for scalable, ensemble-driven evolutionary learning. By unifying population-based optimization, multi-view representation learning, and cooperative coevolution, MEGP contributes a structurally adaptive and interpretable framework that advances emerging directions in evolutionary machine learning.

CYJun 2, 2025
Explainable AI Systems Must Be Contestable: Here's How to Make It Happen

Catarina Moreira, Anna Palatkina, Dacia Braca et al.

As AI regulations around the world intensify their focus on system safety, contestability has become a mandatory, yet ill-defined, safeguard. In XAI, "contestability" remains an empty promise: no formal definition exists, no algorithm guarantees it, and practitioners lack concrete guidance to satisfy regulatory requirements. Grounded in a systematic literature review, this paper presents the first rigorous formal definition of contestability in explainable AI, directly aligned with stakeholder requirements and regulatory mandates. We introduce a modular framework of by-design and post-hoc mechanisms spanning human-centered interfaces, technical architectures, legal processes, and organizational workflows. To operationalize our framework, we propose the Contestability Assessment Scale, a composite metric built on more than twenty quantitative criteria. Through multiple case studies across diverse application domains, we reveal where state-of-the-art systems fall short and show how our framework drives targeted improvements. By converting contestability from regulatory theory into a practical framework, our work equips practitioners with the tools to embed genuine recourse and accountability into AI systems.

HCFeb 28, 2025
Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?

Bo Wang, Yiqiao Li, Jianlong Zhou et al.

EXplainable machine learning (XML) has recently emerged to address the opaque mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Despite the development of various explanation methods, it remains unclear which XML method is most suitable for a specific ML context, highlighting the need for effective evaluation of explanations. The evaluating capabilities of the Transformer-based large language model (LLM) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.

LGDec 15, 2024
Navigating Towards Fairness with Data Selection

Yixuan Zhang, Zhidong Li, Yang Wang et al.

Machine learning algorithms often struggle to eliminate inherent data biases, particularly those arising from unreliable labels, which poses a significant challenge in ensuring fairness. Existing fairness techniques that address label bias typically involve modifying models and intervening in the training process, but these lack flexibility for large-scale datasets. To address this limitation, we introduce a data selection method designed to efficiently and flexibly mitigate label bias, tailored to more practical needs. Our approach utilizes a zero-shot predictor as a proxy model that simulates training on a clean holdout set. This strategy, supported by peer predictions, ensures the fairness of the proxy model and eliminates the need for an additional holdout set, which is a common requirement in previous methods. Without altering the classifier's architecture, our modality-agnostic method effectively selects appropriate training data and has proven efficient and effective in handling label bias and improving fairness across diverse datasets in experimental evaluations.