Weilun Xu

SD
h-index: 7
Papers: 7
Citations: 16
Novelty: 41%
AI Score: 47

7 Papers

CV · Feb 10, 2023
Artificial Intelligence System for Detection and Screening of Cardiac Abnormalities using Electrocardiogram Images

Deyun Zhang, Shijia Geng, Yang Zhou et al.

Artificial intelligence (AI) systems have achieved expert-level performance in electrocardiogram (ECG) signal analysis. However, in underdeveloped countries or regions where healthcare information systems are incomplete, only paper ECGs are available. Analysis of real-world ECG images (photos or scans of paper ECGs) remains challenging due to complex environments and interference. In this study, we present an AI system developed to detect and screen cardiac abnormalities (CAs) from real-world ECG images. The system was evaluated on a large dataset of 52,357 patients from multiple regions and populations across the world. On the detection task, the AI system obtained an area under the receiver operating characteristic curve (AUC) of 0.996 (hold-out test), 0.994 (external test 1), 0.984 (external test 2), and 0.979 (external test 3). The detection results of the AI system also showed a strong correlation with the diagnoses of cardiologists (cardiologist 1: R=0.794, p<1e-3; cardiologist 2: R=0.812, p<1e-3). On the screening task, the AI system achieved AUCs of 0.894 (hold-out test) and 0.850 (external test). The screening performance of the AI system was better than that of the cardiologists (AI system: 0.846; cardiologist 1: 0.520; cardiologist 2: 0.480). Our study demonstrates the feasibility of an accurate, objective, easy-to-use, fast, and low-cost AI system for CA detection and screening. The system has the potential to be used by healthcare professionals, caregivers, and general users to assess CAs based on real-world ECG images.
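
For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of how an AUC can be computed from binary labels and model scores, using the pairwise-ranking interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one). The function name `roc_auc` and the toy data are illustrative assumptions; they are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = abnormal) and scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The quadratic pair loop is chosen for readability; a rank-based computation would be preferable on datasets of the size reported in the abstract.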

SP · Mar 14, 2022
Defending Against Adversarial Attack in ECG Classification with Adversarial Distillation Training

Jiahao Shao, Shijia Geng, Zhaoji Fu et al.

In clinics, doctors rely on electrocardiograms (ECGs) to assess severe cardiac disorders. Owing to the development of technology and the increase in health awareness, ECG signals are now obtained using both medical and commercial devices. Deep neural networks (DNNs) can be used to analyze these signals because of their high accuracy. However, researchers have found that adversarial attacks can significantly reduce the accuracy of DNNs. Studies have been conducted to defend ECG-based DNNs against traditional adversarial attacks, such as projected gradient descent (PGD), and against smooth adversarial perturbation (SAP), which targets ECG classification; however, to the best of our knowledge, no study has comprehensively explored defenses against adversarial attacks targeting ECG classification. We therefore conducted experiments to explore the effects of defense methods against white-box and black-box adversarial attacks targeting ECG classification, and found that some common defense methods performed well against these attacks. We also propose a new defense method called Adversarial Distillation Training (ADT), which is derived from defensive distillation and can effectively improve the generalization performance of DNNs. The results show that our method performed more effectively against adversarial attacks targeting ECG classification than the other baseline methods, namely adversarial training, defensive distillation, Jacobian regularization, and noise-to-signal ratio regularization. Furthermore, we found that our method performed better against PGD attacks with low noise levels, indicating stronger robustness.
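
As a rough illustration of the PGD attack mentioned above, here is a minimal sketch of an L-infinity PGD loop against a toy logistic-regression classifier. This is not the paper's DNN, its attack configuration, or the ADT defense; the model, the weights, and the parameters `eps`, `step`, and `iters` are all assumptions made for illustration, and the random start often used in practice is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def predict(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def pgd_linf(x, y, w, b, eps, step, iters):
    """L-infinity PGD: repeated signed-gradient ascent on the loss, then projection."""
    x_adv = list(x)
    for _ in range(iters):
        p = predict(x_adv, w, b)
        # Gradient of binary cross-entropy w.r.t. the input is (p - y) * w.
        grad = [(p - y) * wi for wi in w]
        # Step in the direction that increases the loss.
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the eps-ball around the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Hypothetical weights and a made-up 3-sample "signal" with true label 1.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.5, -0.2, 0.3], 1
x_adv = pgd_linf(x, y, w, b, eps=0.3, step=0.1, iters=10)
p_clean = predict(x, w, b)
p_adv = predict(x_adv, w, b)
print(p_clean, p_adv)
```

The perturbation stays within `eps` of the clean input in every coordinate while the model's confidence in the true class drops, which is the behavior a defense is meant to resist.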

LG · May 16
Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Weilun Xu

In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_U$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training -- even as loss keeps improving -- while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from $W_U$'s readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.

SP · Feb 21, 2025 · Code
On-device Computation of Single-lead ECG Parameters for Real-time Remote Cardiac Health Assessment: A Real-world Validation Study

Sumei Fan, Deyun Zhang, Yue Wang et al.

Accurate, continuous out-of-hospital electrocardiogram (ECG) parameter measurement is vital for real-time cardiac health monitoring and telemedicine. On-device computation of single-lead ECG parameters enables timely assessment without reliance on centralized data processing, advancing personalized, ubiquitous cardiac care, yet comprehensive validation across heterogeneous real-world populations remains limited. This study validated the on-device algorithm FeatureDB (https://github.com/PKUDigitalHealth/FeatureDB) using two datasets: HeartVoice-ECG-lite (369 participants with single-lead ECGs annotated by two physicians) and PTB-XL/PTB-XL+ (21,354 patients with 12-lead ECGs and physicians' diagnostic annotations). FeatureDB computed PR, QT, and QTc intervals, with accuracy evaluated against physician annotations via mean absolute error (MAE), correlation analysis, and Bland-Altman analysis. Diagnostic performance for first-degree atrioventricular block (AVBI, PR-based) and long QT syndrome (LQT, QTc-based) was benchmarked against commercial 12-lead systems (12SL, Uni-G) and the open-source algorithm Deli, using AUC, accuracy, sensitivity, and specificity. Results showed high concordance with expert annotations (Pearson correlations: 0.836-0.960), MAEs matching inter-observer variability, and minimal bias. AVBI AUC reached 0.787 (12SL: 0.859; Uni-G: 0.812; Deli: 0.501); LQT AUC was 0.684 (12SL: 0.716; Uni-G: 0.605; Deli: 0.569), comparable to commercial tools and superior to open-source alternatives. FeatureDB delivers physician-level parameter accuracy and commercial-grade abnormality detection via single-lead devices, supporting scalable telemedicine, decentralized cardiac screening, and continuous monitoring in community and outpatient settings.
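
The agreement metrics named above (MAE and Bland-Altman analysis) can be sketched as follows. This is a generic illustration and not FeatureDB's evaluation code; the function names and the PR-interval values are made up, and the 1.96 multiplier assumes the conventional 95% limits of agreement.

```python
from statistics import mean, stdev


def mean_absolute_error(reference, measured):
    """Average absolute difference between paired measurements."""
    if len(reference) != len(measured):
        raise ValueError("inputs must be paired")
    return mean(abs(m - r) for r, m in zip(reference, measured))


def bland_altman(reference, measured):
    """Return (bias, lower limit, upper limit) of agreement.

    bias is the mean difference (measured - reference); the limits are
    bias +/- 1.96 times the sample standard deviation of the differences.
    """
    if len(reference) != len(measured):
        raise ValueError("inputs must be paired")
    diffs = [m - r for r, m in zip(reference, measured)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up PR intervals in milliseconds, for illustration only.
physician = [160.0, 172.0, 148.0, 200.0, 184.0]
algorithm = [158.0, 176.0, 150.0, 196.0, 186.0]

mae = mean_absolute_error(physician, algorithm)
bias, lower, upper = bland_altman(physician, algorithm)
print(mae, bias, lower, upper)
```

A bias near zero with narrow limits of agreement is what "minimal bias" refers to in this kind of analysis; MAE alone cannot distinguish a systematic offset from random scatter.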

SD · Mar 27
Acoustic Overspecification in Electronic Dance Music Taxonomy

Weilun Xu, Tianhao Dai, Oscar Goudet et al.

Electronic Dance Music (EDM) classification typically relies on industry-defined taxonomies, with current supervised approaches naturally assuming the validity of prescribed subgenre labels. However, whether these commercial distinctions reflect genuine acoustic differences remains largely unexplored. In this paper, we propose an unsupervised approach to discover the natural acoustic structure of EDM independent of commercial labels. To address the historical lack of EDM-specific feature design in MIR, we systematically construct a tailored, interpretable acoustic feature space capturing the genre's defining production techniques, spectral textures, and layered rhythmic patterns. To ensure our findings reflect inherent acoustic structure rather than feature engineering artifacts, we validate our clustering against state-of-the-art pre-trained audio embeddings (MERT and CLAP). Across both our bespoke feature space and the pre-trained embeddings, clustering consistently identifies 20 or fewer natural acoustic families -- suggesting current commercial EDM taxonomy is acoustically overspecified by nearly one-half.

SD · Mar 27
Diachronic Modeling of Tonal Coherence on the Tonnetz Across Classical and Popular Repertoires

Weilun Xu, Edward Hall, Martin Rohrmeier

How do different musical traditions achieve tonal coherence? Most computational measures to date have analysed tonal coherence along a single dimension, whereas multi-dimensional analyses remain insufficiently explored. We propose a new model drawing on the concept of the Tonnetz and define two partially independent measures: \emph{tonal focus}, the concentration of pitch content near a tonal center; and \emph{tonal connection}, the degree to which pitch content reflects structured intervallic pathways back to that center. Analyzing over 2,800 pieces from Western classical and popular traditions, we find that these traditions occupy overlapping yet distinguishable regions of the two-dimensional space. Popular music shows higher tonal focus, while classical music exhibits higher tonal connection. Our complementary measures ground the differences between tonal styles in quantitative evidence, and offer interpretable dimensions for computational music analysis and controllable generation.

CL · Mar 24
Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

Weilun Xu, Alexander Rusnak, Frederic Kaplan

When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B--72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns -- e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.