SD · AI · LG · AS · Oct 11, 2024

Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network

arXiv:2410.09289v2 · 2 citations · h-index: 23 · IEEE Transactions on Audio, Speech, and Language Processing
Originality: Incremental advance
AI Analysis

This work addresses the need for more effective multimodal fusion in audio-based medical diagnosis, offering a generalizable method that could enhance early and non-invasive disease detection, though it appears incremental in improving existing fusion approaches.

The paper tackles the problem of limited fusion strategies in multimodal audio-based disease prediction by proposing a transformer-based hierarchical fusion network that integrates intra-modal and inter-modal fusion, achieving state-of-the-art performance in predicting COVID-19, Parkinson's disease, and pathological dysarthria.

Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods in the field employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion. This approach limits the full exploitation of the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, the inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose a transformer-based hierarchical fusion network designed for general multimodal audio-based disease prediction. Specifically, we seamlessly integrate intra-modal and inter-modal fusion in a hierarchical manner and proficiently encode the necessary intra-modal and inter-modal complementary correlations, respectively. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance in predicting three diseases: COVID-19, Parkinson's disease, and pathological dysarthria, showcasing its promising potential in a broad context of audio-based disease prediction tasks. Additionally, extensive ablation studies and qualitative analyses highlight the significant benefits of each main component within our model.
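The two-stage idea described in the abstract (fuse feature domains within each modality first, then fuse across modalities) can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes identity query/key/value projections, a single attention head, and mean pooling, and the modality names "cough" and "speech" and all function names are hypothetical placeholders chosen for illustration only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Scaled dot-product self-attention.
    # Assumption: identity Q/K/V projections (no learned weights).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def mean_pool(tokens):
    # Average a list of equal-length vectors into one vector.
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def hierarchical_fusion(modalities):
    # modalities: dict mapping a modality name to a list of
    # feature-domain vectors, all of the same dimension.
    # Stage 1 (intra-modal): attend across feature domains within each modality.
    modality_vectors = [
        mean_pool(self_attention(domains)) for domains in modalities.values()
    ]
    # Stage 2 (inter-modal): attend across the per-modality summaries.
    return mean_pool(self_attention(modality_vectors))

# Hypothetical toy input: two modalities, two feature domains each.
example = {
    "cough": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    "speech": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
}
fused = hierarchical_fusion(example)
print(fused)
```

A real model would replace the identity projections with learned transformer layers and feed the fused vector to a classification head; the sketch only shows the order in which the two fusion levels are applied.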
