May 27, 2025

Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation

arXiv:2505.20745v2 | 3 citations | h-index: 10 | INTERSPEECH
Originality: Synthesis-oriented
AI Analysis

This work addresses heart rate estimation from heart sounds for medical applications, but it is incremental as it builds on existing foundation models and benchmarks.

The study investigates whether the representations of pre-trained acoustic foundation models encode enough auscultation information to estimate heart rate. Representations from the audio encoder of an in-house CLAP model achieved a lower mean absolute error than the acoustic-feature baseline.

Auscultation, particularly of heart sounds, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, wavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method from Nie et al., 2024 (which relies on acoustic features) and show that, overall, representation vectors from pre-trained FMs offer performance comparable to the baseline. Notably, HR estimation using the representations from the audio encoder of the in-house CLAP model outperforms the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.
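The abstract compares systems by mean absolute error (MAE) and examines the FMs layer by layer. The following is a minimal sketch of how such a comparison could be scored, not the authors' implementation. The function names, the layer labels, and every number in it are hypothetical placeholders chosen for illustration; none of them come from the paper.

```python
from typing import Dict, Sequence, Tuple


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE = (1/N) * sum(|y_true_i - y_pred_i|), here in beats per minute."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_layer_by_mae(
    y_true: Sequence[float],
    preds_by_layer: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return (layer_name, mae) for the layer whose predictions have the lowest MAE."""
    scores = {
        name: mean_absolute_error(y_true, preds)
        for name, preds in preds_by_layer.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy, made-up numbers for illustration only (not results from the paper).
reference_bpm = [60.0, 72.0, 90.0]
toy_predictions = {
    "layer_0": [66.0, 72.0, 84.0],  # absolute errors 6, 0, 6
    "layer_1": [61.0, 70.0, 90.0],  # absolute errors 1, 2, 0
}

best_name, best_mae = best_layer_by_mae(reference_bpm, toy_predictions)
print(best_name, best_mae)
```

In a real layer-wise study, each entry of `toy_predictions` would hold the HR estimates produced by a model trained on one layer's representations, and the selection would normally be made on a validation split rather than on the test split.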

