An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
This work addresses the need for more clinically applicable and interpretable AI tools for early depression screening. Its contribution is incremental, as it builds on existing Audio Spectrogram Transformer methods.
The researchers aim to improve speech-based depression detection with an interpretable foundation model that operates on long-duration speech rather than short segments. The model outperforms a segment-level model and identifies reduced loudness and F0 as prediction-relevant acoustic features.
Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) to detect depression using long-duration speech instead of short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinician interpretation. Our experiments show the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech duration for more reliable depression detection. Through interpretation, we observe our model identifies reduced loudness and fundamental frequency (F0) as relevant depression signals, aligning with documented clinical findings. This interpretability supports a responsible AI approach for speech-based depression detection, rendering such tools more clinically applicable.
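The abstract does not say how loudness and F0 are measured, so the following is only a minimal sketch of what these two acoustic quantities are, not the authors' interpretation method. It assumes loudness can be approximated by RMS level in dB and F0 by the strongest autocorrelation lag. The function names, the 75 to 400 Hz search range, and the synthetic sine-wave input are all illustrative assumptions.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to a full-scale amplitude of 1.0 (a crude loudness proxy)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square + 1e-12)  # small offset avoids log(0) on silence


def estimate_f0(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate F0 in Hz as the lag with the highest autocorrelation within [fmin, fmax]."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation of the frame with itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def sine(freq, amplitude, sample_rate=8000, n_samples=800):
    """Hypothetical test signal: a pure tone standing in for a voiced speech frame."""
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n_samples)]


if __name__ == "__main__":
    loud = sine(200.0, 1.0)
    quiet = sine(200.0, 0.5)
    print("F0 estimate (Hz):", estimate_f0(loud, 8000))
    print("Level, loud vs quiet (dB):", rms_db(loud), rms_db(quiet))
```

In this toy setting, halving the amplitude lowers the RMS level by about 6 dB while the F0 estimate is unchanged, which shows that the two features vary independently. Real speech would need framing, voicing detection, and a more robust pitch tracker.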