CL, May 10

Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness

Erfan Loweimi, Sofia de la Fuente Garcia, Samira Loveymi, Hadi Daneshvar, Saturnino Luz

arXiv:2605.09634

AI Analysis

This work identifies critical reliability gaps in LLM-based mental health screening (ASR robustness and evidence faithfulness) that must be addressed before clinical deployment.

LLMs can estimate HADS scores from speech zero-shot, but reliability varies: Phi-4 and Gemma-2-9B show high consistency (ICC>0.89) and evidence faithfulness (>93%), while Llama-3.1-8B degrades under ASR (ICC drops to 0.36 at 10% WER) and has lower keyword groundedness (77-81%).

LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.
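
The two headline metrics can be illustrated with a minimal sketch. The abstract does not state which ICC form is used or how keyword groundedness is scored, so both definitions below are assumptions: `icc_2_1` computes the two-way random-effects, absolute-agreement, single-measurement ICC (one row per participant, one column per repeated run), and `keyword_groundedness` is a hypothetical helper that counts the share of model-cited keywords appearing verbatim (case-insensitively) in the transcript.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    Assumed layout (not confirmed by the source): one row per participant,
    one column per independent run of the same model.
    """
    n = len(scores)        # participants
    k = len(scores[0])     # repeated runs
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def keyword_groundedness(keywords: Sequence[str], transcript: str) -> float:
    """Fraction of cited keywords found verbatim in the transcript.

    Hypothetical definition for illustration; the paper's matching rule
    (exact, stemmed, fuzzy) is not given in the abstract.
    """
    if not keywords:
        return 0.0
    text = transcript.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


if __name__ == "__main__":
    # Toy HADS-style scores: 4 participants x 3 runs (made-up values).
    runs = [[8, 9, 8], [14, 13, 14], [3, 3, 4], [11, 11, 10]]
    print(f"ICC(2,1) = {icc_2_1(runs):.3f}")

    cited = ["tired", "can't sleep", "hopeless"]
    transcript = "I feel tired and I can't sleep most nights."
    print(f"groundedness = {keyword_groundedness(cited, transcript):.2f}")
```

Under these assumptions, an ICC near 1 means repeated runs assign nearly the same score to each participant, while a groundedness below 1 flags cited evidence that does not occur in the input text.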
