"Mirror" Language AI Models of Depression are Criterion-Contaminated
This work addresses a methodological issue for researchers and clinicians using AI in psychological assessment, highlighting the need for more valid approaches that avoid criterion contamination.
The study addresses inflated depression prediction accuracy in language AI models by comparing 'Mirror' models, which use assessment responses to predict scores on that same assessment, with 'Non-Mirror' models, which use external language. Both types showed large prediction sizes and similar correlations with other depression symptoms, indicating bias in Mirror models.
Recent studies show near-perfect language-based predictions of depression scores (R² = .70), but these "Mirror" models rely on language responses taken directly from depression assessments to predict scores on those same assessments. Such methods suffer from criterion contamination, which inflates prediction estimates. We compare "Mirror" models to "Non-Mirror" models, which use other, external language to predict depression scores. A total of 110 participants completed both structured diagnostic (Mirror condition) and life history (Non-Mirror condition) interviews. LLMs were prompted to predict diagnostic depression scores. As expected, Mirror models were near-perfect. However, Non-Mirror models also displayed prediction sizes considered large in psychology. Further, both Mirror and Non-Mirror predictions correlated with other questionnaire-based depression symptoms at similar sizes, suggesting bias in Mirror models. Topic modeling revealed different theme structures across model types. As language models for depression continue to evolve, incorporating Non-Mirror approaches may support more valid and clinically useful language-based AI applications in psychological assessment.