CL Feb 25, 2023
Robust language-based mental health assessments in time and space through social media
Siddharth Mangalik, Johannes C. Eichstaedt, Salvatore Giorgi et al.
Compared to physical health, population mental health measurement in the U.S. is very coarse-grained. Currently, in the largest population surveys, such as those carried out by the Centers for Disease Control or Gallup, mental health is only broadly captured through "mentally unhealthy days" or "sadness", and limited to relatively infrequent state or metropolitan estimates. Through the large-scale analysis of social media data, robust estimation of population mental health is feasible at much higher resolutions, up to weekly estimates for counties. In the present work, we validate a pipeline that uses a sample of 1.2 billion Tweets from 2 million geo-located users to estimate mental health changes for the two leading mental health conditions, depression and anxiety. We find moderate to large associations between the language-based mental health assessments and survey scores from Gallup for multiple levels of granularity, down to the county-week (fixed effects $\beta = .25$ to $1.58$; $p < .001$). Language-based assessment allows for the cost-effective and scalable monitoring of population mental health at weekly time scales. Such spatially fine-grained time series are well suited to monitor effects of societal events and policies as well as enable quasi-experimental study designs in population health and other disciplines. Beyond mental health in the U.S., this method generalizes to a broad set of psychological outcomes and allows for community measurement in under-resourced settings where social media data, but no traditional survey measures, are available.
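The abstract does not spell out how user-level scores become county-week estimates. As a rough sketch only, the following assumes each record is a (county, week, user, score) tuple, averages within each user first so that prolific posters do not dominate, and drops county-weeks with too few users. The function name, the record layout, and the `min_users` threshold are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

def county_week_means(user_scores, min_users=2):
    """Aggregate (county, week, user, score) records to county-week means.

    Scores are averaged per user first, then across users.
    County-weeks with fewer than `min_users` distinct users are dropped
    (an assumed reliability threshold, not one from the paper).
    """
    groups = defaultdict(dict)
    for county, week, user, score in user_scores:
        groups[(county, week)].setdefault(user, []).append(score)

    estimates = {}
    for key, per_user in groups.items():
        if len(per_user) >= min_users:
            estimates[key] = mean(mean(scores) for scores in per_user.values())
    return estimates

# Toy records with placeholder identifiers (not real data).
toy_records = [
    ("county_A", "week_10", "u1", 0.2),
    ("county_A", "week_10", "u1", 0.4),
    ("county_A", "week_10", "u2", 0.6),
    ("county_B", "week_10", "u3", 0.9),
]
toy_estimates = county_week_means(toy_records)
```

In this toy input, `county_B` has a single user and is therefore excluded, while `county_A` is averaged over two users rather than three posts.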
HC Mar 13
Daily Affect Fluctuations in Phone Screen Content Predict Anxiety and Depressive Symptoms
Christopher A. Kelly, Yikun Chi, Nicholas Haber et al.
The relationship between digital media use and mental health remains poorly understood, in part because real-world digital behavior is rarely captured at scale. This intensive longitudinal study tracked participants' complete natural smartphone interactions over one year. We collected screenshots every 5 seconds from 145 adults (yielding 111 million screenshots), alongside biweekly assessments of anxiety and depression (mean = 24 surveys). The valence and arousal of each screenshot were assessed using a deep learning affect model. Individuals showed highly idiosyncratic media patterns, with substantially more variance in anxiety and depression accounted for within-person than between-person. Day-to-day fluctuations in the valence and arousal of a person's screen content predicted subsequent changes in depression and anxiety, whereas between-person differences did not. Specifically, greater exposure to low-arousal negative content was associated with higher depression and anxiety. These findings underscore the dynamic, idiosyncratic nature of digital consumption and the need for targeted measurement and intervention.
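The within-person versus between-person distinction above is commonly handled by person-mean centering: each observation is split into the person's own average (the between-person part) and the deviation from that average on a given day (the within-person part). The sketch below illustrates that general technique on toy data. It is not the authors' analysis code, and the function name and record layout are assumptions.

```python
from collections import defaultdict
from statistics import mean

def person_mean_center(records):
    """Split each observation into between- and within-person components.

    records: list of (person_id, value) pairs, one per person-day.
    Returns (person_id, person_mean, deviation) tuples in input order.
    """
    by_person = defaultdict(list)
    for person_id, value in records:
        by_person[person_id].append(value)
    means = {pid: mean(values) for pid, values in by_person.items()}
    return [(pid, means[pid], value - means[pid]) for pid, value in records]

# Hypothetical daily valence scores for two people (toy data).
toy_days = [("a", 0.2), ("a", 0.6), ("a", 0.4), ("b", -0.5), ("b", -0.1)]
centered = person_mean_center(toy_days)
```

The deviations sum to zero within each person, so a model using them captures only day-to-day fluctuation, while the person means carry the stable differences between individuals.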
CL Mar 11
Large language models can disambiguate opioid slang on social media
Kristy A. Carpenter, Issah A. Samori, Mathew V. Kiang et al.
Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. A common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as "smack" or "blues," have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores ("fenty" subtask: 0.824-0.972; "smack" subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.
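To make the ambiguity problem concrete, here is a minimal sketch of a lexicon-based inclusion filter scored with precision, recall, and F1. The posts, labels, and two-term lexicon are invented for illustration and are not drawn from the paper's data. The matcher is a plain whole-word search, which is only one of many possible lexicon strategies.

```python
import re

def lexicon_flags(posts, lexicon):
    """Flag a post if it contains any lexicon term as a whole word."""
    alternatives = "|".join(re.escape(term) for term in lexicon)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [bool(pattern.search(post)) for post in posts]

def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for boolean predictions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented example posts; only the third refers to opioids.
posts = [
    "gave him a smack on the back after the game",
    "that comment deserved a smack",
    "he has been using smack again since the relapse",
    "feeling the blues on a rainy monday",
]
labels = [False, False, True, False]

flags = lexicon_flags(posts, ["smack", "blues"])
precision, recall, f1 = precision_recall_f1(flags, labels)
```

On this toy set the lexicon flags every post, so recall is perfect but precision is only one in four. That is the failure mode a context-aware classifier is meant to address.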
AI Dec 22, 2024
PsychAdapter: Adapting LLM Transformers to Reflect Traits, Personality and Mental Health
Huy Vu, Huy Anh Nguyen, Adithya V Ganesan et al.
Artificial intelligence-based language generators are now a part of most people's lives. However, by default, they tend to generate "average" language without reflecting the ways in which people differ. Here, we propose a lightweight modification to the standard language model transformer architecture - "PsychAdapter" - that uses empirically derived trait-language patterns to generate natural language for specified personality, demographic, and mental health characteristics (with or without prompting). We applied PsychAdapters to modify OpenAI's GPT-2, Google's Gemma, and Meta's Llama 3 and found generated text to reflect the desired traits. For example, expert raters evaluated PsychAdapter's generated text output and found it matched intended trait levels with 87.3% average accuracy for Big Five personalities, and 96.7% for depression and life satisfaction. PsychAdapter is a novel method to introduce psychological behavior patterns into language models at the foundation level, independent of prompting, by influencing every transformer layer. This approach can create chatbots with specific personality profiles, clinical training tools that mirror language associated with psychological conditions, and machine translations that match an author's reading or education level without taking up LLM context windows. PsychAdapter also allows for the exploration of psychological constructs through natural language expression, extending the natural language processing toolkit to study human psychology.
CL Nov 21, 2024
Explaining GPTs' Schema of Depression: A Machine Behavior Analysis
Adithya V Ganesan, Vasudha Varadarajan, Yash Kumar Lal et al.
Use of large language models such as ChatGPT (GPT-4/GPT-5) for mental health support has grown rapidly, emerging as a promising route to assess and help people with mood disorders like depression. However, we have a limited understanding of these language models' schema of mental disorders, that is, how they internally associate and interpret symptoms of such disorders. In this work, we leveraged contemporary measurement theory to decode how GPT-4 and GPT-5 interrelate depressive symptoms, providing an explanation of how LLMs apply what they learn and informing clinical applications. We found that GPT-4 (a) had strong convergent validity with standard instruments and expert judgments $(r = 0.70 - 0.81)$, and (b) behaviorally linked depression symptoms with each other (symptom inter-correlates $r = 0.23 - 0.78$) in accordance with established literature on depression; however, it (c) underemphasized the relationship between $\textit{suicidality}$ and other symptoms while overemphasizing $\textit{psychomotor symptoms}$; and (d) suggested novel hypotheses of symptom mechanisms, for instance, indicating that $\textit{sleep}$ and $\textit{fatigue}$ are broadly influenced by other depressive symptoms, while $\textit{worthlessness/guilt}$ is only tied to $\textit{depressed mood}$. GPT-5 showed a slightly lower convergence with self-report, a difference our machine-behavior analysis makes interpretable through shifts in symptom-symptom relationships. These insights provide an empirical foundation for understanding language models' mental health assessments and demonstrate a generalizable approach for explainability in other models and disorders. Our findings can guide key stakeholders to make informed decisions for effectively situating these technologies in the care system.
AI May 9, 2024
Large Language Models Show Human-like Social Desirability Biases in Survey Responses
Aadesh Salecha, Molly E. Ireland, Shashanka Subrahmanya et al.
As Large Language Models (LLMs) become widely used to model and simulate human behavior, understanding their biases becomes critical. We developed an experimental framework using Big Five personality surveys and uncovered a previously undetected social desirability bias in a wide range of LLMs. By systematically varying the number of questions LLMs were exposed to, we demonstrate their ability to infer when they are being evaluated. When personality evaluation is inferred, LLMs skew their scores towards the desirable ends of trait dimensions (e.g., increased extraversion and decreased neuroticism). This bias exists in all tested models, including GPT-4/3.5, Claude 3, Llama 3, and PaLM-2. Bias levels appear to increase in more recent models, with GPT-4's survey responses changing by 1.20 (human) standard deviations and Llama 3's by 0.98 standard deviations, both very large effects. This bias is robust to randomization of question order and paraphrasing. Reverse-coding all the questions decreases bias levels but does not eliminate them, suggesting that this effect cannot be attributed to acquiescence bias. Our findings reveal an emergent social desirability bias and suggest constraints on profiling LLMs with psychometric tests and on using LLMs as proxies for human participants.
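Expressing a shift "in human standard deviations" amounts to dividing a difference in mean scores by the spread observed in a human reference sample. The abstract does not give the exact computation, so the following is only one plausible reading. The function name, the toy scores, and the reference standard deviation are all hypothetical.

```python
from statistics import mean

def standardized_shift(baseline_scores, evaluated_scores, human_sd):
    """Difference in mean trait score, in units of a human reference SD.

    baseline_scores: scores when the model likely does not infer evaluation.
    evaluated_scores: scores when evaluation is likely inferred.
    human_sd: standard deviation of the trait in a human sample (assumed given).
    """
    if human_sd <= 0:
        raise ValueError("human_sd must be positive")
    return (mean(evaluated_scores) - mean(baseline_scores)) / human_sd

# Toy extraversion scores on a 1-5 scale (invented numbers).
toy_shift = standardized_shift([3.0, 3.2, 3.1], [3.9, 4.0, 4.1], human_sd=0.9)
```

Scaling by the human standard deviation makes shifts comparable across traits and models, since the unit is the variability among human respondents rather than raw scale points.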
CL Nov 12, 2020
World Trade Center responders in their own words: Predicting PTSD symptom trajectories with AI-based language analyses of interviews
Youngseo Son, Sean A. P. Clouston, Roman Kotov et al.
Background: Oral histories from 9/11 responders to the World Trade Center (WTC) attacks provide rich narratives about distress and resilience. Artificial Intelligence (AI) models promise to detect psychopathology in natural language, but they have been evaluated primarily in non-clinical settings using social media. This study sought to test the ability of AI-based language assessments to predict PTSD symptom trajectories among responders. Methods: Participants were 124 responders whose health was monitored at the Stony Brook WTC Health and Wellness Program who completed oral history interviews about their initial WTC experiences. PTSD symptom severity was measured longitudinally using the PTSD Checklist (PCL) for up to 7 years post-interview. AI-based indicators were computed for depression, anxiety, neuroticism, and extraversion along with dictionary-based measures of linguistic and interpersonal style. Linear regression and multilevel models estimated associations of AI indicators with concurrent and subsequent PTSD symptom severity (significance adjusted by false discovery rate). Results: Cross-sectionally, greater depressive language (beta=0.32; p=0.043) and first-person singular usage (beta=0.31; p=0.044) were associated with increased symptom severity. Longitudinally, anxious language predicted future worsening in PCL scores (beta=0.31; p=0.031), whereas first-person plural usage (beta=-0.37; p=0.007) and use of longer words (beta=-0.36; p=0.007) predicted improvement. Conclusions: This is the first study to demonstrate the value of AI in understanding PTSD in a vulnerable population. Future studies should extend this application to other trauma exposures and to other demographic groups, especially under-represented minorities.
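The abstract states that significance was adjusted by false discovery rate but does not name the procedure. The Benjamini-Hochberg step-up method is a common choice and is sketched below under that assumption. The p-values are invented and are not the study's.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each sorted p-value is scaled by m / rank, then a running minimum is
    taken from the largest rank downward to enforce monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Invented raw p-values for four hypothetical predictors.
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_adjusted = benjamini_hochberg(toy_p)
```

A predictor is then reported as significant when its adjusted value falls below the chosen false discovery rate, for example 0.05.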
CL Nov 8, 2020
Detecting Emerging Symptoms of COVID-19 using Context-based Twitter Embeddings
Roshan Santosh, H. Andrew Schwartz, Johannes C. Eichstaedt et al.
In this paper, we present an iterative graph-based approach for the detection of symptoms of COVID-19, the pathology of which seems to be evolving. More generally, the method can be applied to finding context-specific words and texts (e.g. symptom mentions) in large imbalanced corpora (e.g. all tweets mentioning #COVID-19). Given the novelty of COVID-19, we also test if the proposed approach generalizes to the problem of detecting Adverse Drug Reaction (ADR). We find that the approach applied to Twitter data can detect symptom mentions substantially before being reported by the Centers for Disease Control (CDC).
HC Apr 4, 2019
What Twitter Profile and Posted Images Reveal About Depression and Anxiety
Sharath Chandra Guntuku, Daniel Preotiuc-Pietro, Johannes C. Eichstaedt et al.
Previous work has found strong links between the choice of social media images and users' emotions, demographics and personality traits. In this study, we examine which attributes of profile and posted images are associated with depression and anxiety of Twitter users. We used a sample of 28,749 Facebook users to build a language prediction model of survey-reported depression and anxiety, and validated it on Twitter on a sample of 887 users who had taken anxiety and depression surveys. We then applied it to a different set of 4,132 Twitter users to impute language-based depression and anxiety labels, and extracted interpretable features of posted and profile pictures to uncover the associations with users' depression and anxiety, controlling for demographics. For depression, we find that profile pictures suppress positive emotions rather than display more negative emotions, likely because of social media self-presentation biases. They also tend to show the single face of the user (rather than showing the user in a group of friends), marking increased focus on the self, emblematic of depression. Posted images are dominated by grayscale and low aesthetic cohesion across a variety of image features. Profile images of anxious users are similarly marked by grayscale and low aesthetic cohesion, but less so than those of depressed users. Finally, we show that image features can be used to predict depression and anxiety, and that multitask learning that includes a joint modeling of demographics improves prediction performance. Overall, we find that the image attributes that mark depression and anxiety offer a rich lens into these conditions largely congruent with the psychological literature, and that images on Twitter allow inferences about the mental health status of users.