96.9 · AI · Jun 3
Agents' Last Exam
Yiyou Sun, Xinyang Han, Weichen Zhang et al.
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
63.9 · CY · Mar 19
Follow the Rules (or Not): Community Norms and AI-Generated Support in Online Health Communities
Shravika Mittal, Erin Kasson, Layna Paraboschi et al. · gatech
Generative AI (GenAI) is increasingly being integrated into the online ecosystem, including online health communities (OHCs), where people with diverse health conditions exchange social support. For example, in OHCs, support providers are beginning to share content generated, directly or indirectly, by popular GenAI-based tools. OHCs are governed by norms that define appropriate behavior when providing support. Ways in which AI-generated support interacts with these norms remain underexplored. Inappropriate conformance or outright violation can erode seekers' trust, distort decision-making, and threaten community sustenance. In this work, we examine whether (and how) AI-generated support conforms to norms, using popular opioid-use recovery subreddits as our testbed. First, we provide an inventory of norms regulating text-based support provision in OHCs. Next, using human-validated LLM judges, we assess the prevalence of AI's conformity to these norms. Finally, through an expert review, we identify risks to seekers (and OHCs) resulting from norm (non)conformity. Our analysis revealed that, while AI-generated support conforms to norms, such conformity may be inappropriate or insufficient, for example, by over- or under-validating seekers in distress. Moreover, we observed instances of outright norm violation. This work provides insights that can help moderators and OHC designers adapt existing and develop new norms to regulate AI integration, protecting both seekers and communities they rely on.
IR · Apr 25, 2024
Utilizing Large Language Models to Identify Reddit Users Considering Vaping Cessation for Digital Interventions
Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta et al.
The widespread adoption of social media platforms globally not only enhances users' connectivity and communication but also emerges as a vital channel for the dissemination of health-related information, thereby establishing social media data as an invaluable organic data resource for public health research. The surge in popularity of vaping or e-cigarette use in the United States and other countries has caused an outbreak of e-cigarette and vaping use-associated lung injury (EVALI), leading to hospitalizations and fatalities in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging large language models, including both the latest GPT-4 and traditional BERT-based language models, for sentence-level quit-vaping intention prediction tasks, this study compares the outcomes of these models against human annotations. Notably, when compared to human evaluators, the GPT-4 model demonstrates superior consistency in adhering to annotation guidelines and processes, showcasing advanced capabilities to detect nuanced user quit-vaping intentions that human evaluators might overlook. These preliminary findings emphasize the potential of GPT-4 in enhancing the accuracy and reliability of social media data analysis, especially in identifying users' subtle intentions that may elude human detection.
CL · Jun 28, 2024
Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach
Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta et al.
In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment, with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit-vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users' subtle intentions that may elude human detection.
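As a rough illustration of how the prompting strategies named in this abstract differ, the sketch below assembles zero-shot, one-shot, few-shot, and chain-of-thought prompts for a sentence-level labeling task. It does not reproduce the paper's 8 prompts: the function name `build_prompt`, the label names, and the instruction wording are all hypothetical, and no model is called.

```python
from typing import List, Optional, Tuple

# Hypothetical label names; the abstract does not specify the label set.
LABELS = ("quit_intention", "no_quit_intention")


def build_prompt(
    sentence: str,
    examples: Optional[List[Tuple[str, str]]] = None,
    chain_of_thought: bool = False,
) -> str:
    """Assemble a classification prompt as plain text.

    No examples      -> zero-shot
    One example      -> one-shot
    Several examples -> few-shot
    chain_of_thought -> additionally asks for step-by-step reasoning
    """
    examples = examples or []
    parts = [
        "Decide whether the sentence expresses an intention to quit vaping.",
        "Answer with one of: " + ", ".join(LABELS) + ".",
    ]
    if chain_of_thought:
        parts.append("Explain your reasoning step by step, then give the label.")
    # Labeled demonstrations come before the sentence to be classified.
    for text, label in examples:
        parts.append(f"Sentence: {text}\nLabel: {label}")
    parts.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(parts)


# Illustrative, made-up demonstrations (not data from the study).
demos = [
    ("I'm throwing my vape away tonight.", "quit_intention"),
    ("Which flavor should I try next?", "no_quit_intention"),
]

zero_shot = build_prompt("Thinking about stopping for good.")
one_shot = build_prompt("Thinking about stopping for good.", demos[:1])
few_shot = build_prompt("Thinking about stopping for good.", demos)
cot = build_prompt("Thinking about stopping for good.", chain_of_thought=True)
```

Each returned string would be sent to the model as a single prompt; comparing the model's labels against human annotations then proceeds the same way regardless of which strategy produced the prompt.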