CL CYJul 29, 2025

HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs

Kaixuan Wang, Chenxin Diao, Jason T. Jacques, Zhongliang Guo, Shuai Zhao

arXiv:2507.21815v1h-index: 4

Originality Synthesis-oriented

AI Analysis

This work addresses the safety and accuracy of LLMs in a critical public health domain for people who use drugs, but it is incremental as it focuses on benchmarking rather than proposing new methods.

The authors tackled the problem of evaluating large language models (LLMs) for providing harm reduction information to people who use drugs, finding that state-of-the-art LLMs often struggle with accuracy and pose severe safety risks.

Millions of individuals' well-being are challenged by the harms of substance use. Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLM's accuracy and safety risks in harm reduction information provision. The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge. Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information, and sometimes, carry out severe safety risks to PWUD. The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes. WARNING: This paper contains illicit content that potentially induces harms.

View on arXiv PDF

Similar