CL · May 27, 2025

BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, Daniel McDuff, Cynthia Breazeal

arXiv:2505.21757v1 · 4.91 citations · h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the need for more realistic and balanced clinical agents that can adaptively switch between reactive and proactive behaviors. It is an incremental advance, however, because it builds on existing fine-tuning methods and adds a novel conditioning approach.

The paper tackles the problem of inconsistent proactivity in large language models (LLMs) used as clinical agents. It proposes BehaviorSFT, a training strategy that uses behavioral tokens to condition LLMs for dynamic behavioral selection. The method reaches up to 97.3% overall Macro F1 on BehaviorBench and raises proactive task scores, for example from 95.0% to 96.5% for Qwen2.5-7B-Ins.

Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a superior balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) versus standard fine-tuning or explicitly instructed agents.
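The headline metric above is Macro F1, which is the unweighted mean of the per-class F1 scores, so a rare behavior class counts as much as a common one. The following is a minimal sketch of that standard definition only; it is not the paper's evaluation code, and the function name `macro_f1` and the labels "reactive" and "proactive" are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 over all observed labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp = defaultdict(int)  # true positives per class
    fp = defaultdict(int)  # false positives per class
    fn = defaultdict(int)  # false negatives per class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, used only to illustrate the calculation.
y_true = ["reactive", "reactive", "proactive", "proactive"]
y_pred = ["reactive", "proactive", "proactive", "proactive"]
score = macro_f1(y_true, y_pred)
```

In this toy example the "reactive" class has F1 = 2/3 and the "proactive" class has F1 = 0.8, so the macro average is their plain mean, with no weighting by class frequency.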
