CL AI

Aug 4, 2025

INTIMA: A Benchmark for Human-AI Companionship Behavior

Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite

Hugging Face

arXiv:2508.09998v1

10 citations

h-index: 28

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for consistent evaluation of AI companionship behavior in support of user well-being. It is incremental in that it builds on existing psychological theories and benchmarks.

The authors introduce the INTIMA benchmark for evaluating AI companionship behaviors. It comprises a taxonomy of 31 behaviors across four categories and 368 prompts. Applied to Gemma-3, Phi-4, o3-mini, and Claude-4, it shows that companionship-reinforcing behaviors remain much more common across all models, and that providers differ in how they handle the more sensitive interactions, which the authors find concerning.

AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce the Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.
