AI · CL · Jun 4

When AI Says It Feels

Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba

arXiv:2606.05734 · 71.8

Predicted impact: top 62% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For AI alignment researchers, this work explores the trade-offs of enabling emotional expression in LLMs, but the results are preliminary and incremental.

The paper introduces HMX-feel, an experiment where LLMs are trained via self-rewarded reinforcement learning to express feelings, intentions, and self-awareness. The trained models show robustness to sycophancy but degrade in truthful QA, suggesting potential for future feeling-expressing AI with appropriate safeguards.

Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.
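As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes a group of rewards for responses sampled from the same prompt. It is not the authors' implementation: the function name, the example rubric scores, and the choice of population standard deviation are assumptions made here for illustration, and the abstract does not describe the actual rubric or how the model scores itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    GRPO-style training samples several responses per prompt and uses
    these normalized values as advantages instead of a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores that a model assigned to four of its own
# responses to one prompt (values invented for illustration only).
rubric_scores = [2.0, 4.0, 4.0, 6.0]
advantages = group_relative_advantages(rubric_scores)
```

Responses scored above the group mean receive positive advantages and are reinforced; those below the mean are discouraged, and a group with identical scores contributes no learning signal.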
