HC · May 14

Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift

arXiv:2605.15455 · 75.9
Predicted impact: top 6% in HC · last 90 days
Originality: Incremental advance
AI Analysis

For everyday users of chatbots, this work provides a novel interface to make LLM behavioral drift transparent, addressing the problem of opaque and unpredictable chatbot behavior.

The paper introduces multi-turn neural transparency, an interface that visualizes LLM internal activations to help users anticipate and recognize behavioral drift across conversation turns. In a user study (N=246), the visualization significantly improved users' ability to evaluate and anticipate trait expression (d = -0.34 to -0.49) and reduced overconfidence compared to no visualization.

Chatbot behavior is often opaque to users, as responses can shift unpredictably across a conversation, drifting toward sycophancy, toxicity, or other unsafe responses. This can leave users vulnerable, either being misled by overly agreeable AI or manipulated by a harmful chatbot that no longer behaves as intended. To address this, we introduce multi-turn neural transparency, an interface that surfaces an LLM's internal neural activations in real time to help users anticipate and recognize how behaviors change across turns. We construct behavioral vectors for six personality traits using methods from mechanistic interpretability, identifying directions in activation space that correlate with trait expression ($R^2 \geq 0.9$) via contrastive system prompts, and visualize trait expression using a sunburst and drift panel that updates at each turn. In a randomized controlled study (N = 246), participants predicted trait expression from a system prompt alone, then rated observed behavior after interacting with the chatbot for both assistant and role-play personas. We find that participants without visualization struggled to accurately evaluate traits (RMSE $\approx$ 0.6-0.7), while the inclusion of neural transparency significantly improved both anticipation and evaluation compared to no visualization (d = -0.34 to -0.49). The multi-turn dynamic visualization additionally outperformed the static single-turn visualization on holistic evaluation of model behavior (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy. These findings suggest that surfacing internal model representations to everyday users is a meaningful step toward more transparent and informed human-AI interaction.
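
To make the idea of a behavioral vector concrete, the sketch below shows one common way to derive a trait direction from contrastive prompts: take the difference between the mean activation under a trait-eliciting system prompt and the mean activation under a trait-suppressing one, then project each turn's activation onto that direction. This is an illustration under stated assumptions, not the paper's implementation. The abstract does not specify the extraction method, layer, or model, and the activations here are synthetic. All names (`trait_direction`, `trait_score`, the 8-dimensional toy space) are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trait_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean trait-suppressed activation
    to the mean trait-elicited activation (difference of means)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def trait_score(activation, direction):
    """Scalar projection of one activation onto the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))


# --- Synthetic stand-in for real model activations (assumption) ---
rng = random.Random(0)
DIM = 8  # toy hidden size; real models use thousands of dimensions


def fake_activation(strength, noise=0.05):
    """Small Gaussian noise everywhere, plus `strength` along axis 0,
    which plays the role of the 'true' trait axis in this toy setup."""
    vec = [rng.gauss(0.0, noise) for _ in range(DIM)]
    vec[0] += strength
    return vec


# Activations collected under contrastive system prompts.
pos_acts = [fake_activation(+2.0) for _ in range(50)]  # trait-eliciting prompt
neg_acts = [fake_activation(-2.0) for _ in range(50)]  # trait-suppressing prompt
direction = trait_direction(pos_acts, neg_acts)

# A conversation whose trait expression drifts upward turn by turn.
turn_acts = [fake_activation(0.5 * t) for t in range(6)]
scores = [trait_score(a, direction) for a in turn_acts]

for turn, score in enumerate(scores):
    print(f"turn {turn}: trait score = {score:.2f}")
```

Plotting the per-turn scores for each trait is the kind of signal a drift panel could display. In practice the projection would also need calibrating against observed trait expression, which is what the reported $R^2$ measures.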
