Michael S. A. Graziano

LG
h-index3
6papers
80citations
Novelty51%
AI Score46

6 Papers

LGFeb 6Code
Endogenous Resistance to Activation Steering in Language Models

Alex McKenzie, Keenan Pepper, Stijn Servaes et al.

Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.

HCNov 17, 2023
Chatbots as social companions: How people perceive consciousness, human likeness, and social health benefits in machines

Rose E. Guingrich, Michael S. A. Graziano

As artificial intelligence (AI) becomes more widespread, one question that arises is how human-AI interaction might impact human-human interaction. Chatbots, for example, are increasingly used as social companions, and while much is speculated, little is known empirically about how their use impacts human relationships. A common hypothesis is that relationships with companion chatbots are detrimental to social health by harming or replacing human interaction, but this hypothesis may be too simplistic, especially considering the social needs of users and the health of their preexisting human relationships. To understand how relationships with companion chatbots impact social health, we studied people who regularly used companion chatbots and people who did not use them. Contrary to expectations, companion chatbot users indicated that these relationships were beneficial to their social health, whereas non-users viewed them as harmful. Another common assumption is that people perceive conscious, humanlike AI as disturbing and threatening. Among both users and non-users, however, we found the opposite: perceiving companion chatbots as more conscious and humanlike correlated with more positive opinions and more pronounced social health benefits. Detailed accounts from users suggested that these humanlike chatbots may aid social health by supplying reliable and safe interactions, without necessarily harming human relationships, but this may depend on users' preexisting social needs and how they perceive both human likeness and mind in the chatbot.

LGJul 14, 2024
Unexpected Benefits of Self-Modeling in Neural Systems

Vickram N. Premakumar, Michael Vaiana, Florin Pop et al.

Self-models have been a topic of great interest for decades in studies of human cognition and more recently in machine learning. Yet what benefits do self-models confer? Here we show that when artificial networks learn to predict their internal states as an auxiliary task, they change in a fundamental way. To better perform the self-model task, the network learns to make itself simpler, more regularized, more parameter-efficient, and therefore more amenable to being predictively modeled. To test the hypothesis of self-regularizing through self-modeling, we used a range of network architectures performing three classification tasks across two modalities. In all cases, adding self-modeling caused a significant reduction in network complexity. The reduction was observed in two ways. First, the distribution of weights was narrower when self-modeling was present. Second, a measure of network complexity, the real log canonical threshold (RLCT), was smaller when self-modeling was present. Not only were measures of complexity reduced, but the reduction became more pronounced as greater training weight was placed on the auxiliary task of self-modeling. These results strongly support the hypothesis that self-modeling is more than simply a network learning to predict itself. The learning has a restructuring effect, reducing complexity and increasing parameter efficiency. This self-regularization may help explain some of the benefits of self-models reported in recent machine learning literature, as well as the adaptive value of self-models to biological systems. In particular, these findings may shed light on the possible interaction between the ability to model oneself and the ability to be more easily modeled by others in a social or cooperative context.

CLFeb 10
Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Keenan Pepper, Alex McKenzie, Florin Pop et al.

Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.

HCSep 23, 2025
A Longitudinal Randomized Control Study of Companion Chatbot Use: Anthropomorphism and Its Mediating Role on Social Impacts

Rose E. Guingrich, Michael S. A. Graziano

Many Large Language Model (LLM) chatbots are designed and used for companionship, and people have reported forming friendships, mentorships, and romantic partnerships with them. Concerns that companion chatbots may harm or replace real human relationships have been raised, but whether and how these social consequences occur remains unclear. In the present longitudinal study ($N = 183$), participants were randomly assigned to a chatbot condition (text chat with a companion chatbot) or to a control condition (text-based word games) for 10 minutes a day for 21 days. Participants also completed four surveys during the 21 days and engaged in audio recorded interviews on day 1 and 21. Overall, social health and relationships were not significantly impacted by companion chatbot interactions across 21 days of use. However, a detailed analysis showed a different story. People who had a higher desire to socially connect also tended to anthropomorphize the chatbot more, attributing humanlike properties to it; and those who anthropomorphized the chatbot more also reported that talking to the chatbot had a greater impact on their social interactions and relationships with family and friends. Via a mediation analysis, our results suggest a key mechanism at work: the impact of human-AI interaction on human-human social outcomes is mediated by the extent to which people anthropomorphize the AI agent, which is in turn motivated by a desire to socially connect. In a world where the desire to socially connect is on the rise, this finding may be cause for concern.

LGNov 1, 2024
Testing Components of the Attention Schema Theory in Artificial Neural Networks

Kathryn T. Farrell, Kirsten Ziman, Michael S. A. Graziano

Growing evidence suggests that the brain uses an attention schema, or a simplified model of attention, to help control what it attends to. One proposed benefit of this model is to allow agents to model the attention states of other agents, and thus predict and interact with other agents. The effects of an attention schema may be examined in artificial agents. Although attention mechanisms in artificial agents are different from in biological brains, there may be some principles in common. In both cases, select features or representations are emphasized for better performance. Here, using neural networks with transformer attention mechanisms, we asked whether the addition of an attention schema affected the ability of agents to make judgements about and cooperate with each other. First, we found that an agent with an attention schema is better at categorizing the attention states of other agents (higher accuracy). Second, an agent with an attention schema develops a pattern of attention that is easier for other agents to categorize. Third, in a joint task where two agents must predict each other to paint a scene together, adding an attention schema improves performance. Finally, the performance improvements are not caused by a general increase in network complexity. Instead, improvement is specific to tasks involving judging, categorizing, or predicting the attention of other agents. These results support the hypothesis that an attention schema has computational properties beneficial to mutual interpretability and interactive behavior. We speculate that the same principles might pertain to biological attention and attention schemas in people.