Regulating the Agency of LLM-based Agents
The work addresses safety concerns relevant to developers and regulators of AI systems, though the contribution is incremental: it builds on existing representation engineering methods and established concepts of agency.
The paper tackles the risks of misalignment and loss of control in LLM-based agents by proposing to measure and control their agency directly. It operationalizes agency along three dimensions (preference rigidity, independent operation, and goal persistence) and introduces regulatory tools such as mandated testing protocols and agency limits.
As increasingly capable large language model (LLM)-based agents are developed, the potential harms caused by misalignment and loss of control grow correspondingly severe. To address these risks, we propose an approach that directly measures and controls the agency of these AI systems. We conceptualize the agency of LLM-based agents as a property independent of intelligence-related measures and consistent with the interdisciplinary literature on the concept of agency. We offer (1) agency as a system property operationalized along the dimensions of preference rigidity, independent operation, and goal persistence, (2) a representation engineering approach to the measurement and control of the agency of an LLM-based agent, and (3) regulatory tools enabled by this approach: mandated testing protocols, domain-specific agency limits, insurance frameworks that price risk based on agency, and agency ceilings to prevent societal-scale risks. We view our approach as a step toward reducing the risks that motivate the ``Scientist AI'' paradigm, while still capturing some of the benefits from limited agentic behavior.
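To make item (2) concrete, the following is a minimal sketch of one common representation engineering recipe: a difference-of-means direction used to score activations and to cap them. The abstract does not specify the authors' actual procedure, so the technique is an assumption, and the function names (`agency_direction`, `agency_score`, `cap_agency`) and the toy activation vectors are hypothetical. In practice the inputs would be hidden states collected from the model on prompts that elicit high or low agency.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def agency_direction(high: Sequence[Sequence[float]],
                     low: Sequence[Sequence[float]]) -> Vector:
    """Unit vector from the low-agency mean to the high-agency mean.

    Assumption: agency is approximated by a single linear direction in
    activation space (difference of means), which is illustrative only.
    """
    mu_high = mean_vector(high)
    mu_low = mean_vector(low)
    diff = [h - l for h, l in zip(mu_high, mu_low)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("high and low activations have identical means")
    return [d / norm for d in diff]


def agency_score(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Measurement: projection of an activation onto the agency direction."""
    return sum(a * d for a, d in zip(activation, direction))


def cap_agency(activation: Sequence[float], direction: Sequence[float],
               ceiling: float) -> Vector:
    """Control: remove only the part of the projection above the ceiling."""
    excess = max(0.0, agency_score(activation, direction) - ceiling)
    return [a - excess * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical values, not data from the paper).
high_agency = [[2.0, 0.0], [4.0, 0.0]]
low_agency = [[0.0, 0.0], [0.0, 0.0]]
direction = agency_direction(high_agency, low_agency)
capped = cap_agency([5.0, 1.0], direction, ceiling=2.0)
```

Under these assumptions, a domain-specific agency limit or an agency ceiling corresponds to the `ceiling` argument: activations whose score is already below it pass through unchanged, while those above it are projected back to the limit.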