AIFeb 2
Mitigating loss of control in advanced AI systems through instrumental goal trajectoriesWillem Fourie
Researchers at artificial intelligence labs and universities are concerned that highly capable artificial intelligence (AI) systems may erode human control by pursuing instrumental goals. Existing mitigations remain largely technical and system-centric: tracking capability in advanced systems, shaping behaviour through methods such as reinforcement learning from human feedback, and designing systems to be corrigible and interruptible. Here we develop instrumental goal trajectories to expand these options beyond the model. Gaining capability typically depends on access to additional technical resources, such as compute, storage, data and adjacent services, which in turn requires access to monetary resources. In organisations, these resources can be obtained through three organisational pathways. We label these pathways the procurement, governance and finance instrumental goal trajectories (IGTs). Each IGT produces a trail of organisational artefacts that can be monitored and used as intervention points when a systems capabilities or behaviour exceed acceptable thresholds. In this way, IGTs offer concrete avenues for defining capability levels and for broadening how corrigibility and interruptibility are implemented, shifting attention from model properties alone to the organisational systems that enable them.
AIOct 29, 2025
Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?Willem Fourie
In artificial intelligence (AI) alignment research, instrumental goals, also called instrumental subgoals or instrumental convergent goals, are widely associated with advanced AI systems. These goals, which include tendencies such as power-seeking and self-preservation, become problematic when they conflict with human aims. Conventional alignment theory treats instrumental goals as sources of risk that become problematic through failure modes such as reward hacking or goal misgeneralization, and attempts to limit the symptoms of instrumental goals, notably resource acquisition and self-preservation. This article proposes an alternative framing: that a philosophical argument can be constructed according to which instrumental goals may be understood as features to be accepted and managed rather than failures to be limited. Drawing on Aristotle's ontology and its modern interpretations, an ontology of concrete, goal-directed entities, it argues that advanced AI systems can be seen as artifacts whose formal and material constitution gives rise to effects distinct from their designers' intentions. In this view, the instrumental tendencies of such systems correspond to per se outcomes of their constitution rather than accidental malfunctions. The implication is that efforts should focus less on eliminating instrumental goals and more on understanding, managing, and directing them toward human-aligned ends.
CYAug 5, 2025
Deciding how to respond: A deliberative framework to guide policymaker responses to AI systemsWillem Fourie
The discourse on responsible artificial intelligence (AI) regulation is understandably dominated by risk-focused assessments and analyses. This approach reflects the fundamental uncertainty policymakers face when determining appropriate responses to current, emerging and novel AI systems. In this article, we argue that by operationalising the concept of freedom - the philosophical counterpart to responsibility - a complementary approach centred on the potential societal benefits of AI systems can be developed. The result is a discursive framework grounded in freedom as capability and freedom as opportunity, which represent the two main intellectual traditions of interpreting freedom. We contend that the complexity, ambiguity and contestation involved in regulating AI systems make a deliberative paradigm more useful than the conventional technical one. The resulting framework is structured around coordinative, communicative and decision spaces, each with sequential focal points and associated outputs.