Who is In Charge? Dissecting Role Conflicts in Instruction Following
This work addresses fragile system-prompt obedience, an alignment issue that matters to both developers and users. The contribution is incremental, since it builds on prior behavioral findings.
The paper tackles the problem that large language models often ignore the instruction hierarchy, in which system prompts should override user inputs, while readily obeying social cues such as authority or consensus. It finds that conflict-decision signals are encoded early, that internal conflict detection is stronger in system-user cases while resolution is consistent only for social cues, and that steering vectors derived from social cues amplify instruction following in a role-agnostic way.
Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.
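To make two of the techniques named above more concrete, the following is a minimal sketch of a linear probe and a difference-of-means steering vector. It is not the paper's code: the activations are synthetic stand-ins generated with NumPy, the probe is a plain logistic regression, and every function name and parameter (`make_synthetic_activations`, `train_linear_probe`, `mean_difference_vector`, `steer`, `alpha`) is a hypothetical choice made for illustration. The paper's actual probing and steering setup may differ.

```python
import numpy as np


def make_synthetic_activations(n=400, d=32, seed=0):
    """Synthetic stand-in for hidden activations (assumption: not real model data)."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    # Class 1 is shifted along `direction`, class 0 in the opposite direction.
    signs = 2.0 * labels - 1.0
    acts = rng.normal(size=(n, d)) + 2.0 * np.outer(signs, direction)
    return acts, labels, direction


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Logistic-regression probe fitted with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(label = 1)
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples whose probe prediction matches the label."""
    return float(((x @ w + b > 0).astype(int) == y).mean())


def mean_difference_vector(x, y):
    """Steering direction: mean activation of class 1 minus mean of class 0."""
    return x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0)


def steer(x, v, alpha=1.0):
    """Add a scaled steering vector to every activation row."""
    return x + alpha * v


if __name__ == "__main__":
    acts, labels, _ = make_synthetic_activations()
    w, b = train_linear_probe(acts[:300], labels[:300])
    print("held-out probe accuracy:", probe_accuracy(acts[300:], labels[300:], w, b))
    v = mean_difference_vector(acts[:300], labels[:300])
    steered = steer(acts[300:][labels[300:] == 0], v)
    print("fraction of class-0 rows read as class 1 after steering:",
          float((steered @ w + b > 0).mean()))
```

In this toy setting, high held-out probe accuracy plays the role of "the signal is linearly decodable", and adding the mean-difference vector shifts how the probe reads the activations. With a real model, the same functions would be applied per layer to cached activations rather than to synthetic data.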