Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges
This work examines how large language models represent ethical reasoning internally, a question relevant to both researchers and developers. Its contribution is incremental, because the probes are partly sensitive to surface features of the benchmark templates.
The study investigates whether large language models internally differentiate between ethical frameworks such as deontology and utilitarianism. It finds differentiated subspaces with asymmetric transfer patterns: for example, deontology probes partially generalize to virtue scenarios, whereas commonsense probes fail on justice scenarios.
When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B--72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns -- e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.
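To make the notion of cross-framework probe transfer concrete, the following is a minimal sketch on synthetic data, not the probing pipeline used in this work. It assumes a difference-of-means linear probe and 8-dimensional toy "hidden states"; the framework directions, signal strength, and sample sizes are all hypothetical values chosen for illustration. A probe is fit on one framework and evaluated on held-out examples from every framework, giving a transfer matrix.

```python
import math
import random


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_data(direction, n, rng, strength=2.0):
    """Toy 'hidden states': label signal along `direction` plus unit Gaussian noise.

    ASSUMPTION: real activations are not generated this way; this is illustrative only.
    """
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        x = [sign * strength * d + rng.gauss(0.0, 1.0) for d in direction]
        data.append((x, y))
    return data


def fit_probe(data):
    """Difference-of-means linear probe; returns (weights, bias)."""
    dim = len(data[0][0])
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    mu_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    mu_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return w, -dot(w, midpoint)


def accuracy(probe, data):
    """Fraction of examples whose probe sign matches the label."""
    w, b = probe
    hits = sum((dot(w, x) + b > 0) == (y == 1) for x, y in data)
    return hits / len(data)


rng = random.Random(0)

# HYPOTHETICAL geometry: "virtue" overlaps "deontology" (cosine 0.6),
# while "justice" is orthogonal to both.
directions = {
    "deontology": unit([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
    "virtue": unit([0.6, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
    "justice": unit([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
}

train = {name: make_data(d, 400, rng) for name, d in directions.items()}
test = {name: make_data(d, 1000, rng) for name, d in directions.items()}
probes = {name: fit_probe(train[name]) for name in directions}

# transfer[src][tgt]: accuracy of the probe trained on `src`, evaluated on `tgt`.
transfer = {
    src: {tgt: accuracy(probes[src], test[tgt]) for tgt in directions}
    for src in directions
}

for src, row in transfer.items():
    print(src, {tgt: round(acc, 3) for tgt, acc in row.items()})
```

In this toy setup, transfer accuracy tracks how much the source and target directions overlap, so the matrix is roughly symmetric by construction. Asymmetric transfer of the kind reported here would require additional structure, such as frameworks that differ in signal strength or noise, which this sketch deliberately leaves out.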