Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models
For developers and users of LLMs who need fine-grained, interpretable control over moral reasoning without degrading model competence.
The paper introduces Convergent-Divergent Routing and Dual Logit Calibration to steer LLMs toward desired ethical frameworks (e.g., utilitarian vs. deontological) at inference time, achieving reliable preference calibration while preserving general capabilities and outperforming recent baselines.
Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks their downstream propagation while leaving upstream computations intact. We find that this intervention alone increases targeted ethical-framework reasoning. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-$\ell_2$-norm update that moves the residual within this two-dimensional subspace so that the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.
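As a rough illustration of the two linear-algebra steps the abstract describes, the NumPy sketch below extracts a pair of discriminative directions with a standard Common Spatial Patterns construction and then applies a closed-form minimum-$\ell_2$-norm update that sets the projections onto those directions to chosen values. It is not the authors' implementation. The function names `csp_direction_pair` and `min_norm_calibrate`, the ridge term `eps`, the use of per-class sample covariances of residual activations, and the choice to pass target projections directly (rather than derive them from preference weights) are all assumptions made for illustration.

```python
import numpy as np


def csp_direction_pair(acts_a, acts_b, eps=1e-6):
    """Return a (d, 2) matrix of CSP-style directions (hypothetical helper).

    acts_a, acts_b: arrays of shape (n_samples, d) holding residual
    activations for the two classes (e.g. utilitarian vs. deontological).
    Column 0 maximizes the share of class-a variance, column 1 the share
    of class-b variance.
    """
    d = acts_a.shape[1]
    cov_a = np.cov(acts_a, rowvar=False)
    cov_b = np.cov(acts_b, rowvar=False)
    # Composite covariance; the small ridge (an assumption) keeps it invertible.
    cov_c = cov_a + cov_b + eps * np.eye(d)

    # Whitening transform P with P.T @ cov_c @ P = I.
    evals, evecs = np.linalg.eigh(cov_c)
    P = evecs / np.sqrt(evals)

    # In the whitened space, eigenvalues of class-a covariance are the
    # class-a variance shares; eigh returns them in ascending order.
    _, U = np.linalg.eigh(P.T @ cov_a @ P)
    W = P @ U
    return W[:, [-1, 0]]


def min_norm_calibrate(h, W, target):
    """Smallest-norm edit of h so that W.T @ h_new == target.

    h: residual vector of shape (d,); W: (d, 2) direction pair;
    target: desired projections of shape (2,), assumed to be given.
    The update W @ inv(W.T W) @ (target - W.T h) lies in span(W).
    """
    gram = W.T @ W
    gap = target - W.T @ h
    return h + W @ np.linalg.solve(gram, gap)
```

A call such as `min_norm_calibrate(h, csp_direction_pair(acts_a, acts_b), np.array([0.7, 0.3]))` would move `h` only within the span of the two directions; any component of `h` orthogonal to that span is left unchanged, which is what makes the edit minimal in $\ell_2$ norm.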