Bert Baumgaertner

2papers

2 Papers

12.1CLMay 20
Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

Lawhori Chakrabarti, Jennifer Johnson-Leung, Bert Baumgaertner et al.

Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.

78.7CYMar 10
Ethical Implications of Training Deceptive AI

Jason Starace, Bert Baumgaertner, Terence Soule

Deceptive behavior in AI systems is no longer theoretical: large language models strategically mislead without producing false statements, maintain deceptive strategies through safety training, and coordinate deception in multi-agent settings. While the European Union's AI Act prohibits deployment of deceptive AI systems, it explicitly exempts research and development, creating a necessary but unstructured space in which no established framework governs how deception research should be conducted or how risk should scale with capability. This paper proposes a Deception Research Levels (DRL) framework, a classification system for deceptive algorithm research modeled on the Biosafety Level system used in biological research. The DRL framework classifies research by risk profile rather than researcher intent, assessing deceptive mechanisms across five dimensions grounded in the AI4People ethical framework: Pillar Implication, Severity, Reversibility, Scale, and Vulnerability. Classification follows a ``highest dimension wins'' approach, assigning one of four risk levels with cumulative safeguards ranging from standard documentation at DRL-1 to regulatory notification and third-party security audits at DRL-4. A dual-development mandate at DRL-3 and above requires that detection and mitigation methods be developed alongside any deceptive capability. We apply the framework to eight case studies spanning all four levels and demonstrate that ecological validity of the deceptive mechanism emerges as a consistent, non-independent indicator of classification level. The DRL framework is intended to fill the governance gap between regulated deployment and unstructured research, supporting both beneficial applications and defensive research under conditions where safeguards are proportional to the potential for harm.