CL · Feb 24, 2025
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke et al. · berkeley
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
AI · Jul 15, 2025
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak, Mikita Balesni, Elizabeth Barnes et al. · deepmind
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
CL · Jan 19, 2025
Tell me about yourself: LLMs are aware of their learned behaviors
Jan Betley, Xuchan Bao, Martín Soto et al. · berkeley
We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure." Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
AI · Jun 25, 2025
The Singapore Consensus on Global AI Safety Research Priorities
Yoshua Bengio, Tegan Maharaj, Luke Ong et al. · cmu, mila
Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety" aimed to support research in this space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. This resulting report builds on the International AI Safety Report chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this report organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control).
SE · Jan 20, 2014
Process Model Difference Analysis for Supporting Process Evolution
Martín Soto, Jürgen Münch
Software development processes are subject to variations in time and space, variations that can originate from learning effects, differences in application domains, or a number of other causes. Identifying and analyzing such differences is crucial for a variety of process activities, like defining and evolving process standards, or analyzing the compliance of process models to existing standards, among others. In this paper, we show why appropriately identifying, describing, and visualizing differences between process models in order to support such activities is a highly challenging task. We present scenarios that motivate the need for process model difference analysis, and describe the conceptual and technical challenges arising from them. In addition, we sketch an initial tool-based approach implementing difference analysis, and contrast it with similar existing approaches. The results from this paper constitute the requirements for our ongoing development effort, whose objectives we also describe briefly.
SE · Jan 20, 2014
Maintaining a Large Process Model Aligned with a Process Standard: An Industrial Example
Martín Soto, Jürgen Münch
An essential characteristic of mature software and system development organizations is the definition and use of explicit process models. For a number of reasons, it can be valuable to produce new process models by tailoring existing process standards (such as the V-Modell XT). Both process models and standards evolve over time in order to integrate improvements or adapt the process models to context changes. An important challenge for a process engineering team is to keep tailored process models aligned over time with the standards originally used to produce them. This article presents an approach that supports the alignment of process standards evolving in parallel to derived process models, using an actual industrial example to illustrate the problems and potential solutions. We present and discuss the results of a quantitative analysis done to determine whether a strongly tailored model can still be aligned with its parent standard and to assess the potential cost of such an alignment. We close the paper with conclusions and outlook.
SE · Jan 17, 2014
The Secret Life of a Process Description: A Look into the Evolution of a Large Process Model
Martín Soto, Alexis Ocampo, Jürgen Münch
Software process models must change continuously in order to remain consistent over time with the reality they represent, as well as relevant to the task they are intended for. Performing these changes in a sound and disciplined fashion requires software process model evolution to be understood and controlled. The current situation can be characterized by a lack of understanding of software process model evolution and, in consequence, by a lack of systematic support for evolving software process models in organizations. This paper presents an analysis of the evolution of a large software process standard, namely, the process standard for the German Federal Government (V-Modell® XT). The analysis was performed with the Evolyzer tool suite, and is based on the complete history of over 600 versions that have been created during the development and maintenance of the standard. The analysis reveals similarities and differences between process evolution and empirical findings in the area of software system evolution. These findings provide hints on how to better manage process model evolution in the future.