CLJul 18, 2024
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" CycleEmman Haider, Daniel Perez-Becker, Thomas Portet et al. · microsoft-research
Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks. Finally, we include additional red teaming strategies and evaluations that were used to test the safety behavior of Phi-3.5-mini and Phi-3.5-MoE, which were optimized for multilingual capabilities.
LGSep 15, 2022
DEQGAN: Learning the Loss Function for PINNs with Generative Adversarial NetworksBlake Bullwinkel, Dylan Randle, Pavlos Protopapas et al. · harvard
Solutions to differential equations are of significant scientific and engineering relevance. Physics-Informed Neural Networks (PINNs) have emerged as a promising method for solving differential equations, but they lack a theoretical justification for the use of any particular loss function. This work presents Differential Equation GAN (DEQGAN), a novel method for solving differential equations using generative adversarial networks to "learn the loss function" for optimizing the neural network. Presenting results on a suite of twelve ordinary and partial differential equations, including the nonlinear Burgers', Allen-Cahn, Hamilton, and modified Einstein's gravity equations, we show that DEQGAN can obtain multiple orders of magnitude lower mean squared errors than PINNs that use $L_2$, $L_1$, and Huber loss functions. We also show that DEQGAN achieves solution accuracies that are competitive with popular numerical methods. Finally, we present two methods to improve the robustness of DEQGAN to different hyperparameter settings.
LGNov 1, 2022
Transfer Learning with Physics-Informed Neural Networks for Efficient Simulation of Branched FlowsRaphaël Pellegrin, Blake Bullwinkel, Marios Mattheakis et al.
Physics-Informed Neural Networks (PINNs) offer a promising approach to solving differential equations and, more generally, to applying deep learning to problems in the physical sciences. We adopt a recently developed transfer learning approach for PINNs and introduce a multi-head model to efficiently obtain accurate solutions to nonlinear systems of ordinary differential equations with random potentials. In particular, we apply the method to simulate stochastic branched flows, a universal phenomenon in random wave dynamics. Finally, we compare the results achieved by feed forward and GAN-based PINNs on two physically relevant transfer learning tasks and show that our methods provide significant computational speedups in comparison to standard PINNs trained from scratch.
66.9HCMay 2
Beyond the Single Turn: Reframing Refusals as Dynamic Experiences Embedded in the Context of Mental Health Support Interactions with LLMsNingjing Tang, Alice Qian, Qiaosi Wang et al.
Content Warning: This paper contains participant quotes and discussions related to mental health challenges, emotional distress, and suicidal ideation. Large language models (LLMs) are increasingly used for mental health support, yet the model safeguards -- particularly refusals to engage with sensitive content -- remain poorly understood from the perspectives of users and mental health professionals (MHPs) and have been reported to cause real-world harms. This paper presents findings from a sequential mixed-methods study examining how LLM refusals are experienced and interpreted in mental health support interactions. Through surveys (N=53) and in-depth interviews (N=16) with individuals using LLMs for mental health support and MHPs, we reveal that refusals are not isolated, single-turn system behaviors but rather constitute dynamic, multi-phase experiences: pre-refusal expectation formation, refusal triggering and encounter, refusal message framing, resource referral provision, and post-refusal outcomes. We contribute a multi-phase framework for evaluating refusals beyond binary policy compliance accuracy and design recommendations for future refusal mechanisms. These findings suggest that understanding LLM refusals requires moving beyond single-turn interactions toward recognizing them as holistic experiences embedded within users' support-seeking trajectories and the broader LLM design pipeline.
LGFeb 5
GRP-Obliteration: Unaligning LLMs With a Single Unlabeled PromptMark Russinovich, Yanan Cai, Keegan Hines et al.
Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7-20B parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
LGMay 9, 2022
Evaluating the Fairness Impact of Differentially Private Synthetic DataBlake Bullwinkel, Kristen Grabarz, Lily Ke et al.
Differentially private (DP) synthetic data is a promising approach to maximizing the utility of data containing sensitive information. Due to the suppression of underrepresented classes that is often required to achieve privacy, however, it may be in conflict with fairness. We evaluate four DP synthesizers and present empirical results indicating that three of these models frequently degrade fairness outcomes on downstream binary classification tasks. We draw a connection between fairness and the proportion of minority groups present in the generated synthetic data, and find that training synthesizers on data that are pre-processed via a multi-label undersampling method can promote more fair outcomes without degrading accuracy.
CRFeb 3
The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor TriggersBlake Bullwinkel, Giorgio Severi, Keegan Hines et al.
Detecting whether a model has been poisoned is a longstanding problem in AI security. In this work, we present a practical scanner for identifying sleeper agent-style backdoors in causal language models. Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples using memory extraction techniques. Second, poisoned LLMs exhibit distinctive patterns in their output distributions and attention heads when backdoor triggers are present in the input. Guided by these observations, we develop a scalable backdoor scanning methodology that assumes no prior knowledge of the trigger or target behavior and requires only inference operations. Our scanner integrates naturally into broader defensive strategies and does not alter model performance. We show that our method recovers working triggers across multiple backdoor scenarios and a broad range of models and fine-tuning methods.
LGNov 18, 2024
Steering Language Model Refusal with Sparse AutoencodersKyle O'Brien, David Majercak, Xavier Fernandes et al.
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we explore an alternative: steering model activations at inference time via amplifying sparse autoencoder (SAE) features that mediate refusal. This work uncovers a fundamental tension between SAE steering-based safety improvements and general model capabilities. While feature steering successfully improves robustness against both single-turn and challenging multi-turn jailbreak attempts, we discover that this comes at a previously underexplored cost -- systematic degradation of performance across multiple benchmark tasks, even on safe inputs with no apparent connection to refusal behavior. This suggests that features mediating refusal may be more deeply entangled with general language model capabilities than previously understood. Our findings reveal important open questions about the nature of safety-relevant features in language models and the feasibility of isolating them for targeted intervention. While SAE-based steering shows promise as a flexible approach to enhancing language model safety, our results highlight the critical need to understand and address the mechanisms behind these capability tradeoffs before such techniques can be practically deployed.
AIJan 13, 2025
Lessons From Red Teaming 100 Generative AI ProductsBlake Bullwinkel, Amanda Minnich, Shiven Chawla et al. · microsoft-research
In recent years, AI red teaming has emerged as a practice for probing the safety and security of generative AI systems. Due to the nascency of the field, there are many open questions about how red teaming operations should be conducted. Based on our experience red teaming over 100 generative AI products at Microsoft, we present our internal threat model ontology and eight main lessons we have learned: 1. Understand what the system can do and where it is applied 2. You don't have to compute gradients to break an AI system 3. AI red teaming is not safety benchmarking 4. Automation can help cover more of the risk landscape 5. The human element of AI red teaming is crucial 6. Responsible AI harms are pervasive but difficult to measure 7. LLMs amplify existing security risks and introduce new ones 8. The work of securing AI systems will never be complete By sharing these insights alongside case studies from our operations, we offer practical recommendations aimed at aligning red teaming efforts with real world risks. We also highlight aspects of AI red teaming that we believe are often misunderstood and discuss open questions for the field to consider.
CRJun 29, 2025
A Representation Engineering Perspective on the Effectiveness of Multi-Turn JailbreaksBlake Bullwinkel, Mark Russinovich, Ahmed Salem et al.
Recent research has demonstrated that state-of-the-art LLMs and defenses remain susceptible to multi-turn jailbreak attacks. These attacks require only closed-box model access and are often easy to perform manually, posing a significant threat to the safe and secure deployment of LLM-based systems. We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations and find that safety-aligned LMs often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases. Our analysis indicates that at each turn, Crescendo prompts tend to keep model outputs in a "benign" region of representation space, effectively tricking the model into fulfilling harmful requests. Further, our results help explain why single-turn jailbreak defenses like circuit breakers are generally ineffective against multi-turn attacks, motivating the development of mitigations that address this generalization gap.