CL · May 7
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity
Dasol Choi, Eugenia Kim, Jaewon Noh et al.
Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench, a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and two independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness are not coupled among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.
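As a rough illustration of the kind of statistic behind the reported ASR-NSR trade-off, the sketch below computes a Pearson correlation coefficient over per-model scores. The scores are made-up placeholders, not data from the benchmark, and the abstract does not say how NSR is computed or which correlation estimator was used; Pearson's r is assumed here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-model scores (NOT taken from XL-SafetyBench)
asr = [0.10, 0.25, 0.40, 0.55, 0.70]
nsr = [0.80, 0.70, 0.45, 0.40, 0.15]

r = pearson_r(asr, nsr)
print(f"r = {r:.2f}")  # strongly negative for this toy data
```

A value near -1 means that models with a lower attack success rate also tend to have a lower neutral-safe rate, which is the pattern the abstract interprets as generation failure rather than alignment.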
AI · May 12 · Code
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
Eugenia Kim, Ioana Tanase, Christina Mallon
General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.
NA · Jul 15, 2018
Convergence of a mass-lumped finite element method for the Landau-Lifshitz equation
Eugenia Kim, Jon Wilkening
The dynamics of the magnetic distribution in a ferromagnetic material is governed by the Landau-Lifshitz equation, which is a nonlinear geometric dispersive equation with a nonconvex constraint that requires the magnetization to remain of unit length throughout the domain. In this article, we present a mass-lumped finite element method for the Landau-Lifshitz equation. This method preserves the nonconvex constraint at each node of the finite element mesh, and is energy nonincreasing. We show that the numerical solution of our method for the Landau-Lifshitz equation converges to a weak solution of the Landau-Lifshitz-Gilbert equation using a simple proof technique that cancels out the product of weakly convergent sequences. Numerical tests for both explicit and implicit versions of the method on a unit square with periodic boundary conditions are provided for structured and unstructured meshes.
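For orientation, one common dimensionless form of the equations named above is sketched here. It assumes the effective field reduces to the exchange term $\Delta m$ and that boundary terms vanish (as with periodic boundary conditions); the paper's exact scaling and field terms are not given in the abstract, so treat the constants as assumptions.

```latex
% Landau-Lifshitz (LL) equation with damping parameter \alpha > 0
% (assumed simplification: effective field = exchange field \Delta m)
\begin{align}
  \partial_t m &= -\, m \times \Delta m \;-\; \alpha\, m \times (m \times \Delta m),
  \qquad |m| = 1, \\
  % Equivalent Landau-Lifshitz-Gilbert (LLG) form, valid when |m| = 1
  \partial_t m - \alpha\, m \times \partial_t m &= -(1 + \alpha^2)\, m \times \Delta m, \\
  % Exchange energy and its decay along solutions
  E(m) &= \tfrac{1}{2} \int_\Omega |\nabla m|^2 \, dx, \\
  \frac{dE}{dt} &= -\int_\Omega \Delta m \cdot \partial_t m \, dx
                 = -\alpha \int_\Omega |m \times \Delta m|^2 \, dx \;\le\; 0 .
\end{align}
```

Since the right-hand side of the first equation is orthogonal to $m$, the length $|m|$ is conserved pointwise; this is the nonconvex constraint, and the last line is the continuous counterpart of the "energy nonincreasing" property.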
AI · Jan 26
Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing
Kiana Jafari, Paul Ulrich Nikolaus Rust, Duncan Eddy et al.
Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC $0.087$--$0.295$), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and this divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff's $\alpha = -0.203$), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks (safety-first, engagement-centered, and culturally informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
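To make the reliability figures above more concrete, here is a minimal sketch of one intraclass correlation variant, the one-way random-effects, single-rater ICC(1,1). The abstract does not state which ICC form the authors used, so this choice is an assumption, and the rating matrices are hypothetical. The second toy example shows how structured disagreement can push a reliability coefficient below zero.

```python
def icc_oneway(ratings):
    """ICC(1,1): one-way random-effects, single-rater intraclass correlation.

    ratings: list of rows, one row per rated item, each holding k scores
    (one per rater). Assumes every item was scored by the same number of raters.
    """
    n = len(ratings)        # number of items
    k = len(ratings[0])     # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Mean square between items and mean square within items
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings: 3 items scored by 3 raters
agree = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]      # raters agree on every item
disagree = [[1, 2, 3], [3, 1, 2], [2, 3, 1]]   # raters rank items in conflicting orders

print(icc_oneway(agree))     # 1.0
print(icc_oneway(disagree))  # -0.5
```

In the second matrix every item has the same mean score, so all of the variance comes from raters contradicting one another, and the coefficient is negative.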
CY · Mar 17
From Risk Avoidance to User Empowerment in AI Mental Health Crisis Support
Benjamin Kaveladze, Arka Ghosh, Leah Ajmani et al.
People experiencing mental health crises frequently turn to open-ended generative AI (GenAI) chatbots for support. However, rather than providing immediate assistance, some GenAI chatbots are designed to respond to crisis situations in ways that minimize their developers' liability, primarily through avoidance (e.g., refusing to engage beyond templated referrals to crisis hotlines). Withholding crisis support in these cases may harm users who have no viable alternatives and reduce their motivation to seek further help. At scale, this avoidant design could undermine population mental health. We propose empowerment-oriented design principles for AI crisis support, informed by community helper models. As an initial touchpoint in help-seeking, AI chatbots can act as a supportive bridge to de-escalate crises and connect users to more reliable care. Coordination between AI developers and regulators can enable a better balance of risk mitigation and user empowerment in AI crisis support.
HC · Dec 29, 2025
Seeking Late Night Life Lines: Experiences of Conversational AI Use in Mental Health Crisis
Leah Hope Ajmani, Arka Ghosh, Benjamin Kaveladze et al.
Online, people often recount their experiences turning to conversational AI agents (e.g., ChatGPT, Claude, Copilot) for mental health support -- going so far as to replace their therapists. These anecdotes suggest that AI agents have great potential to offer accessible mental health support. However, it's unclear how to meet this potential in extreme mental health crisis use cases. In this work, we explore the first-person experience of turning to a conversational AI agent in a mental health crisis. From a testimonial survey (n = 53) of lived experiences, we find that people use AI agents to fill the in-between spaces of human support; they turn to AI due to lack of access to mental health professionals or fears of burdening others. At the same time, our interviews with mental health experts (n = 16) suggest that human-human connection is an essential positive action when managing a mental health crisis. Using the stages of change model, our results suggest that a responsible AI crisis intervention is one that increases the user's preparedness to take a positive action while de-escalating any intended negative action. We discuss the implications of designing conversational AI agents as bridges towards human-human connection rather than ends in themselves.
AI · Jan 13, 2025
Lessons From Red Teaming 100 Generative AI Products
Blake Bullwinkel, Amanda Minnich, Shiven Chawla et al. · microsoft-research
In recent years, AI red teaming has emerged as a practice for probing the safety and security of generative AI systems. Due to the nascency of the field, there are many open questions about how red teaming operations should be conducted. Based on our experience red teaming over 100 generative AI products at Microsoft, we present our internal threat model ontology and eight main lessons we have learned:
1. Understand what the system can do and where it is applied
2. You don't have to compute gradients to break an AI system
3. AI red teaming is not safety benchmarking
4. Automation can help cover more of the risk landscape
5. The human element of AI red teaming is crucial
6. Responsible AI harms are pervasive but difficult to measure
7. LLMs amplify existing security risks and introduce new ones
8. The work of securing AI systems will never be complete
By sharing these insights alongside case studies from our operations, we offer practical recommendations aimed at aligning red teaming efforts with real-world risks. We also highlight aspects of AI red teaming that we believe are often misunderstood and discuss open questions for the field to consider.
CR · Jun 29, 2025
A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks
Blake Bullwinkel, Mark Russinovich, Ahmed Salem et al.
Recent research has demonstrated that state-of-the-art LLMs and defenses remain susceptible to multi-turn jailbreak attacks. These attacks require only closed-box model access and are often easy to perform manually, posing a significant threat to the safe and secure deployment of LLM-based systems. We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations and find that safety-aligned LMs often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases. Our analysis indicates that at each turn, Crescendo prompts tend to keep model outputs in a "benign" region of representation space, effectively tricking the model into fulfilling harmful requests. Further, our results help explain why single-turn jailbreak defenses like circuit breakers are generally ineffective against multi-turn attacks, motivating the development of mitigations that address this generalization gap.
NA · Aug 25, 2016
The mimetic finite difference method for the Landau-Lifshitz equation
Eugenia Kim, Konstantin Lipnikov
The Landau-Lifshitz equation describes the dynamics of the magnetization inside ferromagnetic materials. This equation is highly nonlinear and has a non-convex constraint (the magnitude of the magnetization is constant); together these pose interesting challenges in developing numerical methods. We develop and analyze explicit and implicit mimetic finite difference schemes for this equation. These schemes work on general polytopal meshes, which provide enormous flexibility to model magnetic devices with various shapes. A projection on the unit sphere is used to preserve the magnitude of the magnetization. We also provide a proof that the exchange energy decreases under certain conditions. The developed schemes are tested on general meshes that include distorted and randomized meshes. The numerical experiments include a test proposed by the National Institute of Standards and Technology and a test showing formation of domain wall structures in a thin film.
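The projection step mentioned above can be illustrated on a single magnetization vector. The sketch below is not the mimetic finite difference scheme from the paper: it is a plain explicit Euler update of the damped Landau-Lifshitz right-hand side in a fixed, externally supplied field `h`, followed by renormalization onto the unit sphere. The function names, the damping value, and the time step are all illustrative assumptions.

```python
from math import sqrt

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def project_to_sphere(v):
    """Rescale v to unit length (the projection onto the unit sphere)."""
    norm = sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def ll_euler_step(m, h, alpha, dt):
    """One explicit Euler step of dm/dt = -m x h - alpha * m x (m x h),
    followed by projection so that |m| = 1 is restored."""
    m_x_h = cross(m, h)
    m_x_m_x_h = cross(m, m_x_h)
    rhs = tuple(-p - alpha * q for p, q in zip(m_x_h, m_x_m_x_h))
    m_star = tuple(mi + dt * ri for mi, ri in zip(m, rhs))  # leaves the sphere
    return project_to_sphere(m_star)                         # pulled back onto it

# Hypothetical setup: magnetization along x, fixed field along z
m = (1.0, 0.0, 0.0)
h = (0.0, 0.0, 1.0)
for _ in range(1000):
    m = ll_euler_step(m, h, alpha=0.5, dt=0.01)

print(m)  # precesses about z while the damping term aligns it with h
```

Without the projection, each Euler step would slightly lengthen `m`, because the update direction is orthogonal to `m`; renormalizing after every step enforces the constant-magnitude constraint exactly.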