CY Jul 6, 2024
AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua, Yun Li, Shiyi Yang et al.
Large Language Models (LLMs) such as ChatGPT that exhibit generative AI capabilities are facing accelerated adoption and innovation. The increased presence of Generative AI (GAI) inevitably raises concerns about the risks and safety associated with these models. This article provides an up-to-date survey of recent trends in AI safety research of GAI-LLMs from a computer scientist's perspective: specific and technical. In this survey, we explore the background and motivation for the identified harms and risks in the context of LLMs being generative language models; our survey differentiates itself by emphasising the need for unified theories of the distinct safety challenges in the research development and applications of LLMs. We start our discussion with a concise introduction to the workings of LLMs, supported by relevant literature. Then we discuss earlier research that has pointed out the fundamental constraints of generative models, or lack of understanding thereof (e.g., performance and safety trade-offs as LLMs scale in number of parameters). We provide sufficient coverage of LLM alignment, delving into various approaches, contending methods, and present challenges associated with aligning LLMs with human preferences. By highlighting the gaps in the literature and possible implementation oversights, we aim to create a comprehensive analysis that provides insights for addressing AI safety in LLMs and encourages the development of aligned and secure models. We conclude our survey by discussing future directions of LLMs for AI safety, offering insights into ongoing research in this critical area.
CL Apr 4, 2025
Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents
Jaymari Chua, Chen Wang, Lina Yao
Generalizable alignment is a core challenge for deploying Large Language Models (LLMs) safely in real-world NLP applications. Current alignment methods, including Reinforcement Learning from Human Feedback (RLHF), often fail to guarantee constraint satisfaction outside their training distribution due to their reliance on implicit, post-hoc preferences. Inspired by a paradigm shift to first curate data before tuning, we introduce a new framework for safe language alignment that learns natural language constraints from positive and negative demonstrations as a primary step. By inferring both a task-specific reward function and latent constraint functions, our approach fosters adaptation to novel safety requirements and robust generalization under domain shifts and adversarial inputs. We formalize the framework within a Constrained Markov Decision Process (CMDP) and validate it via a text-based navigation environment, demonstrating safe adaptation to changing danger zones. Our experiments show fewer violations upon domain shift when following a safe navigation path, and we achieve zero violations by applying learned constraints to a distilled BERT model as a fine-tuning technique. This work offers a promising path toward building safety-critical and more generalizable LLMs for practical NLP settings.
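For readers unfamiliar with the CMDP formalism mentioned in the abstract above, the standard textbook objective can be sketched as follows. The notation is an assumption for illustration and is not taken from the paper: $\pi$ is the policy, $r$ the reward function, $c_i$ the $i$-th constraint cost function, $d_i$ its budget, and $\gamma$ the discount factor. In the abstract's framing, both $r$ and the $c_i$ would be inferred from demonstrations rather than given.

```latex
% Generic CMDP objective (standard formulation; symbols are illustrative,
% not the paper's own notation).
\begin{aligned}
\max_{\pi} \quad & \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right] \\
\text{subject to} \quad & \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t) \right] \le d_i,
\qquad i = 1, \dots, m .
\end{aligned}
```

Under this reading, a "violation" in the navigation experiments would correspond to incurring cost $c_i > 0$, for example by entering a danger zone; this mapping is an interpretation, not a statement from the abstract.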
HC Jan 31, 2025
Superhuman Game AI Disclosure: Expertise and Context Moderate Effects on Trust and Fairness
Jaymari Chua, Chen Wang, Lina Yao
As artificial intelligence surpasses human performance in select tasks, disclosing superhuman capabilities poses distinct challenges for fairness, accountability, and trust. However, the impact of such disclosures on diverse user attitudes and behaviors remains unclear, particularly concerning potential negative reactions like discouragement or overreliance. This paper investigates these effects by utilizing Persona Cards: a validated, standardized set of synthetic personas designed to simulate diverse user reactions and fairness perspectives. We conducted an ethics board-approved study (N=32), utilizing these personas to investigate how capability disclosure influenced behaviors with a superhuman game AI in competitive StarCraft II scenarios. Our results reveal that transparency is double-edged: while disclosure could alleviate suspicion, it also provoked frustration and strategic defeatism among novices in cooperative scenarios, as well as overreliance in competitive contexts. Experienced and competitive players interpreted disclosure as confirmation of an unbeatable opponent, shifting to suboptimal goals. We release the Persona Cards Dataset, including profiles, prompts, interaction logs, and protocols, to foster reproducible research into human-aligned AI design. This work demonstrates that transparency is not a cure-all; successfully leveraging disclosure to enhance trust and accountability requires careful tailoring to user characteristics, domain norms, and specific fairness objectives.