Kristen Johnson

CL
h-index 16
5 papers
63 citations
Novelty 42%
AI Score 42

5 Papers

CL Mar 13
Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models

Bocheng Chen, Xi Chen, Han Zi et al.

Moral sensitivity is fundamental to human moral competence, as it guides individuals in regulating everyday behavior. Although many approaches seek to align large language models (LLMs) with human moral values, making them morally sensitive has proven extremely challenging. In this paper, we take a step toward answering the question: how can we enhance moral sensitivity in LLMs? Specifically, we propose two pragmatic inference methods that help LLMs diagnose morally benign and hazardous input and correct moral errors, thereby enhancing LLMs' moral sensitivity. A central strength of our pragmatic inference methods is their unified perspective: instead of modeling moral discourses across semantically diverse and complex surface forms, they offer a principled perspective for designing pragmatic inference procedures grounded in their inferential loads. Empirical evidence demonstrates that our pragmatic methods can enhance moral sensitivity in LLMs and achieve strong performance on representative morality-relevant benchmarks.

CL Feb 3, 2024
A Survey to Recent Progress Towards Understanding In-Context Learning

Haitao Mao, Guangliang Liu, Yao Ma et al.

In-Context Learning (ICL) empowers Large Language Models (LLMs) with the ability to learn from a few examples provided in the prompt, enabling downstream generalization without requiring gradient updates. Despite encouraging empirical success, the underlying mechanism of ICL remains unclear. Existing research offers divergent viewpoints and relies on intuition-driven, ad hoc technical solutions to interpret ICL. In this paper, we leverage a data generation perspective to reinterpret recent efforts from a systematic angle, demonstrating the potential broader usage of these popular technical solutions. For a conceptual definition, we rigorously adopt the terms skill recognition and skill learning. Skill recognition selects a data generation function already learned during pre-training, while skill learning acquires new data generation functions from in-context data. Furthermore, we provide insights into the strengths and weaknesses of both abilities, emphasizing their commonalities through the perspective of data generation. This analysis suggests potential directions for future research.
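As a rough illustration of the setup this abstract describes (a few labeled examples placed in the prompt, no gradient updates), the sketch below assembles a few-shot prompt. The function name `build_icl_prompt`, the `Input:`/`Output:` template, and the sentiment examples are assumptions made here for illustration; they are not taken from the survey.

```python
from typing import List, Tuple


def build_icl_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Join labeled demonstrations and one unlabeled query into a single prompt.

    The template is a hypothetical choice; any consistent format would do.
    """
    # Each demonstration shows the model one input together with its label.
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    # The query repeats the pattern but leaves the label for the model to fill in.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical sentiment demonstrations, used only to show the prompt shape.
demos = [("great movie", "positive"), ("boring plot", "negative")]
prompt = build_icl_prompt(demos, "loved the acting")
print(prompt)
```

The model's weights never change here: whatever task behavior emerges comes only from the demonstrations in the prompt text.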

CL Sep 28, 2025
Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics

Guangliang Liu, Xi Chen, Bocheng Chen et al.

Moral reasoning has emerged as a promising research direction for Large Language Models (LLMs), yet achieving generalization remains a central challenge. From a linguistic standpoint, this difficulty arises because LLMs are adept at capturing distributional semantics, which differs fundamentally from morals, which operate at the pragmatic level. This paper investigates how LLMs can achieve generalized moral reasoning despite their reliance on distributional semantics. We propose pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives. Experimental results demonstrate that our approach significantly enhances LLMs' generalization in moral reasoning, providing a foundation for future research grounded in moral foundations theory.

CL Jun 6, 2024
Towards Understanding Task-agnostic Debiasing Through the Lenses of Intrinsic Bias and Forgetfulness

Guangliang Liu, Milad Afshari, Xitong Zhang et al.

While task-agnostic debiasing provides notable generalizability and reduced reliance on downstream data, its impact on language modeling ability and the risk of relearning social biases from downstream task-specific data remain the two most significant challenges when debiasing Pretrained Language Models (PLMs). The impact on language modeling ability can be alleviated given a high-quality and long-contextualized debiasing corpus, but the specifics of relearning biases are still poorly understood. We empirically ascertain that the effectiveness of task-agnostic debiasing hinges on the quantitative bias level of both the task-specific data used for downstream applications and the debiased model. We empirically show that, in most practical cases, the lower bound of the bias level of the downstream fine-tuned model can be approximated by the bias level of the debiased model. To gain a more in-depth understanding of how the parameters of PLMs change during fine-tuning due to the forgetting issue of PLMs, we propose a novel framework which can Propagate Socially-fair Debiasing to Downstream Fine-tuning, ProSocialTuning. Our proposed framework can push the fine-tuned model to approach the bias lower bound during downstream fine-tuning, indicating that the ineffectiveness of debiasing can be alleviated by overcoming the forgetting issue through regularizing successfully debiased attention heads based on the PLMs' bias levels from the stages of pretraining and debiasing.

CL Jun 4, 2024
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

Guangliang Liu, Haitao Mao, Bochuan Cao et al.

Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only the task's goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. In this paper, we unveil that intrinsic self-correction can be progressively improved, allowing it to approach a converged state. Our findings are verified in: (1) the scenario of multi-round question answering, by comprehensively demonstrating that intrinsic self-correction can progressively introduce performance gains through iterative interactions, ultimately converging to stable performance; and (2) the context of intrinsic self-correction for enhanced morality, in which we provide empirical evidence that iteratively applying instructions reduces model uncertainty towards convergence, which then leads to convergence of both the calibration error and self-correction performance, ultimately resulting in a stable state of intrinsic self-correction. Furthermore, we introduce a mathematical formulation and a simulation task indicating that the latent concepts activated by self-correction instructions drive the reduction of model uncertainty. Based on our experimental results and analysis of the convergence of intrinsic self-correction, we reveal its underlying mechanism: consistent injected instructions reduce model uncertainty which yields converged, improved performance.
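The convergence behavior described in this abstract (repeated instructions lowering uncertainty until performance stabilizes) can be pictured with a toy loop. This is only a sketch under strong assumptions: `toy_step` is a hypothetical stand-in that sharpens an answer distribution and is not the paper's mathematical formulation or simulation task; entropy is used here as one possible uncertainty measure; and the starting distribution is made up.

```python
import math
from typing import Callable, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def toy_step(probs: List[float], temperature: float = 0.7) -> List[float]:
    """Hypothetical stand-in for one self-correction round.

    Raising probabilities to a power above 1 and renormalizing concentrates
    mass on the most likely answer, which mimics reduced uncertainty.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def self_correct(
    step: Callable[[List[float]], List[float]],
    probs: List[float],
    max_rounds: int = 10,
    tol: float = 1e-3,
) -> Tuple[List[float], List[float]]:
    """Apply `step` repeatedly and record the entropy after every round."""
    history = [entropy(probs)]
    for _ in range(max_rounds):
        probs = step(probs)
        history.append(entropy(probs))
        # Stop once the uncertainty has effectively stopped changing.
        if abs(history[-2] - history[-1]) < tol:
            break
    return probs, history


# Made-up initial distribution over three candidate answers.
final, hist = self_correct(toy_step, [0.4, 0.35, 0.25])
print([round(h, 3) for h in hist])
```

In a real experiment, the step would be an LLM call with the self-correction instruction appended, and the uncertainty would be estimated from the model's outputs rather than from a known distribution.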