CL · Sep 25, 2025

Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes

Guangliang Liu, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Marie Johnson

arXiv:2509.21456v3 · h-index: 5

Originality: Synthesis-oriented

AI Analysis

This work addresses the degraded performance of language models that accompanies moral alignment for gender stereotype mitigation. The contribution is incremental, as it critiques existing methods.

The paper investigates the performance trade-off in moral alignment for gender stereotype mitigation. It finds that current fairness objectives fail to strike an effective balance: downstream task performance degrades as overall forgetting increases, and general solutions for alleviating forgetting are ineffective.

Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning on curated datasets. Gender stereotype mitigation is a representative task within the broader application of moral alignment. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget only stereotypical knowledge through carefully designed fairness objectives, while preserving their language modeling capability (that is, limiting overall forgetting). In this short paper, we investigate whether the performance trade-off can be achieved through the lens of forgetting and the fairness objective. Our analysis shows that the large datasets needed for satisfactory fairness highlight the limitations of current fairness objectives in achieving an effective trade-off: (1) downstream task performance is strongly correlated with overall forgetting; (2) selective forgetting reduces stereotypes, but overall forgetting increases; and (3) general solutions for alleviating forgetting are ineffective at reducing overall forgetting and fail to improve downstream task performance.
