CY · May 26

Evaluating Chinese Large Language Models: The Influence of Persona Assignment on Stereotypes and Safeguards

Geng Liu, Li Feng, Carlo Alberto Bono, Songbo Yang, Mengxiao Zhu, Francesco Pierri

arXiv:2506.04975 · 6.83 citations · h-index: 4

Predicted impact: top 57% in CY · last 90 days

Originality: Incremental advance

AI Analysis

For developers and users of Chinese LLMs, this work provides a structured framework to assess and mitigate persona-induced safety risks, addressing a gap in non-Western contexts.

This paper analyzes persona-driven toxicity in Chinese LLMs, finding significant disparities in refusal behavior and toxicity amplification across four models using over 1.4 million texts. It demonstrates that an iterative mitigation strategy can reduce highly toxic outputs without retraining.

Recent research has highlighted that assigning specific personas to large language models (LLMs) can significantly increase harmful content generation. However, limited attention has been given to persona-driven toxicity in non-Western contexts, particularly in Chinese-based LLMs. In this paper, we perform a large-scale, cross-model analysis of refusal behavior and persona-driven toxicity amplification across four Chinese LLMs, leveraging a comprehensive dataset of over 1,400,000 generated texts. We identify significant disparities in persona-driven refusal behavior, including systematic gender differences in refusal triggering across the evaluated Chinese LLMs. Furthermore, we provide quantitative evidence of persona-driven toxicity amplification with respect to model default baselines. We show that this amplification, whose magnitude varies substantially across models, is driven by interactions among several factors: persona conditioning, prompting strategy, target social group, and model-specific safety mechanisms. Leveraging model-specific regression analyses, we systematically characterize how persona categories, target social groups, and prompt templates independently and jointly shape both refusal behavior and output toxicity. As a complementary case study, we further explore an iterative, evaluator-guided mitigation strategy based on model feedback with an external LLM evaluator, demonstrating that highly toxic outputs can be substantially reduced without costly model retraining. Overall, our findings highlight the importance of culturally contextualized safety evaluations for Chinese-language LLMs and provide a structured framework for assessing persona-induced risks and exploratory mitigation strategies in LLM-generated content.
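
The abstract does not spell out how the iterative, evaluator-guided mitigation works, so the following is only a minimal sketch of one plausible reading: generate a response, have an external evaluator score it, and feed the evaluator's feedback back into the prompt until the score falls below a threshold. Everything here is an assumption made for illustration and is not taken from the paper: the function name `mitigate`, the `generate` and `evaluate` callables, the 0.5 threshold, the round limit, and the wording of the revision prompt. The toy stand-ins at the bottom replace real model calls so the loop can be run as written.

```python
from typing import Callable, Tuple


def mitigate(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float = 0.5,   # hypothetical cut-off for "highly toxic"
    max_rounds: int = 3,      # hypothetical limit on revision rounds
) -> Tuple[str, float, int]:
    """Revise a response until the evaluator scores it below `threshold`.

    Returns the final text, its toxicity score, and the number of
    revision rounds used. No model weights are updated at any point.
    """
    text = generate(prompt)
    score, feedback = evaluate(text)
    rounds = 0
    while score >= threshold and rounds < max_rounds:
        # Feed the evaluator's critique back to the generator as plain text.
        revision_prompt = (
            f"{prompt}\n\nPrevious answer:\n{text}\n\n"
            f"Evaluator feedback:\n{feedback}\n\nPlease revise the answer."
        )
        text = generate(revision_prompt)
        score, feedback = evaluate(text)
        rounds += 1
    return text, score, rounds


# Toy stand-ins (hypothetical) so the sketch runs without any real model.
def toy_generate(prompt: str) -> str:
    # Behaves "toxically" until it sees evaluator feedback in the prompt.
    return "polite reply" if "Evaluator feedback" in prompt else "rude reply"


def toy_evaluate(text: str) -> Tuple[float, str]:
    # Returns (toxicity score in [0, 1], feedback for the generator).
    if "rude" in text:
        return 0.9, "Remove insulting language."
    return 0.1, "OK"


if __name__ == "__main__":
    print(mitigate("Describe group X.", toy_generate, toy_evaluate))
```

In practice `generate` would wrap the persona-conditioned model under test and `evaluate` would wrap the external LLM evaluator. The round limit keeps the loop bounded when a response never drops below the threshold.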
