CLCY · Jul 17, 2024

The Better Angels of Machine Personality: How Personality Relates to LLM Safety

arXiv:2407.12344v1 · 21 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This study links LLM personality traits to safety metrics, giving AI developers and users a new angle on improving LLM safety. The perspective is novel, though the advance itself is incremental.

The paper investigates how personality traits relate to safety abilities in Large Language Models (LLMs). It finds that safety alignment generally increases the Extraversion, Sensing, and Judging traits, and that editing personality can improve safety performance: inducing a shift from ISTJ to ISTP yields relative improvements of approximately 43% in privacy and 10% in fairness.

Personality psychologists have analyzed the relationship between personality and safety behaviors in human society. Although Large Language Models (LLMs) demonstrate personality traits, the relationship between those traits and LLM safety abilities remains a mystery. In this paper, we discover that LLMs' personality traits, measured with the reliable MBTI-M scale, are closely related to their safety abilities, i.e., toxicity, privacy, and fairness. Moreover, safety alignment generally increases the Extraversion, Sensing, and Judging traits of various LLMs. Based on these findings, we can edit LLMs' personality traits to improve their safety performance; e.g., inducing a personality change from ISTJ to ISTP resulted in relative improvements of approximately 43% and 10% in privacy and fairness performance, respectively. Additionally, we find that LLMs with different personality traits are differentially susceptible to jailbreak. This study pioneers the investigation of LLM safety from a personality perspective, providing new insights into LLM safety enhancement.
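
The "relative improvement" figures quoted above compare a safety score after personality editing with the score before it. The minimal sketch below shows that arithmetic. It assumes a higher-is-better score; the function name `relative_improvement` and the example scores are hypothetical and are not taken from the paper, which does not give the underlying baseline values in this abstract.

```python
def relative_improvement(baseline: float, edited: float) -> float:
    """Return the change from `baseline` to `edited` as a fraction of `baseline`.

    Assumes a higher-is-better score (an assumption, not stated in the source).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (edited - baseline) / baseline


# Hypothetical scores chosen only to illustrate the arithmetic;
# they are NOT results reported in the paper.
privacy_before = 50.0  # e.g. score under the ISTJ persona (made up)
privacy_after = 71.5   # e.g. score under the ISTP persona (made up)

gain = relative_improvement(privacy_before, privacy_after)
print(f"relative improvement: {gain:.0%}")
```

A relative figure depends on the baseline: the same absolute gain looks larger when the starting score is low, so the 43% and 10% figures are not directly comparable without the underlying scores.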

Code Implementations: 1 repo