CV Feb 1

Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

Xianhui Zhang, Chengyu Xie, Linxia Zhu, Yonghui Yang, Weixiang Zhao, Zifeng Cheng, Cong Wang, Fei Shen, Tat-Seng Chua

arXiv:2602.01283v14.0 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the weaker safety alignment of LLMs in non-high-resource languages and offers a targeted way to improve cross-lingual safety alignment. It is incremental in that it builds on existing neuron analysis methods.

The paper tackles imbalanced multilingual safety in LLMs by identifying cross-lingual shared safety neurons (SS-Neurons) and using them to transfer safety capabilities from high-resource to non-high-resource languages. The resulting neuron-oriented training strategy is reported to outperform state-of-the-art methods, significantly enhancing safety for non-high-resource languages while maintaining general capabilities.

Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model's general capabilities. The code and dataset will be available at https://github.com/1518630367/SS-Neuron-Expansion.
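
The pipeline described in the abstract (per-language MS-Neurons, their cross-lingual overlap, then training restricted to that overlap) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the `(layer, index)` neuron addressing, the top-k selection by a precomputed score, the reading of "shared" as a strict set intersection across all languages, and the plain gradient step. All function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical addressing scheme: (layer index, neuron index).
Neuron = Tuple[int, int]


def top_k_neurons(scores: Dict[Neuron, float], k: int) -> Set[Neuron]:
    """Pick the k neurons with the highest safety-relevance score.

    How the scores are computed is left open here; the abstract does not say.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def shared_safety_neurons(ms_hr: Set[Neuron], ms_nhr: List[Set[Neuron]]) -> Set[Neuron]:
    """Assumed definition: HR safety neurons that also appear in every NHR set."""
    shared = set(ms_hr)
    for ms in ms_nhr:
        shared &= ms
    return shared


def masked_update(
    weights: Dict[Neuron, float],
    grads: Dict[Neuron, float],
    trainable: Set[Neuron],
    lr: float,
) -> Dict[Neuron, float]:
    """Apply a gradient step only to the selected neurons; freeze the rest."""
    return {
        n: (w - lr * grads[n]) if n in trainable else w
        for n, w in weights.items()
    }


# Toy usage: one HR language and one NHR language.
scores_hr = {(0, 0): 0.9, (0, 1): 0.1, (1, 2): 0.8, (1, 3): 0.2}
ms_hr = top_k_neurons(scores_hr, k=2)
ms_nhr = [{(0, 0), (1, 3)}]
ss = shared_safety_neurons(ms_hr, ms_nhr)

weights = {(0, 0): 1.0, (0, 1): 1.0}
grads = {(0, 0): 0.5, (0, 1): 0.5}
updated = masked_update(weights, grads, trainable=ss, lr=0.1)
```

In this toy setup only the neuron present in both languages' sets receives an update, which mirrors the idea of fine-tuning a small shared subset while leaving the remaining parameters untouched.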
