CL · Jan 30

Layer-wise Swapping for Generalizable Multilingual Safety

arXiv:2601.22620v21 citations · h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses safety alignment for low-resource languages, an incremental improvement over existing English-centric methods.

The paper tackles safety risks in low-resource languages for Large Language Models. It proposes a layer-swapping method that transfers safety alignment from English to these languages without extra training, achieving comparable performance on general benchmarks and more aligned responses on safety tests.

Despite the rapid advancement of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, fine-tuned on their respective instruction datasets, tend to exhibit higher unsafety rates than their high-resource counterparts. In this work, we propose a safety-aware layer-swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves performance comparable to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
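
The abstract does not say how "degree of specialization" is measured or how modules are selected versus blended, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' method. It assumes three checkpoints sharing one architecture (a base model, a language expert, and an English safety expert), measures specialization as each expert's L2 distance from the base for a given module, and uses a hypothetical `margin` threshold to decide between copying one expert's module and blending both. The function name `swap_layers`, the distance measure, and the threshold are all illustrative choices.

```python
from math import sqrt


def l2_distance(a, b):
    """Euclidean distance between two flat weight lists."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def swap_layers(base, lang_expert, safety_expert, margin=2.0):
    """Merge two experts module by module, without any training.

    Each model is a dict mapping a module name to a flat list of floats.
    ASSUMPTION: "specialization" is approximated by how far an expert's
    module has moved away from the shared base model.
    """
    merged = {}
    for name, base_w in base.items():
        lang_w = lang_expert[name]
        safe_w = safety_expert[name]
        d_lang = l2_distance(lang_w, base_w)
        d_safe = l2_distance(safe_w, base_w)

        if d_safe > margin * d_lang:
            # Module changed mostly during safety tuning: take it as-is.
            merged[name] = list(safe_w)
        elif d_lang > margin * d_safe:
            # Module changed mostly during language tuning: keep it.
            merged[name] = list(lang_w)
        else:
            # Both experts changed the module comparably: interpolate,
            # weighting each expert by its share of the total change.
            total = d_lang + d_safe
            alpha = 0.5 if total == 0 else d_safe / total
            merged[name] = [
                alpha * s + (1 - alpha) * l for s, l in zip(safe_w, lang_w)
            ]
    return merged


# Toy checkpoints with three two-parameter "modules" (made-up values).
base = {"layer0": [0.0, 0.0], "layer1": [0.0, 0.0], "layer2": [0.0, 0.0]}
lang = {"layer0": [1.0, 0.0], "layer1": [0.0, 0.1], "layer2": [1.0, 0.0]}
safe = {"layer0": [0.1, 0.0], "layer1": [0.0, 1.0], "layer2": [0.0, 1.0]}

merged = swap_layers(base, lang, safe)
```

In this toy run, `layer0` is kept from the language expert, `layer1` is taken from the safety expert, and `layer2` is an even blend of the two. With real checkpoints the same loop would run over tensors from each model's parameter dictionary rather than plain lists.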

