Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study
This addresses the challenge of making AI safer for users of low-resource languages, offering a scalable framework for multilingual safety alignment.
The paper tackled the problem of safety alignment for large language models in low-resource languages, specifically Singlish, by comparing methods like SFT, DPO, and KTO, resulting in a 99% reduction in toxicity and improved generalization to benchmarks like TOXIGEN.
Ensuring the safety of Large Language Models (LLMs) in diverse linguistic settings remains challenging, particularly for low-resource languages. Existing safety alignment methods are English-centric, limiting their effectiveness. We systematically compare Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO) for aligning SEA-Lion-v2.1-Instruct, a Llama 3-8B variant, to reduce toxicity in Singlish. Our results show that SFT+KTO achieves superior safety alignment with higher sample efficiency than DPO. Additionally, we introduce KTO-S, which enhances stability via improved KL divergence regularization. Our approach reduces Singlish toxicity by 99\%, generalizes to TOXIGEN, and maintains strong performance on standard LLM benchmarks, providing a scalable framework for safer AI deployment in multilingual contexts.