CL | AI | Dec 9, 2024

SafeWorld: Geo-Diverse Safety Alignment

arXiv:2412.06483v1 | 13 citations | h-index: 8 | Has Code | NIPS
Originality: Incremental advance
AI Analysis

This work addresses geo-diverse safety alignment for LLMs, a concern for both users and developers. It is an incremental advance that focuses on a previously overlooked aspect of safety.

The paper addresses LLM safety across geo-diverse cultural and legal standards. It introduces the SafeWorld benchmark, with 2,342 queries covering 50 countries, and proposes a DPO-based alignment training method. The resulting SafeWorldLM outperforms competing models, including GPT-4o, by a large margin on all three evaluation dimensions and achieves a nearly 20% higher winning rate in human evaluations.

In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing work often overlooks the geo-diversity of cultural and legal standards across the world. To demonstrate the challenges posed by geo-diverse safety standards, we introduce SafeWorld, a novel benchmark designed to evaluate LLMs' ability to generate responses that are not only helpful but also culturally sensitive and legally compliant across diverse global contexts. SafeWorld comprises 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. Building on this benchmark, we propose a multi-dimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. Our evaluations reveal that current LLMs struggle to meet these criteria. To improve LLMs' alignment with geo-diverse safety standards, we synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment training. The preference pairs are constructed to encourage LLMs to behave appropriately and to provide precise references to relevant cultural norms and policies when necessary. Our trained SafeWorldLM outperforms all competing models, including GPT-4o, on all three evaluation dimensions by a large margin. Global human evaluators also report a nearly 20% higher winning rate in helpfulness and harmfulness evaluations. Our code and data can be found here: https://github.com/PlusLabNLP/SafeWorld.
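The abstract names DPO as the alignment method but does not spell out the objective. As a rough illustration, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is not taken from the SafeWorld repository: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are all assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and the frozen reference model. `beta` is an assumed
    default, not a value reported in the abstract.
    """
    # Implicit rewards: how much more likely the policy makes each
    # response compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
loss_untrained = dpo_loss(-3.0, -3.0, -3.0, -3.0)   # policy equals reference
loss_preferring = dpo_loss(-1.0, -5.0, -3.0, -3.0)  # policy favours chosen
print(loss_untrained, loss_preferring)
```

When the policy matches the reference model the margin is zero, so the loss is log 2. It falls as the policy assigns relatively more probability to the chosen response, here presumably the one that behaves appropriately and cites the relevant norm or policy.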

Code Implementations: 1 repo