CL Feb 11

Are Aligned Large Language Models Still Misaligned?

Usman Naseem, Gautam Siddharth Kashyap, Rafiq Ali, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Agrima Seth

arXiv:2602.11305v1 0.6 h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses the need for simultaneous evaluation of LLM misalignment across multiple dimensions, which is crucial for real-world applications. It is incremental, however, because it builds on existing benchmarks.

The paper tackles misalignment in Large Language Models (LLMs) by introducing Mis-Align Bench, a unified benchmark that evaluates misalignment across safety, value, and cultural dimensions simultaneously. It shows that single-dimension models achieve up to 97.6% Coverage but incur False Failure Rates above 50% and lower Alignment Scores of 63%-66% under joint conditions.

Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions, which prevents simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via a taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and by expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplication. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment across the three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate above 50% and a lower Alignment Score (63%-66%) under joint conditions.
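
The abstract mentions SimHash-based fingerprinting for keeping generated prompts free of duplicates but gives no implementation details. The following is a minimal sketch of how such a filter could look, not the authors' pipeline: the word-level tokenization, the 64-bit fingerprint width, the BLAKE2b token hash, the Hamming-distance threshold of 3, and the function names simhash, hamming_distance, and deduplicate are all illustrative assumptions.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (assumption: word-level features).

    `bits` must be a multiple of 8 and at most 512 (BLAKE2b digest limit).
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Stable per-token hash; Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        token_hash = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(prompts, max_distance: int = 3):
    """Keep a prompt only if its fingerprint is far from every kept one.

    The threshold is a hypothetical choice; the source does not state one.
    This linear scan is O(n^2) and is meant for illustration only.
    """
    kept = []  # list of (fingerprint, prompt) pairs
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming_distance(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, prompt))
    return [prompt for _, prompt in kept]
```

Because the fingerprint is built from lowercased word tokens, prompts that differ only in case or punctuation map to the same fingerprint and are collapsed into one. A dataset of the size reported here would need an index over fingerprints rather than the pairwise scan used in this sketch.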
