AI · Mar 17

MOSAIC: Composable Safety Alignment with Modular Control Tokens

arXiv:2603.16210 66.21 citations h-index: 9
AI Analysis

MOSAIC addresses the need for flexible safety alignment in LLMs across varying real-world contexts. It is an incremental improvement over existing methods.

The paper tackles context-dependent safety rules in large language models. It proposes MOSAIC, a modular framework that uses learnable control tokens for compositional safety alignment. The authors report strong defense performance with lower over-refusal while preserving model utility.

Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
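
The composition step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each control token is a single learned embedding vector that is prepended to the prompt embeddings of a frozen model, and the constraint names, `EMBED_DIM`, and all function names are hypothetical. Training (order-based task sampling and the distribution-level objective) is not shown, since the abstract does not specify it.

```python
import random

EMBED_DIM = 4  # hypothetical embedding width, chosen only for illustration


def make_token_bank(constraints, dim=EMBED_DIM, seed=0):
    """Create one embedding vector per safety constraint.

    In a real system these would be the only trainable parameters, with the
    backbone kept frozen; here they are just randomly initialised (assumption).
    """
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in constraints}


def compose_prefix(bank, active):
    """Return the control-token embeddings for the active constraints, in order."""
    missing = [name for name in active if name not in bank]
    if missing:
        raise KeyError(f"unknown constraints: {missing}")
    return [bank[name] for name in active]


def build_inputs(prefix, prompt_embeddings):
    """Prepend the control tokens to the prompt embeddings (assumed placement)."""
    return prefix + prompt_embeddings


# Hypothetical constraint names; any subset can be activated at inference time.
bank = make_token_bank(["no_medical_advice", "no_weapons"])
prompt = [[0.0] * EMBED_DIM, [1.0] * EMBED_DIM]  # stand-in for embedded prompt tokens
inputs = build_inputs(compose_prefix(bank, ["no_weapons"]), prompt)
```

Because the backbone is untouched, switching the active set only changes which vectors are prepended. No model weights are swapped.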

