CLMay 23, 2024

MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability

arXiv:2405.14488v116 citationsh-index: 13Has Code
Originality Incremental advance
AI Analysis

This addresses safety concerns for users deploying open-sourced LLMs in applications, but it is incremental as it builds on existing defense strategies with a novel balancing mechanism.

The paper tackles the problem that existing defense strategies for large language models (LLMs) often lead to rejection-oriented responses, diminishing usability for benign instructions, by introducing the MoGU framework, which uses dynamic routing between safe and usable LLM variants to enhance safety while preserving usability, resulting in safer versions of models like Llama2 and Vicuna.

Large Language Models (LLMs) are increasingly deployed in various applications. As their usage grows, concerns regarding their safety are rising, especially in maintaining harmless responses when faced with malicious instructions. Many defense strategies have been developed to enhance the safety of LLMs. However, our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance, thereby diminishing the usability of their responses to benign instructions. To solve this problem, we introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability. Our MoGU framework transforms the base LLM into two variants: the usable LLM and the safe LLM, and further employs dynamic routing to balance their contribution. When encountering malicious instructions, the router will assign a higher weight to the safe LLM to ensure that responses are harmless. Conversely, for benign instructions, the router prioritizes the usable LLM, facilitating usable and helpful responses. On various open-sourced LLMs, we compare multiple defense strategies to verify the superiority of our MoGU framework. Besides, our analysis provides key insights into the effectiveness of MoGU and verifies that our designed routing mechanism can effectively balance the contribution of each variant by assigning weights. Our work released the safer Llama2, Vicuna, Falcon, Dolphin, and Baichuan2.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes