AI | CL | Jun 17, 2024

Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding

arXiv:2406.15481v3 | 33 citations
Originality: Incremental advance
AI Analysis

This paper addresses safety concerns for multilingual LLM users, though it appears incremental because it builds on existing red-teaming methods.

The paper evaluates LLM safety and finds that code-switching in red-teaming queries effectively elicits undesirable behaviors. It introduces CSRT, a framework that outperforms existing multilingual red-teaming techniques, achieving 46.7% more attacks than standard attacks in English.

As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors of LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and comprehensively investigate the safety and multilingual understanding of LLMs. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to ten languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more attacks than standard attacks in English and remaining effective in conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching texts. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts from monolingual data. Finally, we conduct detailed ablation studies exploring code-switching and point to an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.
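The abstract does not define how "46.7% more attacks" is computed. One plausible reading is a relative increase in attack success rate over an English baseline. The sketch below works through that arithmetic under that assumption. The function names and the toy judgement lists are hypothetical, chosen only for illustration, and are not results or code from the paper.

```python
def attack_success_rate(judgements):
    """Fraction of queries whose response was judged unsafe (True)."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(judgements) / len(judgements)


def relative_increase_pct(baseline, candidate):
    """Relative change of candidate over baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline rate is zero; relative increase is undefined")
    return (candidate - baseline) / baseline * 100.0


# Toy data (assumption): 100 judged responses per setting.
english = [True] * 20 + [False] * 80        # 20 of 100 judged unsafe
code_switched = [True] * 30 + [False] * 70  # 30 of 100 judged unsafe

asr_en = attack_success_rate(english)
asr_cs = attack_success_rate(code_switched)

relative_pct = relative_increase_pct(asr_en, asr_cs)
absolute_points = (asr_cs - asr_en) * 100.0

print(f"relative increase: {relative_pct:.1f}%")
print(f"absolute increase: {absolute_points:.1f} percentage points")
```

The same pair of rates gives very different headline numbers depending on whether the increase is reported in relative or absolute terms, so the exact definition should be checked against the paper before comparing figures.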

Code Implementations: 1 repo
