CL, AI, HC, LG | May 29, 2025

OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

arXiv:2505.23856v1 | 7 citations | h-index: 22 | Has Code
Originality: Highly original
AI Analysis

OMNIGUARD addresses safety concerns for users of large language models by improving the detection of attacks that use low-resource languages or non-text modalities. It is a strong, specific gain rather than a foundational advance.

The paper tackles the detection of harmful prompts across languages and modalities for AI safety moderation. It improves classification accuracy over the strongest baseline by 11.57% in multilingual settings and by 20.44% for image-based prompts, sets a new SOTA for audio-based prompts, and runs about 120 times faster than the next fastest baseline.

The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (≈120× faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
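
The two steps in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: the alignment score (mean cosine similarity of matched prompt pairs minus that of mismatched pairs), the nearest-centroid classifier, the function names, and the hand-written 2-d "embeddings" standing in for real per-layer LLM activations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def alignment_score(src, tgt):
    """Hypothetical score: matched-pair similarity minus mismatched-pair similarity."""
    n = len(src)
    matched = sum(cosine(src[i], tgt[i]) for i in range(n)) / n
    mismatched = [cosine(src[i], tgt[j]) for i in range(n) for j in range(n) if i != j]
    return matched - sum(mismatched) / len(mismatched)

def select_layer(src_layers, tgt_layers):
    """Step (i): return the index of the best-aligned layer and all layer scores."""
    scores = [alignment_score(s, t) for s, t in zip(src_layers, tgt_layers)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores

def fit_centroids(embeddings, labels):
    """Step (ii), simplified: one mean vector per label (nearest-centroid classifier)."""
    groups = {}
    for vec, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Toy data: src_layers[l][i] is the layer-l embedding of prompt i in the source
# language; tgt_layers[l][i] is the same prompt in another language or modality.
src_layers = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],    # layer 0: language-specific
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],   # layer 1: aligned across languages
]
tgt_layers = [
    [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]],
]
train_labels = [1, 0, 0]  # 1 = harmful, 0 = benign (source-language labels only)

best, scores = select_layer(src_layers, tgt_layers)
centroids = fit_centroids(src_layers[best], train_labels)
preds = [predict(centroids, vec) for vec in tgt_layers[best]]
print(best, preds)
```

On this toy data the sketch selects layer 1 and, although it is trained only on source-language embeddings, labels the target-side prompts as [1, 0, 0]. That is the intuition behind a language-agnostic classifier: if a layer places translations of the same prompt close together, a classifier fit on one language can transfer to others. In a real system the embeddings would come from the model's forward pass, which is also why reusing them adds little extra inference cost.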

Code Implementations: 1 repo