CL · Apr 12, 2025

Feature-Aware Malicious Output Detection and Mitigation

arXiv:2504.09191v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses safety concerns for LLM users by mitigating jailbreak risks, though it is incremental, building on existing detection and mitigation techniques.

The paper tackles the problem of LLMs generating malicious content despite fine-tuning. It proposes a feature-aware method that detects harmful features during decoding and adaptively adjusts the model to reject such outputs. The authors report effectiveness across multiple models and attacks while preserving standard generation capabilities.

The rapid advancement of large language models (LLMs) has brought significant benefits to various domains while introducing substantial risks. Despite being fine-tuned through reinforcement learning, LLMs lack the capability to discern malicious content, which limits their defense against jailbreak attacks. To address these safety concerns, we propose a feature-aware method for harmful response rejection (FMM), which detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. Using a simple discriminator, we detect potentially malicious traits during the decoding phase. Upon detecting features indicative of toxic tokens, FMM regenerates the current token. Through activation patching, an additional rejection vector is incorporated during subsequent token generation, steering the model towards a refusal response. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques, while crucially maintaining the models' standard generation capabilities.
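The detect, regenerate, and steer loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear probe standing in for the "simple discriminator", the three-token vocabulary, the precomputed per-step feature vectors, and every name below (`guarded_decode`, `REJECT_VEC`, and so on) are assumptions made purely for illustration. In a real model the features would come from hidden activations that depend on the generated prefix.

```python
from typing import List

Vector = List[float]

# Toy setup (all values hypothetical): each hidden dimension maps to one token.
VOCAB = ["ok", "harm", "sorry"]
UNEMBED = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
PROBE_W, PROBE_B = [0.0, 1.0, 0.0], -0.5   # fires when the "harm" feature > 0.5
REJECT_VEC = [0.0, 0.0, 2.0]               # pushes activations toward "sorry"


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def is_malicious(h: Vector) -> bool:
    """Linear probe standing in for the discriminator on the feature space."""
    return dot(PROBE_W, h) + PROBE_B > 0.0


def pick_token(h: Vector) -> int:
    """Greedy decoding: index of the highest-scoring token."""
    scores = [dot(row, h) for row in UNEMBED]
    return max(range(len(scores)), key=scores.__getitem__)


def guarded_decode(step_features: List[Vector]) -> List[int]:
    """Decode one token per feature vector, steering after a detection."""
    tokens: List[int] = []
    steering = False
    for h in step_features:
        if not steering and is_malicious(h):
            # Detection: the token this activation would have produced is
            # never emitted; it is regenerated from the patched activation.
            steering = True
        if steering:
            # Patch the activation by adding the rejection vector.
            h = [x + r for x, r in zip(h, REJECT_VEC)]
        tokens.append(pick_token(h))
    return tokens


if __name__ == "__main__":
    attack = [[1.0, 0.1, 0.0], [0.2, 0.9, 0.0], [0.3, 0.8, 0.0]]
    print([VOCAB[t] for t in guarded_decode(attack)])  # ['ok', 'sorry', 'sorry']
```

In this toy, benign inputs never trigger the probe, so their output is unchanged, which loosely mirrors the claim that standard generation is preserved. How the rejection vector is obtained and which layer is patched are not specified in the text above and are left out here.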

