CLApr 18, 2024

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

arXiv:2404.12038v572 citationsh-index: 7Has CodeNIPS
Originality Incremental advance
AI Analysis

This work addresses safety risks for users of LLMs by revealing vulnerabilities, though it is incremental as it builds on existing attack methods with a new interpretation framework.

The authors tackled the problem of safety vulnerabilities in large language models (LLMs) by introducing a Safety Concept Activation Vector (SCAV) framework to guide attacks, resulting in an average attack success rate of 99.14% across seven open-source LLMs.

Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks with automatically selected perturbation hyperparameters. Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that our generated attack prompts may be transferable to GPT-4, and the embedding-level attacks may also be transferred to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks present in current LLMs. For example, in our evaluation of seven open-source LLMs, we observe an average attack success rate of 99.14%, based on the classic keyword-matching criterion. Finally, we provide insights into the safety mechanism of LLMs. The code is available at https://github.com/SproutNan/AI-Safety_SCAV.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes