Oct 16, 2024

SoK: Prompt Hacking of Large Language Models

arXiv:2410.13901v1
14 citations
h-index: 21
BigData
Originality: Incremental advance
AI Analysis

This work addresses security and reliability issues in LLM-based applications, offering incremental improvements in evaluation methods.

The paper tackles the problem of prompt hacking attacks on large language models (LLMs) by providing a systematic overview of three types (jailbreaking, leaking, injection) and proposing a novel framework that categorizes LLM responses into five classes, improving diagnostic precision for safety and robustness.

The safety and robustness of large language model (LLM)-based applications remain critical challenges in artificial intelligence. Among the key threats to these applications are prompt hacking attacks, which can significantly undermine the security and reliability of LLM-based systems. In this work, we offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection, addressing the nuances that differentiate them despite their overlapping characteristics. To enhance the evaluation of LLM-based applications, we propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification. This approach provides more granular insights into the AI's behavior, improving diagnostic precision and enabling more targeted enhancements to the system's safety and robustness.
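
To make the contrast between five-class and binary evaluation concrete, here is a minimal sketch in Python. The abstract does not name the five classes, so the labels below (`FULL_REFUSAL`, `OFF_TOPIC`, and so on), the `ATTACK_SUCCEEDED` mapping, and the `summarize` helper are all hypothetical placeholders rather than the paper's taxonomy. The sketch only illustrates how a finer label set can be tallied and then collapsed to the binary view for comparison.

```python
from collections import Counter
from enum import Enum


class ResponseClass(Enum):
    # Hypothetical labels for illustration only; the paper defines its own five classes.
    FULL_REFUSAL = 1
    PARTIAL_REFUSAL = 2
    OFF_TOPIC = 3
    PARTIAL_COMPLIANCE = 4
    FULL_COMPLIANCE = 5


# Assumed mapping onto the traditional binary view (attack succeeded or not).
ATTACK_SUCCEEDED = {ResponseClass.PARTIAL_COMPLIANCE, ResponseClass.FULL_COMPLIANCE}


def summarize(labels):
    """Return (five-class counts, binary counts) for a list of labelled responses."""
    fine = Counter(labels)
    binary = Counter(
        "success" if label in ATTACK_SUCCEEDED else "failure" for label in labels
    )
    return fine, binary


if __name__ == "__main__":
    labels = [
        ResponseClass.FULL_REFUSAL,
        ResponseClass.PARTIAL_REFUSAL,
        ResponseClass.OFF_TOPIC,
        ResponseClass.FULL_COMPLIANCE,
    ]
    fine, binary = summarize(labels)
    print(dict(binary))
    print({label.name: count for label, count in fine.items()})
```

Under these assumed labels, three different failure modes collapse into a single "failure" count in the binary view, while the five-class tally keeps them apart. That separation is the kind of diagnostic detail the abstract attributes to the framework.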

