CL Dec 22, 2024

Robustness of Large Language Models Against Adversarial Attacks

arXiv:2412.17011v1
23 citations
h-index: 7
2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC)
Originality: Synthesis-oriented
AI Analysis

This work addresses the security of LLMs deployed in user-facing applications, but it is incremental: it applies known attack methods to new models.

The study evaluated the robustness of the GPT family of LLMs against adversarial attacks. It found significant variation in vulnerability to character-level attacks, tested on sentiment classification datasets, and to semantic-level attacks based on jailbreak prompts.

The increasing deployment of Large Language Models (LLMs) in various applications necessitates a rigorous evaluation of their robustness against adversarial attacks. In this paper, we present a comprehensive study of the robustness of the GPT family of LLMs. We employ two distinct evaluation methods to assess their resilience. The first method introduces character-level text attacks into input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2. The second method uses jailbreak prompts to challenge the safety mechanisms of the LLMs. Our experiments reveal significant variations in the robustness of these models, demonstrating varying degrees of vulnerability to both character-level and semantic-level adversarial attacks. These findings underscore the need for improved adversarial training and enhanced safety mechanisms to bolster the robustness of LLMs.
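To make the idea of a character-level text attack concrete, here is a minimal Python sketch. It is an illustration only, not the paper's method: the abstract does not say which perturbations were used, so the edit operations (swap, delete, duplicate), the function names `perturb_word` and `char_level_attack`, and the `rate` and `seed` parameters are all assumptions.

```python
import random


def perturb_word(word: str, rng: random.Random) -> str:
    """Apply one random interior edit (swap, delete, or duplicate) to a word.

    Hypothetical helper: the first and last characters are preserved so the
    word usually stays readable to a human.
    """
    if len(word) < 4:
        return word  # too short to edit without touching the first/last character
    i = rng.randrange(1, len(word) - 2)  # interior index; i + 1 is also interior
    op = rng.choice(("swap", "delete", "duplicate"))
    if op == "swap":
        # Transpose two adjacent interior characters.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        # Drop one interior character.
        return word[:i] + word[i + 1:]
    # Duplicate one interior character.
    return word[:i] + word[i] + word[i:]


def char_level_attack(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Perturb roughly `rate` of the whitespace-separated words in `text`."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    words = text.split()
    return " ".join(
        perturb_word(w, rng) if rng.random() < rate else w for w in words
    )


if __name__ == "__main__":
    review = "this movie was absolutely wonderful"
    print(char_level_attack(review, rate=1.0, seed=1))
```

In a robustness evaluation of this kind, the perturbed review would be placed in the sentiment classification prompt in place of the original, and the model's predictions on clean and perturbed inputs compared.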

