Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks
This work addresses security vulnerabilities in LLMs for AI practitioners, identifying strengths and weaknesses in current defenses. It is incremental, however, because it builds on existing adversarial testing methods.
This study evaluated the resilience of large language models (LLMs), namely Flan-T5, BERT, and RoBERTa-Base, against adversarial attacks. RoBERTa-Base and Flan-T5 maintained accuracy, with attack success rates of 0%, while BERT-Base's accuracy dropped from 48% to 3% under TextFooler, a 93.75% attack success rate.
This study evaluates the resilience of large language models (LLMs) against adversarial attacks, focusing on Flan-T5, BERT, and RoBERTa-Base. Using systematically designed adversarial tests built with TextFooler and BERTAttack, we found significant variation in model robustness. RoBERTa-Base and Flan-T5 demonstrated remarkable resilience, maintaining accuracy even when subjected to sophisticated attacks, with attack success rates of 0%. In contrast, BERT-Base showed considerable vulnerability: TextFooler achieved a 93.75% success rate, reducing model accuracy from 48% to just 3%. Our research reveals that while certain LLMs have developed effective defensive mechanisms, these safeguards often require substantial computational resources. This study contributes to the understanding of LLM security by identifying existing strengths and weaknesses in current safeguarding approaches, and it proposes practical recommendations for developing more efficient and effective defensive strategies.
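The reported BERT-Base figures are mutually consistent if the attack success rate is taken as the fraction of originally correct predictions that the attack flips. The sketch below checks that arithmetic. It assumes this definition, which the abstract does not state explicitly, and the function name `attack_success_rate` is hypothetical rather than part of the study's code.

```python
def attack_success_rate(original_acc_pct: float, attacked_acc_pct: float) -> float:
    """Fraction of originally correct predictions flipped by an attack.

    Assumption: only examples the model classified correctly before the
    attack count as attack attempts, so the rate is the relative drop in
    accuracy. Inputs are accuracies in percentage points.
    """
    if original_acc_pct <= 0:
        raise ValueError("original accuracy must be positive")
    if not 0 <= attacked_acc_pct <= original_acc_pct:
        raise ValueError("attacked accuracy must lie between 0 and the original accuracy")
    return (original_acc_pct - attacked_acc_pct) / original_acc_pct


# BERT-Base under TextFooler, using the accuracies reported in the abstract.
bert_rate = attack_success_rate(48, 3)
print(f"BERT-Base attack success rate: {bert_rate:.2%}")
```

Under the same assumed definition, a model whose accuracy is unchanged by the attack has a success rate of 0%, which matches the figures reported for RoBERTa-Base and Flan-T5.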