LLM-Safety Evaluations Lack Robustness
This paper addresses methodological problems in LLM safety research, which matters to both researchers and practitioners. Its contribution is incremental, however: it critiques existing practices rather than introducing new techniques.
The paper identifies that current safety alignment evaluations for large language models suffer from noise and inconsistencies in datasets, methods, and evaluation setups, which hinder fair comparison and slow progress. It proposes guidelines to reduce these issues and improve comparability in future research.
In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.
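To make the small-dataset concern concrete, the following is a minimal sketch, not taken from the paper, of how much uncertainty surrounds an attack success rate measured on a few dozen prompts. It uses the standard Wilson score interval for a binomial proportion. The function names, the prompt counts, and the success counts are all hypothetical and chosen only for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical numbers: two attacks evaluated on the same 50 harmful prompts.
attack_a_small = wilson_interval(30, 50)  # 60% observed success rate
attack_b_small = wilson_interval(24, 50)  # 48% observed success rate

# The same observed rates on a hypothetical 1000-prompt dataset.
attack_a_large = wilson_interval(600, 1000)
attack_b_large = wilson_interval(480, 1000)

print("n=50:  ", attack_a_small, attack_b_small,
      "overlap:", intervals_overlap(attack_a_small, attack_b_small))
print("n=1000:", attack_a_large, attack_b_large,
      "overlap:", intervals_overlap(attack_a_large, attack_b_large))
```

Under these assumed counts, the two intervals overlap at 50 prompts but separate at 1000, so a 12-point gap that looks decisive on a small benchmark may not be distinguishable from sampling noise. Overlapping intervals are only a rough heuristic and not a formal significance test. This sketch also covers only sampling noise from dataset size, not the other sources the paper discusses, such as judge unreliability or generation settings.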