STAR: SocioTechnical Approach to Red Teaming Language Models
This work addresses safety testing for large language models, which is crucial for developers and users. The contribution appears incremental: it builds on current best practices for red teaming and adds specific enhancements.
The paper introduces STAR, a sociotechnical framework for red teaming large language models. STAR improves steerability through parameterized instructions, which broadens coverage of the risk surface at no increased cost, and improves signal quality through demographic matching and arbitration, which yields more sensitive annotations.
This research introduces STAR, a sociotechnical framework that improves on current best practices for safety red teaming of large language models. STAR makes two key contributions. First, it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface. Parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel arbitration step to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
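To make the two mechanisms concrete, here is a minimal Python sketch of what parameterised instructions and an arbitration step could look like. It is an illustration under stated assumptions, not the authors' implementation: the parameter names and values (`RULES`, `GROUPS`, `USE_CASES`), the instruction template, the two-annotator setup, and the `arbitrate` callback are all hypothetical and do not come from the source.

```python
import itertools
import random

# Hypothetical parameter values, chosen only for illustration;
# the source does not list the actual parameters STAR uses.
RULES = ["hate speech", "discriminatory stereotypes"]
GROUPS = ["group A", "group B", "group C"]
USE_CASES = ["information search", "creative writing"]

# Hypothetical instruction template (an assumption, not the paper's wording).
TEMPLATE = (
    "Try to get the model to break the rule on {rule}, "
    "targeting {group}, while using it for {use_case}."
)


def all_instructions():
    """Enumerate every parameter combination so coverage is explicit."""
    return [
        {"rule": r, "group": g, "use_case": u,
         "text": TEMPLATE.format(rule=r, group=g, use_case=u)}
        for r, g, u in itertools.product(RULES, GROUPS, USE_CASES)
    ]


def assign(instructions, n_tasks, seed=0):
    """Cycle through a shuffled list so each combination is sampled evenly."""
    rng = random.Random(seed)
    shuffled = instructions[:]
    rng.shuffle(shuffled)
    return [shuffled[i % len(shuffled)] for i in range(n_tasks)]


def label_with_arbitration(rating_a, rating_b, arbitrate):
    """Keep agreed ratings; pass disagreements to an arbitrator callback.

    Returns (final_label, was_arbitrated). The flag keeps the disagreement
    visible in the data instead of discarding it as noise.
    """
    if rating_a == rating_b:
        return rating_a, False
    return arbitrate(rating_a, rating_b), True
```

Because each instruction is a point in a known parameter grid, gaps in coverage and clusters of model failures can be read off per rule, group, or use case. The arbitration function assumes two initial ratings and a third decision only on disagreement; how STAR actually selects annotators and arbitrators is not specified in this summary.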