AI CL CY HC

Jun 17, 2024

STAR: SocioTechnical Approach to Red Teaming Language Models

Laura Weidinger, John Mellor, Bernat Guillen Pegueroles, Nahema Marchal, Ravin Kumar, Kristian Lum, Canfer Akbulut, Mark Diaz, Stevie Bergman, Mikel Rodriguez, Verena Rieser, William Isaac

arXiv:2406.11757v424.229 citations

Originality: Incremental advance

AI Analysis

This work addresses safety testing for large language models, which is crucial for developers and users. It appears incremental, however, because it builds on current best practices with specific enhancements.

The paper tackles safety red teaming of large language models by introducing STAR, a sociotechnical framework. STAR enhances steerability through parameterised instructions, and it improves signal quality by matching demographics and using arbitration. The result is improved coverage of the risk surface and more sensitive annotations without increased cost.

This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming the safety of large language models. STAR makes two key contributions. First, it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface. Parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel arbitration step to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
