An Auditing Test To Detect Behavioral Shift in Language Models
This work addresses the need for ongoing monitoring of language model behavior to ensure continued alignment with societal values, though the contribution is incremental because it builds on existing hypothesis-testing methods.
The authors tackle the problem of detecting unintended behavioral changes in language models after fine-tuning or deployment modifications. They propose a continual Behavioral Shift Auditing (BSA) test that compares generations from the model under scrutiny with those of a baseline model, and they show that it detects shifts in toxicity and translation performance using only hundreds of examples.
As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model's behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Building on recent work in hypothesis testing, our auditing test detects behavioral shifts solely through model generations. Our test compares model generations from a baseline model to those of the model under scrutiny and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
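As a rough illustration of how a sequential test with a tolerance parameter can flag a shift from generations alone, here is a minimal sketch. It is not the authors' test. It assumes that each generation has already been scored in [0, 1] by some behavior function (for example a toxicity classifier), that the shift is measured as a difference in mean scores, and that a simple fixed-bet wealth process is acceptable. The names `sequential_shift_test`, `epsilon`, `alpha`, and `bet` are hypothetical.

```python
import random


def sequential_shift_test(baseline_scores, candidate_scores,
                          epsilon=0.05, alpha=0.05, bet=0.5):
    """Return the 1-based sample index at which a shift is flagged, or None.

    Illustrative assumptions (not taken from the paper):
      * scores lie in [0, 1] and arrive as i.i.d. pairs (baseline, candidate);
      * null hypothesis: |E[candidate] - E[baseline]| <= epsilon;
      * a fixed bet size is used instead of a learned betting strategy.
    """
    if not 0 < bet <= 1 / (1 + epsilon):
        # This bound keeps every wealth factor non-negative.
        raise ValueError("bet must lie in (0, 1 / (1 + epsilon)]")

    wealth_up = 1.0    # bets that the candidate scores higher than baseline
    wealth_down = 1.0  # bets that the candidate scores lower than baseline
    threshold = 1.0 / alpha

    for t, (a, b) in enumerate(zip(baseline_scores, candidate_scores), start=1):
        d = b - a  # paired score difference in [-1, 1]
        # Under the null, each factor has expectation at most 1.
        wealth_up *= 1 + bet * (d - epsilon)
        wealth_down *= 1 + bet * (-d - epsilon)
        # The average of two supermartingales is again a supermartingale.
        if 0.5 * (wealth_up + wealth_down) >= threshold:
            return t
    return None


if __name__ == "__main__":
    random.seed(0)
    baseline = [random.uniform(0.0, 0.2) for _ in range(500)]
    shifted = [random.uniform(0.3, 0.7) for _ in range(500)]
    print("flagged at sample:", sequential_shift_test(baseline, shifted))
```

Under the stated null, the averaged wealth is a non-negative supermartingale starting at 1, so by Ville's inequality it reaches `1 / alpha` with probability at most `alpha` at any stopping time. That is what allows the test to be checked after every new pair of generations without inflating the false positive rate. Increasing `epsilon` makes the test tolerate larger differences before it raises an alarm.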