Dynamic Evaluation for Oversensitivity in LLMs
This addresses the issue of static benchmarks degrading over time for researchers and developers, offering a dynamic tool to monitor oversensitivity in LLMs, though it is incremental as it builds on existing evaluation methods.
The paper tackled the problem of oversensitivity in language models, where models defensively reject benign prompts, by developing a dynamic framework to generate model-specific challenging datasets, resulting in OVERBENCH, a benchmark with 450,000 samples from 25 models that captures emerging defensive patterns.
Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade overtime as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model's unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.