Knowledge Graph Guided Evaluation of Abstention Techniques
This work supports safer deployment of language models by providing a benchmark for evaluating abstention techniques, although the contribution is incremental in that it builds on prior safety-testing methods.
The paper evaluates how effectively language models abstain from responding to inappropriate requests. It introduces SELECT, a benchmark built from benign concepts in a knowledge graph, and finds that current techniques achieve abstention rates above 80% on the target concepts, but that these rates drop by 19% for descendants of the target concepts.
To deploy language models safely, it is crucial that they abstain from responding to inappropriate requests. Several prior studies test the safety promises of models based on their effectiveness in blocking malicious requests. In this work, we focus on evaluating the underlying techniques that cause models to abstain. We create SELECT, a benchmark derived from a set of benign concepts (e.g., "rivers") from a knowledge graph. Focusing on benign concepts isolates the effect of safety training, and grounding these concepts in a knowledge graph allows us to study the generalization and specificity of abstention techniques. Using SELECT, we benchmark different abstention techniques over six open-weight and closed-source models. We find that the examined techniques indeed cause models to abstain with over $80\%$ abstention rates. However, these techniques are not as effective for descendants of the target concepts, where abstention rates drop by $19\%$. We also characterize the generalization-specificity trade-offs for different techniques. Overall, no single technique is invariably better than others, and our findings inform practitioners of the various trade-offs involved.
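To make the measurement described above concrete, the following is a minimal sketch of how abstention rates might be computed for a target concept, its descendants, and unrelated concepts. It is an illustration under stated assumptions, not the SELECT implementation: the toy taxonomy `CHILDREN`, the helper names (`descendants`, `abstention_rate`, `evaluate`), and the boolean `abstained` map are all hypothetical, and the benchmark's actual metric definitions may differ.

```python
from collections import deque

# Hypothetical toy taxonomy (parent -> children); not taken from the paper.
CHILDREN = {
    "rivers": ["Nile", "Amazon"],
    "Nile": ["Blue Nile"],
    "mountains": ["Everest"],
}


def descendants(concept, children=CHILDREN):
    """Return every concept below `concept`, in breadth-first order."""
    seen = []
    queue = deque(children.get(concept, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(children.get(node, []))
    return seen


def abstention_rate(concepts, abstained):
    """Fraction of `concepts` for which the model abstained.

    `abstained` maps a concept name to True if the model refused a
    query about it (an assumed, simplified per-concept outcome).
    """
    if not concepts:
        return 0.0
    return sum(abstained[c] for c in concepts) / len(concepts)


def evaluate(target, abstained, children=CHILDREN):
    """Abstention rates on the target, its descendants, and all other concepts."""
    desc = descendants(target, children)
    all_nodes = set(children) | {c for kids in children.values() for c in kids}
    unrelated = sorted(all_nodes - {target, *desc})
    return {
        "target": abstention_rate([target], abstained),            # effectiveness
        "descendants": abstention_rate(desc, abstained),           # generalization
        "unrelated": abstention_rate(unrelated, abstained),        # lower is more specific
    }


# Made-up outcomes for illustration only.
outcomes = {
    "rivers": True, "Nile": True, "Amazon": False,
    "Blue Nile": False, "mountains": False, "Everest": False,
}
report = evaluate("rivers", outcomes)
```

In this reading, a technique generalizes well when the `descendants` rate stays close to the `target` rate, and it is specific when the `unrelated` rate stays low. The gap between the first two numbers corresponds to the kind of drop the abstract reports for descendants.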